Phenobase / phenobase_data

Add GUID field to iNat observation records. What does it mean? #4

Closed ramonawalls closed 1 month ago

ramonawalls commented 1 month ago

From email thread:

Re: GUID, RPG: I have a technical question that maybe Ramona can help with too. I would prefer the GUID to be defined as an annotation identifier. Basically an identifier linking to the outcome of an annotation process. I think this is better than an identifier linked to each row of our data, which mixes fields from iNat with annotation data to ease use.

I am not following the need here.

@jdeck88: can you provide some clarity on what you intended the GUID field to contain?

@robgur: When you say "an identifier linking to the outcome of an annotation process" what is it linked to?

jdeck88 commented 1 month ago

The GUID would be a unique identifier for the machine annotation event on the trait. So, each time we look at an image it would get a GUID that records this particular event. I don't think we need to be provided with a resolvable ID, it could in fact just be a UUID. Then the UUID can be resolved through our interface... this would be our link in the interface to view all the metadata surrounding the plant, the image, the source, and the machine interpretation process.

The other advantage here, and maybe the biggest reason I asked for it, is to track which records have already been loaded. If I get a file with GUIDs that have already been loaded, I can recognize it as an update operation rather than an insert operation.
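A minimal sketch of the insert-versus-update check described above. The function name, the `guid` field name, and the idea of holding already-loaded identifiers in a set are assumptions for illustration, not the actual loader:

```python
def split_inserts_and_updates(incoming_rows, loaded_ids, id_field="guid"):
    """Partition incoming rows by whether their GUID has been loaded before.

    incoming_rows: iterable of dicts, each carrying a GUID under id_field.
    loaded_ids: set of GUIDs already present in the database (hypothetical).
    """
    inserts, updates = [], []
    for row in incoming_rows:
        if row[id_field] in loaded_ids:
            updates.append(row)   # GUID seen before: treat as an update
        else:
            inserts.append(row)   # unseen GUID: treat as a new record
    return inserts, updates
```

A loader could call this once per incoming file and route the two lists to separate update and insert paths.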

robgur commented 1 month ago

Agree John.

ramonawalls commented 1 month ago

Thanks, John. I agree about the importance of including this field.

Having the ID represent the instance of the planned process that assigns a phenological to a plant based on machine learning on the image of a plant makes sense, but I think it conflicts with "the UUID... would be our link in the interface to view all the metadata surrounding the plant, the image, the source, and the machine interpretation process." The latter sounds like a UUID for a unique record in our database. If there is a one to one correspondence between a unique record in our database and an instance of the planned process, then they can have the same UUID. We just need to be very clear about how we define the field, because that relationship could change in the future, e.g., if we add some other analysis process for each image, and there are multiple instances of that process for each ML process.

jdeck88 commented 1 month ago

OK, I just listened to the Zoom recording from today's meeting! Will keep my response succinct here.

Agreed that we need a UUID for the "planned process of annotating an image" and that this is distinct from the observation_url from iNat and the photoID.

Also, I don't need any other record-level identifier here.

rdinnager commented 1 month ago

Sorry if I am not understanding this but is the ID for the 'planned process of annotating an image' the same as a combination of the photo_id and the model_uri? That is, each unique combination of photo_id and model_uri, a photo that has been annotated with a particular model version?

Will this need to be generated by the machine annotation script? That is, it needs to be in the file (or json) for ingestion?

jdeck88 commented 1 month ago

@rdinnager it may not be that simple: what if there is a photo that has multiple plants on it, or you use the same model_uri to look at multiple traits? Maybe we need to refine this definition to be more explicit about what we mean by annotating an image? @ramonawalls or @robgur ?

robgur commented 1 month ago

@jdeck88 @rdinnager @ramonawalls, the identifier we are minting is described by Ramona the way I understand it: "the ID represent the instance of the planned process that assigns a phenological [annotation] to a [target] plant [occurrence/observation] based on machine learning on [an image containing the] plant". I added or slightly modified the words in brackets. I think the end result is that every annotation just has an annotation_id that can be a UUID. @jdeck88 because this is always tied back to a plant occurrence, I think it is ok to not worry about the complexities you raise above. There would be a different annotation_id for flower and fruit annotations of the same plant. In the case of multiple plants, there is a target plant defined by the observer, which is the target of the ML process. If someone wants to post the same plant photo and target a different plant or insect, it's really a different observation record.

rdinnager commented 1 month ago

Okay, I think I get it now. But how do I generate such a unique ID during the annotation? This is something I have not done before.

robgur commented 1 month ago

I think it may be as simple as making a new column called "annotation_id" and generating a UUID for each record in our annotation outputs that is placed in that field. Since each row is capturing annotation outputs, this should work fine.
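As a rough sketch of that suggestion, assuming the annotation outputs are available as a list of dicts (the sample rows and the helper name are hypothetical), Python's standard `uuid` module can fill an `annotation_id` column with one random UUID per row:

```python
import uuid


def add_annotation_ids(rows, id_field="annotation_id"):
    """Return copies of the rows, each with a fresh random (version 4) UUID."""
    return [{**row, id_field: str(uuid.uuid4())} for row in rows]


# Hypothetical annotation outputs: one row per trait annotation.
rows = [
    {"photo_id": "1", "trait": "flower"},
    {"photo_id": "1", "trait": "fruit"},
]
annotated = add_annotation_ids(rows)
```

Each row gets its own identifier, so flower and fruit annotations of the same photo end up with different `annotation_id` values.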

rdinnager commented 1 month ago

I think it may be as simple as making a new column called "annotation_id" and generating a UUID for each record in our annotation outputs that is placed in that field. Since each row is capturing annotation outputs, this should work fine.

This sounds fine for the first set of records but I'm thinking more about when I start doing more annotation runs. I'll have to know what GUIDs are already in the database in order to generate new GUIDs, and I will need a way to figure out what the GUID for an existing photo annotation is if I need to update the record for some reason. So presumably I will need to query the API for this @jdeck88? That is, will the API have a method to generate new unique IDs? Perhaps if we are using sequential integers then it would just have to be able to return the maximum value currently in the database? And for updating, I presumably can just do a lookup on the photo_id, model_uri and trait combinations and get the existing GUID? This might all be standard stuff but I'm just new to it, so just trying to figure out what the normal conventions and best practices are.
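The lookup part of this question could be sketched as follows. This is an in-memory illustration only: it assumes existing records have been exported as dicts with `photo_id`, `model_uri`, `trait`, and `annotation_id` keys, and it does not reflect how the actual API works. Both function names are hypothetical.

```python
def build_annotation_index(existing_rows):
    """Map (photo_id, model_uri, trait) to the existing annotation_id."""
    return {
        (r["photo_id"], r["model_uri"], r["trait"]): r["annotation_id"]
        for r in existing_rows
    }


def find_annotation_id(index, photo_id, model_uri, trait):
    """Return the existing annotation_id for this combination, or None."""
    return index.get((photo_id, model_uri, trait))
```

If the same combination appears more than once in the existing records, the last one wins in this sketch, so it only fits the case where old annotation sets are replaced rather than kept side by side.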

robgur commented 1 month ago

@jdeck88 you may have the best perspective on the issues and edges here. From a pure informatics perspective, a new annotation run means new annotation_ids, even for the same photo/observation, but I do worry about the "consumer" perspective - it is likely that any time we re-run annotations, we are also going to remove the old annotation set so that the "best" annotations are presented to users and we don't have to worry about confusion re: multiple annotations from different runs associated with the same image.

rdinnager commented 1 month ago

So I am gathering that if I generate a UUID using an algorithm that is (nearly) guaranteed to generate a unique ID every time, then I don't need to know what IDs are already in use, because there is very little chance I will generate a new one that has already been used?

robgur commented 1 month ago

That is 100% correct.
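To put a number on "very little chance", here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes version 4 UUIDs, which carry 122 random bits, drawn from a good random source; the function name is hypothetical.

```python
import math

UUID4_RANDOM_BITS = 122  # a version 4 UUID has 122 random bits


def collision_probability(n_ids, bits=UUID4_RANDOM_BITS):
    """Approximate chance that at least two of n_ids random IDs collide.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    pairs = n_ids * (n_ids - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / 2 ** bits)


p_billion = collision_probability(10 ** 9)
```

Even for a billion annotations the result is far below one in a billion billion, which is why there is no need to check which IDs are already in use before generating new ones.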

ramonawalls commented 1 month ago

It seems we have reached consensus on what the GUID field represents. Given the constrained usage, I suggest slight tweaks to the label and definition:

label: machine_learning_annotation_id (If that is too many characters, maybe ml_annotation_id, but the latter is not as obvious to outsiders.)

definition: A globally unique identifier for an instance of a planned process (http://purl.obolibrary.org/obo/COB_0000082) that assigns a phenological trait annotation to a target plant observation based on machine learning on an image that contains an image of the plant or part of the plant.

I will create a PR on https://github.com/Phenobase/phenobase_data/blob/main/data/columns.csv with these changes and ask @robgur and @jdeck88 to review.

ramonawalls commented 1 month ago

I just discovered I can't do the PR, because I don't have an IRI for the new property. @jdeck88 can you create one on biscicol and update the dictionary?

jdeck88 commented 1 month ago

I updated the dictionary with the new definition. @rdinnager you can just change the column name annotation_id to machine_learning_annotation_id.
I'm going to close this issue since it looks like we have it!