Phenobase / phenobase_data

Questions on provided data structure #2

Closed jdeck88 closed 1 month ago

jdeck88 commented 1 month ago

The following are questions about particular column names provided in Daijiang's sample data:

Questions on provided column names  
file_name: I'm not going to be tracking file names locally. What matters is observed_image_guid and observed_image_url only.
photo_id: Is this the same as observed_image_guid?
observation_uuid: Is this the same as occurrenceID?
inat_URL: Not sure we have a designation for this. What community standard vocabulary term is this? Maybe also occurrenceID?
taxon_id: Do we need this? taxon_id according to whom?
observed_on: Is this verbatim_date?
name: scientific_name?
family_id: family_id according to whom? Not sure this is necessary.
count_family: Are we tracking this for the interface? I didn't have this in the model.
prediction_prob: ?
prediction_class: ?
equivocal: The field I am using is certainty, which contains the values "Equivocal" and "Unequivocal".
proportion_equivocal_family: ?
accuracy_excluding_equivocal_family: ?
accuracy_family: ?
genus: I thought we were just asking for scientific_name and family, not tracking genus specifically.
   

@daijiang @rdinnager

robgur commented 1 month ago

John we need to talk it over on chat because there are issues here beyond what you have listed. Here is a clear listing of what each field means and whether we NEED it (REQUIRED) or not. Please, we need the fields listed as required and you should ask questions here re: what is confusing. One issue is name and rank, which is a little different from other data we've ingested.

(likely delete) file_name: the photo file name, a concatenation of the photo_id and the image type (.jpg or .jpeg). Maybe we can delete this field since it is linked directly to photo_id.

(REQUIRED) photo_id: The photo_id assigned by iNaturalist. A single iNaturalist observation (i.e. one observation_uuid) can have multiple photo_ids.

(I think keep but see note) observation_uuid: iNaturalist's observation id. We can maybe delete this since it doesn't itself resolve to anything.

(REQUIRED) inat_URL: This is basically "https://www.inaturalist.org/observations/" + "observation_uuid". This will resolve to a record.

(REQUIRED) latitude: same definition as DwC:latitude

(REQUIRED) longitude: same definition as DwC:longitude

positional_accuracy: This is mapped to DwC:coordinateUncertaintyInMeters so we may as well use that mapping and definition too.

(likely delete) taxon_id: an iNaturalist internal identifier for a taxon. We can likely drop this field.

observed_on: Mapped to dwc:eventDate and may as well use that mapping here.

(REQUIRED) name: The taxonomic name of the identified occurrence to the lowest level so far obtained.

(REQUIRED) rank: The taxonomic rank of the identification, often but not always species.

(likely delete) family_id: an iNaturalist internal identifier for family names. Can likely delete.

(REQUIRED) family: the name of occurrence at the taxon rank of family according to iNaturalist

(REQUIRED) count_family: The number of records that were annotated at the taxon rank of family associated with the identified occurrence

(REQUIRED) trait: the name of the trait being classified by ML model. For the first tranche of data, this is either "flower" or "fruit".

(REQUIRED) prediction_prob: "A score from 0-1 which represent a scaled probability of presence or absence of a trait"

(REQUIRED) prediction_class: a categorical description of the annotation outcome, in this case either "detected" or "not detected"

(REQUIRED) equivocal: A categorical description of whether the annotation was equivocal or unequivocal. For now all the data ingested into Phenobase is "unequivocal".

(REQUIRED) proportion_equivocal_family: The number of equivocal annotations divided by the number of all annotations for iNaturalist observations in the taxonomic family. A metric of how successful the annotation process was for observations in that family.

(REQUIRED) accuracy_excluding_equivocal_family: A metric of accuracy for the annotations of only unequivocal records, based on expert, gold standard observations.

(REQUIRED) accuracy_family: A metric of accuracy for the annotations of all records (equivocal and unequivocal), based on expert, gold standard observations.

(REQUIRED) genus: the name of occurrence at the taxon rank of genus according to iNaturalist
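Some of the fields defined above are derived rather than observed, so a small sketch may help pin the definitions down. The Python below is only an illustration, not the pipeline's actual code: the function names, the plain-dict record layout, and the `expert_class` key holding the gold standard label are all assumptions made for this example, and the sample records are made up. It builds inat_URL from observation_uuid as described above and computes the family-level metrics. Reading count_family as "number of annotated records in the family" is also an assumption.

```python
from collections import defaultdict

INAT_OBSERVATION_BASE = "https://www.inaturalist.org/observations/"


def build_inat_url(observation_uuid):
    """inat_URL as defined above: base URL + observation_uuid."""
    return INAT_OBSERVATION_BASE + observation_uuid


def _accuracy(subset):
    """Share of records whose prediction matches the expert label."""
    if not subset:
        return None  # no records, so accuracy is undefined
    correct = sum(r["prediction_class"] == r["expert_class"] for r in subset)
    return correct / len(subset)


def family_metrics(records):
    """Return {family: metrics} for a list of annotation records.

    Each record is a dict with the keys family, prediction_class,
    equivocal, and expert_class (hypothetical gold standard label).
    """
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec["family"]].append(rec)

    metrics = {}
    for family, recs in by_family.items():
        unequivocal = [r for r in recs if r["equivocal"] == "unequivocal"]
        metrics[family] = {
            "count_family": len(recs),  # assumed meaning, see lead-in
            "proportion_equivocal_family": (len(recs) - len(unequivocal)) / len(recs),
            "accuracy_family": _accuracy(recs),
            "accuracy_excluding_equivocal_family": _accuracy(unequivocal),
        }
    return metrics


# Made-up records for illustration only.
records = [
    {"family": "Rosaceae", "prediction_class": "detected",
     "equivocal": "unequivocal", "expert_class": "detected"},
    {"family": "Rosaceae", "prediction_class": "not detected",
     "equivocal": "unequivocal", "expert_class": "detected"},
    {"family": "Rosaceae", "prediction_class": "detected",
     "equivocal": "equivocal", "expert_class": "not detected"},
    {"family": "Rosaceae", "prediction_class": "detected",
     "equivocal": "unequivocal", "expert_class": "detected"},
]
rosaceae = family_metrics(records)["Rosaceae"]
```

With these four records, one of four annotations is equivocal, two of four predictions match the expert label, and two of the three unequivocal predictions match.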

jdeck88 commented 1 month ago

I updated columns.csv with the additional fields and added your definitions. Also added a column called required with a boolean value of TRUE if required and FALSE if not.
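For anyone wanting to use the required flag programmatically, here is a rough sketch of how a record could be checked against it. The layout of columns.csv is not shown in this thread, so the header names `column` and `required` and the excerpt below are assumptions, as are the function names.

```python
import csv
import io

# Hypothetical excerpt; the real columns.csv may use different headers.
COLUMNS_CSV = """column,required
photo_id,TRUE
inat_URL,TRUE
latitude,TRUE
longitude,TRUE
taxon_id,FALSE
"""


def required_columns(csv_text):
    """Names of the columns whose required flag is TRUE."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["column"] for row in reader
            if row["required"].strip().upper() == "TRUE"]


def missing_required(record, required):
    """Required fields that are absent or empty in a record (a dict)."""
    return [name for name in required if record.get(name) in (None, "")]


required = required_columns(COLUMNS_CSV)
# Made-up record that lacks inat_URL.
missing = missing_required(
    {"photo_id": "123", "latitude": "29.6", "longitude": "-82.3"}, required
)
```

Here `missing` ends up listing only inat_URL, since the other required fields are present and taxon_id is optional.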

A few questions based on your comments:

robgur commented 1 month ago

can we please call the field inat_url observed_url? or at least something more generic... assuming there will be other data to ingest besides inat... -- YES SOUNDS GOOD

on the suggested field equivocal, let's call this field "certainty" - YES but running a quick check with Russell, we might want to change the controlled vocab for that field to "certain" and "uncertain" not "equivocal" and "unequivocal"

What event does eventDate (observed_on) refer to exactly? Date a photo was taken or date that we interpreted that photo - Date photo was taken. Some people load their photos many days or years later on iNat so we want the date it was observed, not uploaded.

do you have a GUID for each machine level observation - no let's put that into the datastore though - good idea. I can ask Russell to mint a GUID for each machine level observation.
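The thread does not say how these GUIDs would be minted. One common option, sketched here purely as an assumption, is a random version 4 UUID per machine-level observation, generated once and stored with the record. The function name is hypothetical.

```python
import uuid


def mint_observation_guid():
    """Return a new random (version 4) UUID as a string."""
    return str(uuid.uuid4())


guid = mint_observation_guid()
```

Random UUIDs need no central counter, but they are not reproducible, so the value has to be stored at ingest time rather than recomputed later.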

jdeck88 commented 1 month ago

i updated https://github.com/Phenobase/phenobase_data/blob/main/data/columns.csv with the latest feedback.

Just when i'm getting used to equivocal and unequivocal you want to change them??

anyway, let me know if any other fields/information needs to change on the field definition list before i close this comment.

robgur commented 1 month ago

Is model_uri an identifier for the model and individual_id effectively an identifier for the annotation record? Just trying to make sure I understand. I think all else looks ok here. And we can go with "equivocal" and "unequivocal" if you prefer. I just went with words that seemed simpler for users to grasp immediately

jdeck88 commented 1 month ago

Yes on model_uri. individual_id, I think, is for when we're looking at different photos of the same plant, so we can track the same plant across time. OK to change the wording; it's just that I've grown fond of sounding smarter than I am by saying things like "unequivocally".

robgur commented 1 month ago

Works for me John. We might want to add a unique identifier to each record (eg row) as well, just to be safe. It seems to always be a good idea.

jdeck88 commented 1 month ago

closing this issue as we parsed this out into sub-issues

rdinnager commented 1 month ago

on the suggested field equivocal, let's call this field "certainty" - YES but running a quick check with Russell, we might want to change the controlled vocab for that field to "certain" and "uncertain" not "equivocal" and "unequivocal"

A bit late to this, but I'm okay with the equivocal field being called 'certainty'. I'm just not sure I am comfortable calling a model prediction 'certain'. What if the values for certainty were just 'high' and 'low'? Equivocal is low certainty, unequivocal is high certainty.

I suppose we will also have to change the field 'proportion_equivocal_family' to 'proportion_low_certainty_family', and 'accuracy_excluding_equivocal_family' to 'accuracy_excluding_low_certainty_family'?

Maybe this should be a new issue?
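To make the proposal concrete, here is a minimal sketch of what the rename would look like when applied to a record. It reflects only the suggestion in this comment, not an agreed schema, and the function name and dict-based record are assumptions.

```python
# Value mapping proposed above: equivocal -> low, unequivocal -> high.
CERTAINTY_FROM_EQUIVOCAL = {"equivocal": "low", "unequivocal": "high"}

# Field renames proposed above.
FIELD_RENAMES = {
    "equivocal": "certainty",
    "proportion_equivocal_family": "proportion_low_certainty_family",
    "accuracy_excluding_equivocal_family": "accuracy_excluding_low_certainty_family",
}


def apply_certainty_proposal(record):
    """Return a copy of record with the proposed names and values."""
    out = {}
    for key, value in record.items():
        if key == "equivocal":
            # Normalise case so "Unequivocal" and "unequivocal" both map.
            value = CERTAINTY_FROM_EQUIVOCAL[value.strip().lower()]
        out[FIELD_RENAMES.get(key, key)] = value
    return out


# Made-up record for illustration.
converted = apply_certainty_proposal(
    {"photo_id": "1", "equivocal": "Unequivocal", "proportion_equivocal_family": 0.1}
)
```

Keeping the two mappings as plain dictionaries means that changing the vocabulary again (say, to "certain" and "uncertain") would only touch the tables, not the function.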