incorporating upper/lower counts and percentages in dataset mappings

edenny commented 9 months ago

In mapping the in situ datasets, I need some help with remembering our guidelines for including the lower/upper bounds along with the traits. I think a value is required and we did this for both the NPN and PEP. @jdeck88 can you point me again to where the mapping files for those are located so I can see what we did?

I think for both the NPN and PEP we used lower/upper percentages, but could we use lower/upper count instead of percent if it is more relevant?

If they don’t already exist, I’d like to create some guidelines/best practices for how to apply these (e.g. if the protocol says “less than 5%” then use 4 for an upper percent value).
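As a rough illustration of the kind of rule proposed above, the sketch below turns a protocol phrase such as "less than 5%" into an inclusive upper bound of 4. The function name `inclusive_upper_bound`, the `step` parameter, and the assumption that bounds are recorded as whole numbers are all hypothetical. None of this is an agreed guideline.

```python
import re
from typing import Optional

# Matches phrases like "less than 5%" or "Less than 25 %".
_LESS_THAN = re.compile(r"less than\s+(\d+)\s*%?", re.IGNORECASE)


def inclusive_upper_bound(protocol_text: str, step: int = 1) -> Optional[int]:
    """Return the largest whole value allowed by a "less than N" phrase.

    Assumes bounds are stored as whole numbers, so "less than 5%" becomes 4.
    Returns None when the text contains no "less than N" phrase.
    """
    match = _LESS_THAN.search(protocol_text)
    if match is None:
        return None
    limit = int(match.group(1))
    # Never go below zero, since zero is reserved for absence.
    return max(limit - step, 0)


print(inclusive_upper_bound("less than 5%"))
```

A protocol that reports values at a finer resolution than whole numbers would need a different `step`, which is one reason a written guideline would help.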

jdeck88 commented 8 months ago

See: https://github.com/biocodellc/ppo-data-pipeline/blob/master/projects/npn/intensity_values.csv

edenny commented 8 months ago

@jdeck88 This is a mapping for NPN intensity measures, which I am pretty sure we never implemented. We only imported the yes/no data. There are these files, but I think they are just the mappings Kjell and I created:

https://github.com/biocodellc/ppo-data-pipeline/blob/master/projects/npn/phenophase_descriptions.csv
https://github.com/biocodellc/ppo-data-pipeline/blob/master/projects/pep725/phenophase_descriptions.csv

I thought when we were in Daijiang's lab last April you showed me a file of the actual data, or at least a sample of it, with the corresponding trait AND the upper or lower value assigned for each record. There was one for NPN and one for PEP. That's what I was trying to find again. But it's quite possible I was misunderstanding/misremembering what I was looking at!

edenny commented 8 months ago

We discussed and clarified this in the team meeting: there are data properties for lower percent, upper percent, lower count, and upper count, and at least one of these needs to be populated. For absence, lower percent/count is "0". For presence, any value above 0 is converted to "1" programmatically.
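The two rules from the meeting can be sketched as a small check plus a conversion. This is only a sketch: the field names in `BOUND_FIELDS` and both function names are made up for illustration and are not taken from the pipeline code.

```python
from typing import Mapping, Optional

# Hypothetical field names for the four bound data properties.
BOUND_FIELDS = ("lower_percent", "upper_percent", "lower_count", "upper_count")


def check_bounds_populated(record: Mapping[str, Optional[float]]) -> None:
    """Raise ValueError unless at least one of the four bounds has a value."""
    if all(record.get(field) is None for field in BOUND_FIELDS):
        raise ValueError(
            "at least one of %s must be populated" % ", ".join(BOUND_FIELDS)
        )


def to_presence_flag(value: float) -> int:
    """Collapse a bound to presence/absence: 0 stays 0, anything above 0 becomes 1."""
    return 1 if value > 0 else 0


absence = {"lower_percent": 0}
presence = {"lower_count": 12}

check_bounds_populated(absence)
check_bounds_populated(presence)
print(to_presence_flag(absence["lower_percent"]), to_presence_flag(presence["lower_count"]))
```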

For this round of the Phenobase project we are still primarily concerned only with presence/absence, so we don't need to incorporate more precise lower/upper bounds. However, as I go through the effort of carefully mapping in situ datasets, I would like to include the lower/upper bounds reflected in the collection methods so that neither I nor anyone else needs to go back and sift through the methods all over again.

To this end, I will add four fields to my mappings for lower percent, upper percent, lower count, and upper count, and fill them in as appropriate. In the process I will outline for myself a set of "best practices" for populating those fields. When John applies the mapping for ingesting the dataset in the near term, all bounds >0 will be converted to "1". But the precise values will be in place in the mapping if and when the time comes to incorporate them.
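A minimal sketch of how such a mapping might look and be applied, assuming a CSV layout with four extra bound columns. The column names, the trait labels, and the three rows are invented for illustration and do not come from any real mapping file. The precise bounds stay in the mapping, and the collapse to "1" happens only on the copy prepared for ingest.

```python
import csv
import io

# Hypothetical column names for the four bound fields.
BOUND_COLUMNS = ["lower_percent", "upper_percent", "lower_count", "upper_count"]

# Invented example rows; blank cells mean the bound does not apply.
MAPPING_CSV = """phenophase,trait,lower_percent,upper_percent,lower_count,upper_count
no open flowers,open flowers absent,0,,0,
fewer than 5% of flowers open,open flowers present,1,4,,
3 to 10 open flowers,open flowers present,,,3,10
"""


def load_mapping(csv_text):
    """Read mapping rows, turning bound cells into floats (or None if blank)."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = dict(raw)
        for col in BOUND_COLUMNS:
            cell = raw[col].strip()
            row[col] = float(cell) if cell else None
        rows.append(row)
    return rows


def for_ingest(row):
    """Return a copy with every populated bound collapsed to 0 or 1."""
    out = dict(row)
    for col in BOUND_COLUMNS:
        value = row[col]
        if value is not None:
            out[col] = 1 if value > 0 else 0
    return out


mapping = load_mapping(MAPPING_CSV)
for row in mapping:
    print(row["phenophase"], [for_ingest(row)[col] for col in BOUND_COLUMNS])
```

Because `for_ingest` works on a copy, the original rows keep their precise values for later use.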

PlantPhenoOntology / ppo

incorporating upper/lower counts and percentages in dataset mappings #78