BioKIC / NEON-Biorepository

Development base for the NEON Biorepository Data Portal hosted by BioKIC - Arizona State University (https://biorepo.neonscience.org)
GNU General Public License v2.0

Taxa to sample relationships cannot be determined in microalgae samples #456

Open kyule opened 6 months ago

kyule commented 6 months ago

Similar to previously reported issues with the macroinvertebrate samples, due to how the data are published in the API, it is not possible for us to determine which taxa go with which samples.

For example, see the sample hierarchy associated with parent sample BLWA.20191028.EPIXYLON.7 (sample tree data from sample explorer available here: neon-samples (5).csv )

The samples that we have at the NEON Biorepository (or will have in the future) in this hierarchy are BLWA.20191028.EPIXYLON.7.TAXONOMY.FD (freeze dried), BLWA.20191028.EPIXYLON.7.TAXONOMY.PRES (chemically preserved), BLWA.20191028.EPIXYLON.7.TAXONOMY.SLIDE1 (slide), BLWA.20191028.EPIXYLON.7.TAXONOMY.SLIDE2 (slide).

The data are presented differently when there was a different contractor. E.g., see SUGG.20150714.EPIPHYTON.4 (neon-samples (6).csv).

I had thought that the samples were split: some were freeze dried, some were chemically preserved, and some were placed on a slide, and then those on the slides were identified. Now I'm wondering whether all of the taxa from the field sample are identified and then randomly distributed across the different samples (although I'm not sure how that would work, since only diatoms should be on the slides). If so, the intended strategy of associating IDs with the slides does not work.

Reporting here, but I'm guessing that this will take some meetings to work out with NEON.

kyule commented 4 months ago

Pasting in email conversation to keep info together here:

Kelsey:

The less great news is that in trying to test these developments, I've been looking more in depth at the microalgae samples in particular. It looks like we may have a similar issue as with the macroinvertebrates, where it is not necessarily possible to determine which taxa are actually associated with an individual sample. The interpretation is more difficult because the data seem to be slightly differently formatted between Drexel, Kociolek, and EcoAnalysts samples (e.g. are there taxa directly associated with slides at all -- it's possible there is an issue but it's not universal?). I've attached one example sample tree associated with Biorepo samples from each of the different contractors. It would be really helpful if someone (I'm guessing that's Steph) could look through each of these 3 examples and advise what the relationship between taxa and samples (slides, freeze dried, chemically preserved) should be. Then we can collectively determine whether there is actually an issue with how they are being presented and/or if there are adjustments we can make on either side to make it work.

Kociolek_FLNT.20200915.EPIPSAMMON.4 (1).csv

EcoAnalysts_BLWA.20191028.EPIXYLON.7 (2).csv

Drexel_SUGG.20150714.EPIPHYTON.4 (1).csv

Steph:

For these we are actually able to tell with a little information from the data on the NEON portal, unlike macroinvertebrates. I don't fully know how Ed is harvesting these, and I know that information exists in two different tables (ptx_taxonomy and ptx_archive), so if there isn't enough of a connection between the two we might be able to do something to help with that.

Looking at the attached files here's what I see:

alg_domainLab_in.sampleIDchem is not a sampleClass that goes to ASU

Nothing comes from the freeze dried vial, that's excess material that is being archived but would be most similar to the taxa from the slides

alg_domainLab_in.sampleIDtaxonomy & analysis_type = diatom slide >> comes from the slide

alg_domainLab_in.sampleIDtaxonomy & analysis_type = soft algae >> comes from the preserved vial
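A minimal sketch of how those rules might be applied when linking taxonomy records to Biorepository samples. The function name, the argument names, and the exact string values are assumptions for illustration; the real column names and values should be checked against the published tables.

```python
from typing import Optional

# sampleClass values quoted in the thread above
TAXONOMY_CLASS = "alg_domainLab_in.sampleIDtaxonomy"
CHEM_CLASS = "alg_domainLab_in.sampleIDchem"


def taxa_source(sample_class: str, analysis_type: str) -> Optional[str]:
    """Return the archived fraction that a taxonomy record came from.

    Hypothetical helper: returns "slide", "preserved vial", or None when
    the record cannot be tied to a fraction held at the Biorepository.
    """
    if sample_class == CHEM_CLASS:
        # This sampleClass does not go to ASU.
        return None
    if sample_class == TAXONOMY_CLASS:
        kind = analysis_type.strip().lower()
        if kind == "diatom slide":
            return "slide"
        if kind == "soft algae":
            return "preserved vial"
    # Anything else is left unassigned. The freeze dried vial is never
    # returned because no identifications come from it directly.
    return None


print(taxa_source(TAXONOMY_CLASS, "diatom slide"))
print(taxa_source(TAXONOMY_CLASS, "soft algae"))
```

Under these rules the freeze dried vial gets no determinations of its own; whether it should inherit the slide taxa as a "most similar" list is a separate decision.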

Kelsey:

This is very helpful and may make it possible for us to pull in the info for many samples. How can we solve instances in which it looks like multiple slides (see EcoAnalysts example) or multiple chemically-preserved samples (see Drexel example) were created from the same parent though?

Steph:

Well, neither lab should have been using multiple preserved fractions or slides, that is not something we are expecting. For the Drexel example, I don't know why they had multiple preserved fractions, they shouldn't have. It looks like they recorded them in the slideID field in the ptx_taxonomy_in table along with each scientificName though, so data can be tracked that way. We stopped working with them in 2018, so that would be a workaround that doesn't continue for other labs. For the EcoAnalysts slide example, I have no idea why there are 2 slides. They must have made 2 and only used one. Eco is famously bad for sampleIDs, and I see that they entered slideID = "8070.39-03" with the taxonomy data then sent you 2 totally different strings for the slideIDs, BLWA.20191028.EPIXYLON.7.TAXONOMY.SLIDE1 and BLWA.20191028.EPIXYLON.7.TAXONOMY.SLIDE2. Is one of them diamond scribed with a starting point or a line and one not? Julian, maybe we can ask EcoAnalysts what is going on with this? Again, this will not persist into the future since we are re-working the lab SOP.
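A rough sketch of the slideID workaround described above: group each scientificName by the slideID recorded with it, then flag any slideID that does not match a sample ID held at the Biorepository. The row layout, helper names, and taxon names are placeholders, not real data; only the slideID and sample ID strings come from this thread.

```python
from collections import defaultdict


def taxa_by_slide_id(rows):
    """Group scientificName values by the slideID recorded with them."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["slideID"]].append(row["scientificName"])
    return dict(grouped)


def unmatched_slide_ids(grouped, biorepo_sample_ids):
    """Return slideIDs that match no sample ID held at the Biorepository."""
    return sorted(set(grouped) - set(biorepo_sample_ids))


# Placeholder taxonomy rows (taxon names are invented for illustration)
rows = [
    {"slideID": "8070.39-03", "scientificName": "taxon A"},
    {"slideID": "8070.39-03", "scientificName": "taxon B"},
]
biorepo_ids = [
    "BLWA.20191028.EPIXYLON.7.TAXONOMY.SLIDE1",
    "BLWA.20191028.EPIXYLON.7.TAXONOMY.SLIDE2",
]

grouped = taxa_by_slide_id(rows)
unmatched = unmatched_slide_ids(grouped, biorepo_ids)
print(unmatched)
```

For the EcoAnalysts example this flags "8070.39-03" as unmatched, which is the case that still needs input from the lab; for Drexel-style records, where the slideID holds the preserved fraction ID, the grouping alone may be enough.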

kyule commented 1 month ago

Consider determinations aspect in the context of the subsampler.