gbif / doc-publishing-dna-derived-data

This guide shows how to publish DNA-derived spatiotemporal biodiversity data and make it discoverable through national and global biodiversity data discovery platforms. Based on experiences from Australia, Norway, Sweden, UNITE, and GBIF.
https://doi.org/10.35035/doc-vf1a-nr22

Incorporation of control data #162

Open CecSve opened 2 years ago

CecSve commented 2 years ago

Based on https://github.com/iobis/Project-team-Genetic-Data/issues/8.

We currently do not have any recommendations on how to treat controls in the publication process. It should be possible to report sequences found in blanks or positive controls, but it should also be very clear to data users that these 'occurrences' were found in laboratory controls! For example, we should avoid adding the location of the lab or other potentially misleading supplementary data to the occurrence record.

For now, we can only encourage users to provide extensive metadata on how controls were used in the study.

CecSve commented 2 years ago

@jenast do you include controls when you publish metabarcoding data from the Malaise traps to GBIF?

dschigel commented 2 years ago

True, we don't yet, and we need to consider carefully whether we should. There are different tasks mixed here: reporting your study procedures as accurately as possible, and reporting evidence of species presence / absence / abundance in nature. Controls are only part of the former, not of the latter. In a way this information is as much metadata as e.g. PCR details or primers: it helps to figure out how the detection was carried out, but not what was detected. In a nutshell, my hunch is that we may consider adding optional fields to capture control-related information, but never mix it with field detections (whatever the outcome of the controls). Do you feel the same way?

CecSve commented 2 years ago

In a nutshell, my hunch is that we may consider adding optional fields to capture control-related information, but never mix it with field detections (whatever the outcome of the controls). Do you feel the same way?

Yes, I agree it should not be mixed with field detections, but the way controls are dealt with may vary quite a bit between studies, and it would be great if we could capture that variation in a meaningful way.

Off the top of my head, I can think of three control scenarios, though more may exist:

As mentioned, reporting on the different controls would be challenging and highly dependent on the study and research field (details should be included in the metadata). The suggestions below are just to get the conversation started so we can figure out how best to incorporate such data. For example:

Contaminants in extractions

Issue: one extraction blank may serve multiple samples. One or more blanks are used for each extraction 'event', which could be the extraction of several samples on a given day, so contaminants arising from this step may affect multiple samples.

Recommendation: the taxa should be reported as contaminants of all samples extracted that day, OR they could be removed from the affected samples (a practice used in microbiology as far as I know), and this should ideally be reported in the metadata. This would require that the publisher keeps track of which samples were extracted together, for example by registering the extraction date.
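To make the bookkeeping above a bit more concrete, here is a minimal sketch, assuming every sample and blank is registered with an extraction date. All field names (`sample_id`, `extraction_date`, `taxa`, `contaminants`) and both helper functions are hypothetical; they are not Darwin Core or MIxS terms and not an agreed recommendation.

```python
from collections import defaultdict

# Hypothetical records: the keys are illustrative only.
samples = [
    {"sample_id": "S1", "extraction_date": "2022-06-01",
     "taxa": {"Aedes", "Culex", "Homo sapiens"}},
    {"sample_id": "S2", "extraction_date": "2022-06-01", "taxa": {"Culex"}},
    {"sample_id": "S3", "extraction_date": "2022-06-02", "taxa": {"Homo sapiens"}},
]
blanks = [
    {"blank_id": "B1", "extraction_date": "2022-06-01", "taxa": {"Homo sapiens"}},
]


def contaminants_by_batch(blanks):
    """Collect the taxa detected in blanks, keyed by extraction date."""
    by_date = defaultdict(set)
    for blank in blanks:
        by_date[blank["extraction_date"]] |= blank["taxa"]
    return by_date


def flag_or_remove(samples, blanks, remove=False):
    """Annotate each sample with the contaminants found in blanks from the
    same extraction date; optionally remove those taxa from the sample."""
    by_date = contaminants_by_batch(blanks)
    result = []
    for sample in samples:
        hits = sample["taxa"] & by_date.get(sample["extraction_date"], set())
        taxa = sample["taxa"] - hits if remove else set(sample["taxa"])
        result.append({**sample, "taxa": taxa, "contaminants": hits})
    return result


flagged = flag_or_remove(samples, blanks)               # report as contaminants
cleaned = flag_or_remove(samples, blanks, remove=True)  # remove from samples
```

Either way, the `contaminants` set stays with the sample, so what was flagged or removed can still be described in the metadata.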

Contaminants in DNA amplification

Issue: basically the same issue as with extractions, BUT it typically relates to PCR runs of 8-strips (x1-12) or 96-well plates, so the samples covered by this contaminant check may differ from the set of samples that were extracted together.

Recommendation: same recommendation as above. Again, the publisher would need to keep track of exactly which samples were amplified together with the PCR negative.

Positive controls

Issue: basically the same issue as with extractions: the positive control is usually included for multiple samples processed at the same time.

Recommendation: I am not sure how this one should be handled. I only have experience with positive controls in a 'mock community' sense, where some of my mixed samples were identified prior to amplification. But as mentioned, the way people use positive controls differs between studies, if they use them at all.

(Just pasting this in here so we remember the good suggestion: https://github.com/iobis/Project-team-Genetic-Data/issues/8#issuecomment-1150297781)

jenast commented 1 year ago

Hi,

I would have to check with Frode Fossøy to be sure. But I think we do:

Some sort of negative check in the lab. Not sure if it's "extractions blank" or just PCR negative, but we do at least one of the two. I assume a failure here would warrant a rerun of the processing to get a valid result. We also spike our samples with a known amount of alien species as a positive test. We don't consider this a requirement for a valid result, as we know we struggle to find some of the spiked species. Anyway, we don't include these findings in the results.

We don't have a full "field negative", where we fill up a bottle, take it to the field, unscrew caps, add labels, top it up with ethanol and bring it back again.

I'm not sure of the best way to include information about these things in a standardized form in GBIF. I guess we would describe this in the metadata in text form.

Still waiting for time to update our export with some updated metadata and possibly some new use of Darwin Core terms.

/Jens

On Thu, Jun 9, 2022 at 12:45 PM CecSve @.***> wrote:

Just on top of my head, I can think of three control scenarios, however it is not standardised across studies (yet, I think):

  • extractions blank (processed under same conditions as source material, but should contain no significant DNA conc.)
  • PCR negative (no DNA added, biology grade water added)
  • PCR positive (e.g mock communities, spike-in's etc.)

All three could be optional fields at the sample level and include the number of reads, or perhaps a ratio of the number of reads to total sample reads (http://rs.tdwg.org/dwc/terms/sampleSizeValue), although the ratio could be left for the user to calculate. I'm not sure how to link different reads from various controls at a sample level - I have to look at some data to visualise a model layout.
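As a rough illustration of the read-count idea, here is a small sketch of how such a ratio could be calculated per sample. The record layout and the key names (`total_reads`, `extraction_blank_reads`, `pcr_negative_reads`, `pcr_positive_reads`) are assumptions made up for this example, not existing Darwin Core or extension terms.

```python
def control_read_ratio(control_reads, total_sample_reads):
    """Return control reads as a fraction of total sample reads.

    Returns None when the total is zero or negative, since the ratio
    is undefined in that case.
    """
    if total_sample_reads <= 0:
        return None
    return control_reads / total_sample_reads


# Hypothetical sample-level record with one read count per control type.
sample = {
    "sample_id": "S1",
    "total_reads": 20000,
    "extraction_blank_reads": 50,
    "pcr_negative_reads": 0,
    "pcr_positive_reads": 1000,
}

control_keys = ("extraction_blank_reads", "pcr_negative_reads", "pcr_positive_reads")
ratios = {
    key: control_read_ratio(sample[key], sample["total_reads"])
    for key in control_keys
}
```

Publishing the raw counts and leaving this calculation to the data user would avoid having to agree on a single definition of the ratio.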


CecSve commented 1 year ago

Thanks Jens - for now, describing it in the metadata is a good option. Good luck with the updates!

ymgan commented 1 year ago

Coming back to this because I saw the same issue was raised again in a Slack workspace. The new MIxS release has the terms

neg_cont_type MIXS:0001321 with enumeration

| Value | Meaning | Description |
|---|---|---|
| distilled water | None | |
| phosphate buffer | None | |
| empty collection device | None | |
| empty collection tube | None | |
| DNA-free PCR mix | None | |
| sterile swab | None | |
| sterile syringe | None | |

As well as pos_cont_type MIXS:0001322

From what I understand, these two fields are inherited by all MIxS checklists, just like samp_name, project_name, and experimental_factor, which are already in GBIF's DNA-derived data extension.
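If these terms were adopted, a publisher could check a `neg_cont_type` value before upload. This is only a sketch: the set below is copied from the enumeration listed above, and whether MIxS also accepts other values or free text is not covered here. The constant and function names are hypothetical.

```python
# Values copied from the neg_cont_type enumeration quoted in this comment.
NEG_CONT_TYPE_VALUES = {
    "distilled water",
    "phosphate buffer",
    "empty collection device",
    "empty collection tube",
    "DNA-free PCR mix",
    "sterile swab",
    "sterile syringe",
}


def is_listed_neg_cont_type(value):
    """Return True if the value, ignoring surrounding whitespace,
    is one of the enumeration values listed above."""
    return value.strip() in NEG_CONT_TYPE_VALUES


print(is_listed_neg_cont_type("DNA-free PCR mix"))
print(is_listed_neg_cont_type("tap water"))
```

The match is case-sensitive on purpose, so a value such as `dna-free pcr mix` would be rejected rather than silently normalised.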

I guess my question is: has there been a conversation between GBIF, folks from MIxS, and the Sustainable DarwinCore MIxS Interoperability Task Group about this? I believe that folks from MIxS must have encountered similar issues, and I feel that the comments above are conversation-worthy!

Thanks a lot!