Deduplicate entries before generating DwCA

ManonGros commented 3 years ago

How can we ensure that we don't have these duplicates (https://github.com/gbif/portal-feedback/issues/2064)? From what I can see on the issue, there are two solutions:

giving preference to Sample for sample-linked records and using set masters
cluster by prefix which is 4 chars or longer

Although I am not sure which prefix we are talking about

I don't see any sample identifier in the API call given as an example: https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=100&query=geo_box1(-90%2C-180%2C90%2C180)&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex

In this example, I suspect that the first four results refer to the same organism. For example:

These have the same organism, the same location, scientific name and associated publication, but one is a sequence of ghbA2 mRNA for giant hemoglobin A1 globin chain and the other a sequence of ghbA2 mRNA for giant hemoglobin A2 globin chain. If I add “sample_accession”, it wouldn’t help in this case (https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=100&query=geo_box1(-90%2C-180%2C90%2C180)&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex,sample_accession) since they didn’t fill in this field.

I guess we could also group the data by date, location, scientific name? In most cases, this would solve the issue.

thomasstjerne commented 3 years ago

I suggest that we cluster on location, date and sequence_md5 and keep the count of each sequence_md5. The count should be written to the field organismQuantity and accompanied by organismQuantityType: "DNA sequence reads"

This will produce some duplicates, but at most one per ASV. And publishing data at the ASV level is actually encouraged in the guide Publishing DNA-derived data through biodiversity data platforms.

In cases where a sample_accession is present, pre-grouping on this attribute would allow us to add the sampleSize, but we could look into that later on.

timrobertson100 commented 3 years ago

I suggest that we cluster on location, date and sequence_md5 and keep the count of each sequence_md5.

I am not sure I follow this. Do you mean group by location and date and count the distinct MD5 perhaps?

Where do occurrenceQuantity and occurrenceQuantityType come from please? (they are not DwC)

thomasstjerne commented 3 years ago

Where do occurrenceQuantity and occurrenceQuantityType come from please? (they are not DwC)

Sorry, I meant organismQuantity and organismQuantityType.

timrobertson100 commented 3 years ago

Thanks.

organismQuantity and organismQuantityType are for quantifying the number/amount of the organism/mass. If we're trying to capture metadata about the number of reads of the sequence made, perhaps the measurementsAndFacts extension would be a better fit?

...cluster on location, date and sequence_md5 and keep the count of each sequence_md5

I'm sorry, I still don't understand the proposal here. If you cluster (group by?) the sequence_md5, how do you track a count of it? Was the intention to group on location and date and count distinct sequences? If so, I think we'd need smarter grouping behaviour, as there will be null values in both fields, I think. Perhaps grouping on the IDs used would be an option?

thomasstjerne commented 3 years ago

I am not sure I follow this. Do you mean group by location and date and count the distinct MD5 perhaps?

I mean SELECT location, date, sequence_md5, COUNT(*) as organismQuantity FROM ...... GROUP BY location, date, sequence_md5;

I.e. one unique sequence per day per lat/lon. And of course only where location and date are not null. (I don't think sequence_md5 could ever be null.)

You will find duplicates of the same sequence on the same date and same lat/lon. We want those collapsed into one occurrence to avoid this.
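
For anyone who wants to try the grouping outside a database, here is a minimal Python sketch of the same GROUP BY. It assumes the records are dicts keyed by the ENA field names used in the API calls in this thread (location, collection_date, sequence_md5). The function name collapse_by_sequence is made up for illustration, and skipping records that lack any of the three values is an assumption, not something the adapter is known to do.

```python
from collections import Counter


def collapse_by_sequence(records):
    """Collapse records sharing location, collection_date and sequence_md5.

    Hypothetical helper: mirrors
    SELECT location, date, sequence_md5, COUNT(*) ... GROUP BY ...
    """
    counts = Counter()
    for rec in records:
        location = rec.get("location")
        date = rec.get("collection_date")
        md5 = rec.get("sequence_md5")
        # Assumption: records missing any grouping value are left out.
        if not location or not date or not md5:
            continue
        counts[(location, date, md5)] += 1

    return [
        {
            "location": location,
            "collection_date": date,
            "sequence_md5": md5,
            "organismQuantity": n,
            "organismQuantityType": "DNA sequence reads",
        }
        for (location, date, md5), n in counts.items()
    ]
```

Each output row stands for one sequence per day per lat/lon, with the number of collapsed records kept in organismQuantity.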

mike-podolskiy90 commented 3 years ago

@thomasstjerne There are some records with no sequence_md5: https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&format=json&limit=100&query=accession=%22AB536497%22&fields=accession,location,country,identified_by,collected_by,collection_date,specimen_voucher,sequence_md5,scientific_name,tax_id,altitude,sex

By date, do you mean collection_date?

timrobertson100 commented 3 years ago

By date, do you mean collection_date?

Yes.

My understanding from discussions offline yesterday: Because there is no way to categorically identify the records pertaining to one sequencing run, we're aiming to group by location+date and then will capture the following:

taxonID: ASV:abcd123      (the taxon ID that has been assigned based on sequence clustering)
organismQuantity: 2      (the number of times the taxon is seen)
organismQuantityType: DNA sequence reads
sampleSizeValue: 7      (the number of reads made at that location/date pairing)
sampleSizeUnit: DNA sequence reads

This would allow us to

Calculate the relative abundance per taxon (2/7 in this example, so about 29%)
Add a threshold so that there needs to be sufficient evidence of a taxon (> N%) to e.g. remove the spurious records of land mammals in the middle of the ocean.

Is this correct please? If so, we should document it and the implications of the approach for different scenarios.
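
To make the proposal concrete, here is a rough Python sketch of the location+date grouping with the fields listed above. It assumes each input record already carries a taxonID assigned by sequence clustering. The function name summarise_reads, the min_share parameter and the relativeAbundance output key are all hypothetical names used only for illustration, and dropping records without a location or date is an assumption.

```python
from collections import Counter, defaultdict


def summarise_reads(records, min_share=0.0):
    """Group reads by location and date, then count reads per taxon.

    Hypothetical helper: taxa whose share of the reads in a group is
    not above min_share are dropped.
    """
    reads_per_group = defaultdict(Counter)
    for rec in records:
        # Assumption: records without location or date cannot be grouped.
        if not rec.get("location") or not rec.get("collection_date"):
            continue
        key = (rec["location"], rec["collection_date"])
        reads_per_group[key][rec["taxonID"]] += 1

    rows = []
    for (location, date), taxa in reads_per_group.items():
        sample_size = sum(taxa.values())  # all reads at this location/date
        for taxon_id, quantity in taxa.items():
            share = quantity / sample_size
            if share <= min_share:
                continue  # not enough evidence for this taxon
            rows.append({
                "location": location,
                "collection_date": date,
                "taxonID": taxon_id,
                "organismQuantity": quantity,
                "organismQuantityType": "DNA sequence reads",
                "sampleSizeValue": sample_size,
                "sampleSizeUnit": "DNA sequence reads",
                "relativeAbundance": share,
            })
    return rows
```

With two reads of one taxon out of seven at a location/date pairing, the row for that taxon gets organismQuantity 2 and sampleSizeValue 7, and a min_share of 0.3 would filter it out.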

ManonGros commented 3 years ago

@thomasstjerne I don't think that having one entry per MD5 is a good idea. One of the main reasons we want to deduplicate is the type of issue I mentioned above (https://github.com/gbif/portal-feedback/issues/2064), where we had 3 million records for one tree. I went back and checked a few of these records and they have different MD5s, so this wouldn't solve the issue. See the examples below:

thomasstjerne commented 3 years ago

@ManonGros this is definitely something to be aware of; indeed, there will be different ASVs. This is, however, a rather extreme case because it is a whole genome. The Picea glauca case seems like a good test: there seem to be 40 unique sequences representing different parts of the genome from that study (https://www.ncbi.nlm.nih.gov/nuccore?term=83435%5BBioProject%5D). Also, it seems that the sequences you point to have been removed and replaced by the assembly (https://www.ncbi.nlm.nih.gov/nuccore/ALWZ04S2210885). I would hope that the API won't give us logically deleted records.

So if the 3 million occurrences are collapsed into 40 occurrences of the same tree, while keeping the information about the 40 distinct ASVs, wouldn't that be OK?

If it doesn't work, you are right that we would have to group by the scientific name.

It's just that with metabarcoding data with marker genes / fragments, it would be really useful to be able to query by ASV identity, and we would lose this by grouping on scientificName.

ManonGros commented 3 years ago

I don't know if the API would return those 3M records, but they can be found using the web interface: https://www.ebi.ac.uk/ebisearch/search.ebi?db=emblcon&query=Picea%20glauca&size=15

40 ASVs could be ok but I suggest that we check if this would actually remove most duplicates for some of the other cases we have seen before. I am thinking about the 169K bat records that were also excluded from the GBIF dataset: https://www.ebi.ac.uk/ebisearch/search.ebi?db=emblcon&query=Myotis%20brandtii&size=15, the 81K nematodes of NZ, etc.

We need to estimate how many duplicates we would generate. I guess if it remains in the single or double digits, it could still be ok. We need to keep in mind that most of our users are probably not aware of ASVs and might find the amount of data cleaning to do off-putting.

dschigel commented 3 years ago

We resume the deduplication discussion, aiming to lump multi-read occurrences. We look at two striking cases together, Myotis (GBIF: https://www.gbif-uat.org/occurrence/2886368364 ENA: https://www.ebi.ac.uk/ena/browser/api/embl/KE161411.1?lineLimit=1000) and Picea (GBIF: https://www.gbif-uat.org/occurrence/2892906219, ENA: https://www.ebi.ac.uk/ena/browser/api/embl/ALWZ04S0000610). These fields were used to build an adaptor https://www.ebi.ac.uk/ena/portal/api/returnFields?dataPortal=ena&format=json&result=sequence.

QUESTIONS

Is there a field, from either of the two ENA APIs, that can identify assembly data?
And can it be used for deduplication of such records?
Is DATACLASS: CON such a field?
Anything non-null in "Genome-Assembly-Data-START" or similar?
Any md5 possibilities as discussed above?
Guy's recommendations https://github.com/gbif/portal-feedback/issues/2064#issuecomment-535791659, note "for deep semantics (long history and different annotation practices depending on source, etc.) good to discuss with ENA team"

ManonGros commented 3 years ago

Thanks @dschigel for the suggestions!

I have checked the md5 for the example mentioned and they are not helpful. Actually, if I remember correctly @thomasstjerne gave it a try as well but the dataset still contained all these duplicates.
As far as I can tell, fields such as DATACLASS: CON or Genome-Assembly-Data-START are not available via the API directly. @mike-podolskiy90 would have to write something to check each record queried (which I don't think would be very efficient).
Guy's suggestion to check the accession numbers in order to identify if a record is part of a contig set (https://github.com/gbif/portal-feedback/issues/2064#issuecomment-535791659) seems to fit the Picea example. However, it wouldn't work for this other example where different mRNA were sequenced (https://www.ebi.ac.uk/ena/browser/api/embl/AB185392.1?lineLimit=1000 and https://www.ebi.ac.uk/ena/browser/api/embl/AB185391.1?lineLimit=1000). But maybe it wouldn't be as bad.
But in any case, knowing that a record is part of a contig doesn't tell us which other records it should be clustered with. We still need to figure that out somehow. Maybe cluster based on the first four characters of the accession numbers? Or just location/date/species?

@mike-podolskiy90 how much work would it be to test the following two approaches on this dataset: INSDC sequences - excluding environmental samples and sequences associated with host organisms?

Group the data by scientific name/date/location (and catalogue number when available)
Check the accession numbers (=occurrenceID) and if they match "[A-Z]{4}[0-9]{2}S?[0-9]{6,8}" or "[A-Z]{6}[0-9]{2}S?[0-9]{7,9}", group them based on [A-Z]{4}[0-9]{2} or [A-Z]{6}[0-9]{2} (depending on what they matched).
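
As a starting point for the second approach, here is a small Python sketch using the two patterns above. The names accession_group_key and group_by_accession_prefix are made up for illustration. It assumes the accession is matched as a whole and carries no version suffix such as ".1", and that accessions matching neither pattern stay in a group of their own.

```python
import re
from collections import defaultdict

# The two patterns quoted above, with the grouping prefix captured.
PREFIX_PATTERNS = [
    re.compile(r"([A-Z]{4}[0-9]{2})S?[0-9]{6,8}"),
    re.compile(r"([A-Z]{6}[0-9]{2})S?[0-9]{7,9}"),
]


def accession_group_key(accession):
    """Return the grouping prefix, or None if neither pattern matches."""
    for pattern in PREFIX_PATTERNS:
        match = pattern.fullmatch(accession)
        if match:
            return match.group(1)
    return None


def group_by_accession_prefix(accessions):
    """Group accessions by prefix; non-matching ones keep their own group."""
    groups = defaultdict(list)
    for accession in accessions:
        # Assumption: an accession that matches nothing is its own group.
        key = accession_group_key(accession) or accession
        groups[key].append(accession)
    return dict(groups)
```

With the accessions mentioned in this thread, ALWZ04S2210885 and ALWZ04S0000610 would both land in the ALWZ04 group, while AB185391 and AB185392 match neither pattern and stay separate.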

mike-podolskiy90 commented 3 years ago

@ManonGros a couple of days, I guess

gbif / embl-adapter

Deduplicate entries before generating DwCA #3