OSC / phylogatr-web

The web app for the Phylogatr Project - https://phylogatr.org/
MIT License

pipeline not filtering duplicates #12

Open johrstrom opened 2 years ago

johrstrom commented 2 years ago

Taken from an email from @shastara

MF743527_1414715705 MF743527 1414715705 44.453 -75.865 MATERIAL_SAMPLE 0 GEODETIC_DATUM_ASSUMED_WGS84 g
MF743527_1842029920 MF743527 1842029920 44.45 -75.86 PRESERVED_SPECIMEN 0 GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID g
MF743527_2305321812 MF743527 2305321812 44.45 -75.86 PRESERVED_SPECIMEN 0 COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84 g

This is an example where there are multiple GBIF IDs but only one GenBank accession. Since the geographic coordinates are the same, these records should not be flagged with 'g'; they should go through the flowchart (fig. 2 attached), which checks other information for that occurrence from GBIF (event date, species name, etc.). Occurrences flagged with 'g' that have the same GenBank ID but different GBIF IDs and different coordinates need to be duplicated in the alignment. In the alignment, there is only one sequence (MF743527_1414715705). Right now there isn't enough information for me to tell whether these occurrences should be removed or whether the sequence should be duplicated in the fasta file. I think adding the event date from GBIF to the occurrences file would also help here.

MG361146_1841582353 MG361146 1841582353 43.51 -80.17 PRESERVED_SPECIMEN 0 GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID d
MG361146_2306891750 MG361146 2306891750 43.52 -80.17 PRESERVED_SPECIMEN 0 COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84 d

In this case, the 'd' flag should be 'g' because the coordinates are different. Also, only MG361146_1841582353 is in the alignment file, but there are multiple occurrences with different coordinates, so the sequence needs to be duplicated and named MG361146_2306891750.
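The duplication described above could look roughly like the sketch below. It is only an illustration, not the pipeline's code: the function name, the dict-based alignment, and the `(name, accession, flag)` tuples are all assumptions, and the sequence is a placeholder.

```python
def duplicate_flagged_sequences(alignment, occurrences):
    """Return a copy of `alignment` with one entry per 'g'-flagged occurrence.

    alignment:   dict mapping "<accession>_<gbif_id>" to a sequence string
    occurrences: iterable of (name, accession, flag) tuples
    (Both shapes are hypothetical, chosen for this sketch.)
    """
    # Index the sequence already present for each accession.
    sequence_for = {}
    for name, sequence in alignment.items():
        accession = name.rsplit("_", 1)[0]
        sequence_for.setdefault(accession, sequence)

    result = dict(alignment)
    for name, accession, flag in occurrences:
        # Only 'g' occurrences that are missing from the alignment get a copy.
        if flag == "g" and name not in result and accession in sequence_for:
            result[name] = sequence_for[accession]
    return result


# Placeholder sequence; flags are set to 'g' as the report says they should be.
alignment = {"MG361146_1841582353": "ACGT"}
occurrences = [
    ("MG361146_1841582353", "MG361146", "g"),
    ("MG361146_2306891750", "MG361146", "g"),
]
expanded = duplicate_flagged_sequences(alignment, occurrences)
```

Here `expanded` holds the same sequence under both names, and the input `alignment` is left unchanged.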

KR378450_1414258620 KR378450 1414258620 46.6553 -60.4285 MATERIAL_SAMPLE 0 GEODETIC_DATUM_ASSUMED_WGS84 g
KR378450_1842029948 KR378450 1842029948 46.65 -60.42 PRESERVED_SPECIMEN 0 GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID g
KR378450_2308488211 KR378450 2308488211 46.66 -60.43 PRESERVED_SPECIMEN 0 COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84 g

In this case, the 'g' flag makes sense because the geographic coordinates are different (though they are supposed to be rounded to 2 decimals and they are not), but the sequences are not in the alignment (again, only the first one is there).
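One way to compare coordinates only after rounding to 2 decimals is sketched below. The `coord_key` helper is hypothetical, and half-up rounding is an assumption; the thread does not say whether the pipeline rounds or truncates.

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def coord_key(lat, lng):
    """Round a (lat, lng) pair, given as strings, to 2 decimals for comparison."""
    # Decimal avoids binary floating-point surprises at the rounding boundary.
    return tuple(
        Decimal(value).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)
        for value in (lat, lng)
    )


# Coordinates copied from the KR378450 rows above.
rows = [
    ("KR378450_1414258620", "46.6553", "-60.4285"),
    ("KR378450_1842029948", "46.65", "-60.42"),
    ("KR378450_2308488211", "46.66", "-60.43"),
]
keys = {name: coord_key(lat, lng) for name, lat, lng in rows}
```

Under this particular rounding rule, the first and third rows end up with the same key and the second row does not, so the choice between rounding and truncating affects which records compare as equal.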

johrstrom commented 2 years ago

More from @skdecker

  1. Duplicated occurrences (same accession, different source ID; the coordinates seem to differ only due to rounding). For these (e.g., Myotis nigricans (KX814404) and Pipistrellus kuhlii (JF443066), attached), it seems that the duplications are due to them being entered into GBIF with only the GenBank accession and then again with the GenBank and BOLD information. The BOLD site includes the GenBank IDs, so it seems to be a break in the pipeline where BOLD records are supposed to be discarded if they include a GB accession already in phylogatR (3a of the attached BOLD sorting scheme).
  2. "Random" entries in the occurrence.txt not associated with a sequence in the .fa or .afa files. Two examples are attached (Barbastella barbastellus (MH accessions) and Platyrrhinus aurarius (KM accessions)). I looked up the accession numbers provided in the occurrence files and they are for genes that are not represented in the genes.txt file or download. Both of these examples only had sequences for COI in the download, but these extra occurrences were associated with atp7a genes on GenBank. Could just be an error in how the GBIF data are being aggregated (GBIF IDs associated with GB accessions but not the correct genes?), but I hadn't run into that problem until my most recent downloads.
  3. There are disproportionately more mis-identified sequences/specimens from the BOLD data compared to the GB data, but I assume that it varies by taxonomic group and I'm not sure if there's really anything that we can do about that. And we do expect a certain amount of error in such datasets.
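A quick way to spot occurrence entries with no matching sequence is sketched below. It assumes the fasta headers use the same `<accession>_<GBIF ID>` names as the first column of the occurrence file; the function names and the placeholder sequence are made up for illustration.

```python
def fasta_names(fasta_text):
    """Collect the first word of every '>' header line in a fasta string."""
    names = set()
    for line in fasta_text.splitlines():
        header = line[1:].strip() if line.startswith(">") else ""
        if header:
            names.add(header.split()[0])
    return names


def orphan_occurrences(occurrence_names, fasta_text):
    """Return occurrence names that have no sequence in the fasta text."""
    present = fasta_names(fasta_text)
    return [name for name in occurrence_names if name not in present]


# Names taken from the MF743527 rows earlier in this thread; sequence is a placeholder.
fasta_text = ">MF743527_1414715705\nACGT\n"
occurrence_names = [
    "MF743527_1414715705",
    "MF743527_1842029920",
    "MF743527_2305321812",
]
orphans = orphan_occurrences(occurrence_names, fasta_text)
```

With this input, `orphans` lists the two occurrences that are missing from the fasta text.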

BadBats.zip

johrstrom commented 2 years ago

From that first message about the MF743527 accession: I got those 'g' flags to turn into 'd' flags and filtered out 1 record.

These are the records in the database I just generated:

id accession source_id lat lng basis_of_record coordinate_uncertainty_in_meters issue different_genbank_species species_id source field_number catalog_number identifier event_date genes flag
1511772 MF743527 1842029920 44.45 -75.86 PRESERVED_SPECIMEN GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID 46665 0 GMP#03815 CNTID4912-15 BIOUG21001-B10 2014-06-18T00:00:00 COI d
1511773 MF743527 2305321812 44.45 -75.86 PRESERVED_SPECIMEN COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84 46665 0 MF743527 2014-06-25T00:00:00 COI d

I believe I've also found a bug in flagging. We're checking all records as a group instead of making n^2 pairwise comparisons, so there could be some erroneous flagging that way.
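To illustrate the difference between a group-level check and pairwise comparisons, here is a rough sketch. It is not the app's code: the function names are hypothetical, and it assumes 'd' means "shares coordinates with another record of the same accession" and 'g' means "coordinates differ".

```python
from itertools import combinations


def flag_as_group(coords):
    """Group-level check: one verdict applied to every record of the accession."""
    flag = "d" if len(set(coords.values())) == 1 else "g"
    return {source_id: flag for source_id in coords}


def flag_pairwise(coords):
    """Pairwise check: a record is 'd' only if another record shares its coordinates."""
    flags = {source_id: "g" for source_id in coords}
    for a, b in combinations(coords, 2):
        if coords[a] == coords[b]:
            flags[a] = flags[b] = "d"
    return flags


# Coordinates for MF743527, keyed by GBIF source ID, as reported in this thread.
coords = {
    "1414715705": ("44.453", "-75.865"),
    "1842029920": ("44.45", "-75.86"),
    "2305321812": ("44.45", "-75.86"),
}
group_flags = flag_as_group(coords)
pair_flags = flag_pairwise(coords)
```

In this sketch the group-level check marks all three records 'g' because one pair of coordinates differs, while the pairwise check marks the two matching records 'd' and leaves only 1414715705 as 'g'.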

johrstrom commented 2 years ago

Related to #74. Once this fix is published, we can move on to #74.