pipeline not filtering duplicates

johrstrom commented 2 years ago

Taken from an email from @shastara

MF743527_1414715705	MF743527	1414715705	44.453	-75.865	MATERIAL_SAMPLE	0	GEODETIC_DATUM_ASSUMED_WGS84	g
MF743527_1842029920	MF743527	1842029920	44.45	-75.86	PRESERVED_SPECIMEN	0	GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID	g
MF743527_2305321812	MF743527	2305321812	44.45	-75.86	PRESERVED_SPECIMEN	0	COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84	g

This is an example where there are multiple GBIF IDs but only one GenBank accession. Since the geographic coordinates are the same, they should not be flagged with 'g'; they should instead go through the flowchart (fig. 2, attached), which checks other information for that occurrence from GBIF (event date, species name, etc.). Occurrences flagged with 'g' that have the same GenBank ID but different GBIF IDs and different coordinates need to be duplicated in the alignment. In the alignment, there is only one sequence (MF743527_1414715705). Right now there isn't enough information for me to see whether these occurrences should be removed or the sequence should be duplicated in the fasta file. I think adding the event date from GBIF to the occurrences file would also be helpful here.

MG361146_1841582353	MG361146	1841582353	43.51	-80.17	PRESERVED_SPECIMEN	0	GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID	d
MG361146_2306891750	MG361146	2306891750	43.52	-80.17	PRESERVED_SPECIMEN	0	COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84	d

In this case, the 'd' flag should be 'g' because the coordinates are different. Also, only MG361146_1841582353 is in the alignment file, but there are multiple occurrences with different coordinates, so the sequence needs to be duplicated and named MG361146_2306891750.
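The duplication step described above could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function name `duplicate_sequences`, the in-memory dict representation of the alignment, and the `(occurrence_id, accession, flag)` tuples are all assumptions, and the sequence string is a placeholder.

```python
def duplicate_sequences(alignment, occurrences):
    """Return a copy of `alignment` with an entry for every 'g'-flagged occurrence.

    alignment:   dict mapping "ACCESSION_GBIFID" -> sequence string (hypothetical layout)
    occurrences: iterable of (occurrence_id, accession, flag) tuples (hypothetical layout)
    """
    # Remember one existing sequence per accession. The GBIF ID is the last
    # underscore-separated field, so split from the right.
    by_accession = {}
    for name, seq in alignment.items():
        accession = name.rsplit("_", 1)[0]
        by_accession.setdefault(accession, seq)

    result = dict(alignment)
    for occ_id, accession, flag in occurrences:
        # Only 'g' occurrences missing from the alignment get a copied sequence.
        if flag == "g" and occ_id not in result and accession in by_accession:
            result[occ_id] = by_accession[accession]
    return result


# Placeholder sequence; the IDs are the MG361146 rows quoted above,
# with the flag already corrected to 'g'.
alignment = {"MG361146_1841582353": "ACGT"}
occurrences = [
    ("MG361146_1841582353", "MG361146", "g"),
    ("MG361146_2306891750", "MG361146", "g"),
]
expanded = duplicate_sequences(alignment, occurrences)
```

With the MG361146 rows, `expanded` would hold the same sequence under both names, which is what the alignment file is currently missing.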

KR378450_1414258620	KR378450	1414258620	46.6553	-60.4285	MATERIAL_SAMPLE	0	GEODETIC_DATUM_ASSUMED_WGS84	g
KR378450_1842029948	KR378450	1842029948	46.65	-60.42	PRESERVED_SPECIMEN	0	GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID	g
KR378450_2308488211	KR378450	2308488211	46.66	-60.43	PRESERVED_SPECIMEN	0	COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84	g

In this case, the 'g' flag makes sense because the geographic coordinates are different (though they are supposed to be rounded to 2 decimals and they are not), but the sequences are not in the alignment (again, only the first one is there).
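To illustrate the rounding point, here is a minimal sketch of normalizing coordinates to 2 decimals before comparing them. The helper names are hypothetical, and the rounding mode (half-up) is an assumption, since the thread does not say which mode the pipeline is meant to use. Parsing the values as `Decimal` from the text avoids binary floating-point surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def round_coord(value):
    """Round a coordinate given as text to 2 decimal places (half-up is an assumption)."""
    return Decimal(value).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)


def rounded_pair(lat, lng):
    """Return the (lat, lng) pair rounded to 2 decimals, suitable for equality checks."""
    return (round_coord(lat), round_coord(lng))


# The three KR378450 rows quoted above.
first = rounded_pair("46.6553", "-60.4285")
second = rounded_pair("46.65", "-60.42")
third = rounded_pair("46.66", "-60.43")
```

Under these assumptions the first and third KR378450 rows end up with identical coordinates once rounded, while the second stays distinct, so whether rounding is applied changes which records count as sharing a location.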

johrstrom commented 2 years ago

More from @skdecker

Duplicated occurrences (same accession, different source ID, coordinates seemingly differ only due to rounding)
For these (e.g., Myotis nigricans (KX814404) and Pipistrellus kuhlii (JF443066), attached), it seems that the duplications are due to them being entered into GBIF once with only the GenBank accession and then again with the GenBank and BOLD information. The BOLD site includes the GenBank IDs, so it seems to be a break in the pipeline, where BOLD records are supposed to be discarded if they include a GB accession already in phylogatR (3a of the attached BOLD sorting scheme).
"Random" entries in the occurrence.txt not associated with a sequence in the .fa or .afa files
Two examples attached (Barbastella barbastellus (MH accessions) and Platyrrhinus aurarius (KM accessions)). I looked up the accession numbers provided in the occurrence files and they are for genes that are not represented in the genes.txt file or download. Both of these examples only had sequences for COI in the download, but these extra occurrences were associated with atp7a genes on GenBank. This could just be an error in how the GBIF data are being aggregated (GBIF IDs associated with GB accessions but not the correct genes?), but I hadn't run into that problem until my most recent downloads.
There are disproportionately more misidentified sequences/specimens from the BOLD data compared to the GB data, but I assume that it varies by taxonomic group, and I'm not sure there's really anything we can do about that. And we do expect a certain amount of error in such datasets.
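The first two checks above could be sketched like this. Everything here is hypothetical: the function names, the `genbank_accession` key, and the record IDs are invented for illustration, and the real BOLD sorting scheme (step 3a) may involve more conditions than this single membership test.

```python
def discard_known_bold(bold_records, known_accessions):
    """Drop BOLD records whose GenBank accession is already known.

    bold_records:     list of dicts with a 'genbank_accession' key (hypothetical layout)
    known_accessions: set of GenBank accessions already loaded from GenBank
    """
    return [
        record
        for record in bold_records
        if record.get("genbank_accession") not in known_accessions
    ]


def orphan_occurrences(occurrence_ids, fasta_ids):
    """Return occurrence IDs that have no matching sequence name in the fasta files."""
    return sorted(set(occurrence_ids) - set(fasta_ids))


# Made-up BOLD records; KX814404 is the accession mentioned above.
bold = [
    {"id": "bold-a", "genbank_accession": "KX814404"},
    {"id": "bold-b", "genbank_accession": None},
]
kept = discard_known_bold(bold, {"KX814404"})
orphans = orphan_occurrences(["A_1", "B_2"], ["A_1"])
```

The orphan check could run as a sanity test on each download, which might have caught the atp7a occurrences that had no COI sequence.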

BadBats.zip

johrstrom commented 2 years ago

From that first message about the MF743527 accession: I got those 'g' flags to turn into 'd' flags, and 1 record is now filtered out.

These are the records in the database I just generated:

id	accession	source_id	lat	lng	basis_of_record	coordinate_uncertainty_in_meters	issue	different_genbank_species	species_id	source	field_number	catalog_number	identifier	event_date	genes	flag
1511772	MF743527	1842029920	44.45	-75.86	PRESERVED_SPECIMEN	GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID	46665	0	GMP#03815	CNTID4912-15	BIOUG21001-B10	2014-06-18T00:00:00	COI	d
1511773	MF743527	2305321812	44.45	-75.86	PRESERVED_SPECIMEN	COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84	46665	0	MF743527	2014-06-25T00:00:00	COI	d

I believe I've also found a bug in flagging. We're checking all records as a group instead of doing n^2 pairwise comparisons, so some records could be erroneously flagged that way.
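A rough sketch of the difference between the two approaches, assuming that 'd' means "shares coordinates with another record of the same accession" and 'g' means "coordinates differ" (that reading is inferred from this thread, not taken from the code). Both function names and the `(source_id, (lat, lng))` layout are hypothetical, and coordinates are assumed to be already rounded.

```python
from itertools import combinations


def flag_as_group(records):
    """Group-level check: one flag for every record of the accession."""
    distinct = {coords for _, coords in records}
    flag = "d" if len(distinct) == 1 else "g"
    return {source_id: flag for source_id, _ in records}


def flag_pairwise(records):
    """Pairwise check: a record is 'd' only if some other record shares its coordinates."""
    flags = {source_id: "g" for source_id, _ in records}
    for (id_a, coords_a), (id_b, coords_b) in combinations(records, 2):
        if coords_a == coords_b:
            flags[id_a] = flags[id_b] = "d"
    return flags


# KR378450 rows from the first comment, with coordinates rounded to 2 decimals.
records = [
    ("1414258620", ("46.66", "-60.43")),
    ("1842029948", ("46.65", "-60.42")),
    ("2308488211", ("46.66", "-60.43")),
]
group_flags = flag_as_group(records)
pair_flags = flag_pairwise(records)
```

For this accession the group-level check marks all three records 'g', while the pairwise check marks the two coinciding records 'd' and leaves only 1842029948 as 'g'. How a lone record with no partner should be flagged is left open here.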

johrstrom commented 2 years ago

Related to #74. Indeed, once this fix is published, we can move on to #74.

OSC / phylogatr-web

pipeline not filtering duplicates #12