Open johrstrom opened 2 years ago
More from @skdecker
From that first message around the MF743527 accession - I got those `g`s to turn into `d`s and filter 1 record.
These are the records in the database I just generated:
id | accession | source_id | lat | lng | basis_of_record | coordinate_uncertainty_in_meters | issue | different_genbank_species | species_id | source | field_number | catalog_number | identifier | event_date | genes | flag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1511772 | MF743527 | 1842029920 | 44.45 | -75.86 | PRESERVED_SPECIMEN | GEODETIC_DATUM_ASSUMED_WGS84;COORDINATE_PRECISION_INVALID | 46665 | 0 | GMP#03815 | CNTID4912-15 | BIOUG21001-B10 | 2014-06-18T00:00:00 | COI | d | ||
1511773 | MF743527 | 2305321812 | 44.45 | -75.86 | PRESERVED_SPECIMEN | COUNTRY_DERIVED_FROM_COORDINATES;GEODETIC_DATUM_ASSUMED_WGS84 | 46665 | 0 | MF743527 | 2014-06-25T00:00:00 | COI | d |
I believe I've also found a bug in flagging. We're checking all records that share an accession as a single group instead of doing n^2 pairwise comparisons, so some records could be flagged erroneously that way.
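To make the difference concrete, here is a minimal Python sketch. It is not the project's code: the record layout, the function names, and the pairwise policy are all assumptions for illustration, and coordinates are assumed to be compared after rounding to 2 decimals.

```python
from itertools import combinations


def coord_key(rec, places=2):
    # Assumption: coordinates are compared after rounding to 2 decimals.
    return (round(rec["lat"], places), round(rec["lng"], places))


def flag_as_group(records):
    """Group-level check: every record sharing an accession gets the same flag."""
    distinct = {coord_key(r) for r in records}
    flag = "g" if len(distinct) > 1 else "d"
    return {r["id"]: flag for r in records}


def flag_pairwise(records):
    """Pairwise check: compare every pair of records.

    Hypothetical policy: the later record of a same-coordinate pair is 'd';
    a record in a different-coordinate pair is 'g' unless it is already 'd'.
    """
    flags = {r["id"]: "" for r in records}
    for a, b in combinations(records, 2):
        if coord_key(a) == coord_key(b):
            flags[b["id"]] = "d"
        else:
            for r in (a, b):
                if flags[r["id"]] != "d":
                    flags[r["id"]] = "g"
    return flags


# Illustrative records for one accession: two share coordinates, one differs.
records = [
    {"id": 1, "lat": 44.45, "lng": -75.86},
    {"id": 2, "lat": 44.45, "lng": -75.86},
    {"id": 3, "lat": 40.00, "lng": -80.00},
]
print(flag_as_group(records))
print(flag_pairwise(records))
```

With these three records the group check marks all of them `g`, while the pairwise check marks record 2 as `d` (a coordinate duplicate of record 1) and records 1 and 3 as `g`.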
Related to #74. Indeed, once this fix is published we can move on to #74.
Taken from an email from @shastara
This is an example where there are multiple GBIF IDs but only one GenBank accession. Since the geographic coordinates are the same, they should not be flagged with 'g', and should be going through the flowchart (fig. 2 attached) checking other information for that occurrence from GBIF (event date, species name, etc). Occurrences flagged with 'g' that have the same GenBank ID, but different GBIF IDs and different coordinates, need to be duplicated in the alignment. In the alignment, there is only one sequence (MF743527_1414715705). Right now there isn't enough information for me to see if these occurrences should be removed, or if the sequence should be duplicated in the fasta file. I think that also adding the event date from GBIF to the occurrences file will be helpful here.

In this case, the 'd' flag should be 'g' because the coordinates are different. And only MG361146_1841582353 is in the alignment file, but multiple occurrences exist that have different coordinates, so a sequence needs to be duplicated and named MG361146_2306891750.

In this case, the 'g' flag makes sense because the geographic coordinates are different (though they are supposed to be rounded to 2 decimals and they are not), but the sequences are not in the alignment (again, only the first one is there).
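The duplication step described in the email could look roughly like the Python sketch below. It is not the project's implementation: the function name, the dict-based alignment, and the placeholder sequence are hypothetical, and it assumes `source_id` in the occurrences table is the GBIF ID used in the `<accession>_<GBIF ID>` sequence names.

```python
def add_duplicate_sequences(alignment, occurrences):
    """Return a copy of `alignment` with one entry per 'g'-flagged occurrence.

    alignment: dict mapping "<accession>_<GBIF ID>" to a sequence string.
    occurrences: iterable of dicts with "accession", "source_id" and "flag".
    """
    result = dict(alignment)

    # Remember one existing sequence per accession to copy from.
    by_accession = {}
    for name, seq in alignment.items():
        accession = name.rsplit("_", 1)[0]  # the GBIF ID is the last field
        by_accession.setdefault(accession, seq)

    for occ in occurrences:
        if occ["flag"] != "g":
            continue
        name = f'{occ["accession"]}_{occ["source_id"]}'
        # Only add entries that are missing and that have a sequence to copy.
        if name not in result and occ["accession"] in by_accession:
            result[name] = by_accession[occ["accession"]]
    return result


# "ACGT" is a placeholder sequence, not real data.
alignment = {"MG361146_1841582353": "ACGT"}
occurrences = [
    {"accession": "MG361146", "source_id": 1841582353, "flag": "g"},
    {"accession": "MG361146", "source_id": 2306891750, "flag": "g"},
]
print(sorted(add_duplicate_sequences(alignment, occurrences)))
```

The sketch copies the sequence rather than renaming it, so the entry already in the alignment stays and each additional `g` occurrence gets its own named copy.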