OSC / phylogatr-web

The web app for the Phylogatr Project - https://phylogatr.org/
MIT License

missing shrew data #58

Closed johrstrom closed 2 years ago

johrstrom commented 2 years ago

@parsons463 is reporting an issue where shrew data is being dropped.

parsons463 commented 2 years ago

Data seems to be dropped from the species Sorex palustris and Sorex tundrensis. The search terms used were 'Sorex palustris' and 'Sorex tundrensis'.

When I originally downloaded data for the species Sorex tundrensis back in 2020 I got 139 sequences from 4 different genes. The same download now returns just 18 sequences from the same 4 genes. Looking back at the 2020 occurences.txt file, the first 3 sequences are actually Sorex arcticus, so it makes sense that they were filtered out of the new download, but the other missing sequences seem like good data, from what I can tell. They're the correct species, have unique accessions and gbif IDs, and fit well in the alignments.

The same thing seems to be happening with Sorex palustris. We went from 82 sequences across 4 genes to 20 sequences from just a single gene. I've attached data from both old and new downloads for each of these species here.

phylogatR_comparison_MAR2022.zip

johrstrom commented 2 years ago

Thanks for the info! I'll update this ticket with what I find.

johrstrom commented 2 years ago

Sorry @parsons463 what was the area you searched over?

johrstrom commented 2 years ago

I found the 18 occurrences for Sorex tundrensis by just looking in the database. dev has 18, but production only has 6. I'll look to find the origin of the missing ~120.

johrstrom commented 2 years ago

I can trace this back to this file: gbmam30.seq. I see raw files in production for GenBank version 234. The current version is 248 on https://ftp.ncbi.nlm.nih.gov/genbank/ (at the time of writing, 3-15-22).

The accession_ids for those sequences simply aren't in the GenBank files anymore! I'm not quite sure how the production DB got into the state it's in, but even moving forward those sequences aren't available from GenBank anymore.
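
A minimal sketch of the kind of check described above: collect every accession on the `ACCESSION` lines of a GenBank flat file and report which expected ones are absent. The function names and the sample text are hypothetical and not taken from phylogatr-web. A real file like gbmam30.seq would be opened from disk, and this sketch ignores continuation lines and accession ranges.

```python
import io
from typing import Iterable, Set, TextIO


def accessions_in_flat_file(handle: TextIO) -> Set[str]:
    """Collect accessions from ACCESSION lines of a GenBank flat file."""
    found: Set[str] = set()
    for line in handle:
        # Each record carries a line such as "ACCESSION   HM992599".
        if line.startswith("ACCESSION"):
            found.update(line.split()[1:])
    return found


def missing_accessions(expected: Iterable[str], handle: TextIO) -> Set[str]:
    """Return the expected accessions that the flat file does not contain."""
    return set(expected) - accessions_in_flat_file(handle)


# Made-up, heavily trimmed excerpt standing in for a file like gbmam30.seq.
SAMPLE = """\
LOCUS       HM992599
ACCESSION   HM992599
//
LOCUS       HM992686
ACCESSION   HM992686
//
"""

if __name__ == "__main__":
    # "XX000001" is an invented accession standing in for a dropped sequence.
    print(missing_accessions(["HM992599", "XX000001"], io.StringIO(SAMPLE)))
```

Running the same comparison against both the release 234 and release 248 files would show whether an accession disappeared between releases.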

johrstrom commented 2 years ago

Looking back at this, I'm not sure how it ever worked.

Here's an old occurrence that got filtered (from occurrence.txt.filtered). I'm not sure how we extracted the genes solely from this record. They're supposed to be part of those NCBI URLs, but nothing's given.

897089190 - which is missing

http://www.ncbi.nlm.nih.gov/nuccore/; http://www.ncbi.nlm.nih.gov/nuccore/; http://www.ncbi.nlm.nih.gov/nuccore/        897089190       67.753  -136.067        Animalia        Chordata        Mammalia   Soricomorpha    Soricidae       Sorex   Sorex tundrensis                        PRESERVED_SPECIMEN      TYPE_STATUS_INVALID;OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT;GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID;COORDINATE_UNCERTAINTY_METERS_INVALID             UAM:Mamm:77955  http://arctos.database.museum/guid/UAM:Mamm:77955?seid=1107058  1999-01-01T00:00:00

For reference, here's a good occurrence. You can see the NCBI URLs all have a gene like HM992599 in them. This is what we're trying to extract but are unable to from the previous records.

1145175457 which still exists.

http://www.ncbi.nlm.nih.gov/nuccore/HM992599; http://www.ncbi.nlm.nih.gov/nuccore/HM992686; http://www.ncbi.nlm.nih.gov/nuccore/HM992763        1145175457      61.54973        130.88865       Animalia   Chordata        Mammalia        Soricomorpha    Soricidae       Sorex   Sorex tundrensis                1000.0  PRESERVED_SPECIMEN      TYPE_STATUS_INVALID;OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT               MSB:Mamm:148708 http://arctos.database.museum/guid/MSB:Mamm:148708?seid=1283619 2006-08-16T00:00:00
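
To illustrate the difference between the two records, here is a rough sketch of pulling accessions out of the URL column. It is not the extraction code phylogatr-web actually uses; `extract_accessions` and `NUCCORE_PREFIX` are names made up for this example, and it assumes the column is a `;`-separated list of nuccore URLs as in the two rows above.

```python
from typing import List

NUCCORE_PREFIX = "http://www.ncbi.nlm.nih.gov/nuccore/"


def extract_accessions(field: str) -> List[str]:
    """Pull accessions out of a ';'-separated list of nuccore URLs."""
    accessions = []
    for url in field.split(";"):
        url = url.strip()
        if not url.startswith(NUCCORE_PREFIX):
            continue
        accession = url[len(NUCCORE_PREFIX):].strip("/")
        if accession:  # a bare ".../nuccore/" URL carries no accession
            accessions.append(accession)
    return accessions


# URL column from occurrence 1145175457 (still present).
good = ("http://www.ncbi.nlm.nih.gov/nuccore/HM992599; "
        "http://www.ncbi.nlm.nih.gov/nuccore/HM992686; "
        "http://www.ncbi.nlm.nih.gov/nuccore/HM992763")
# URL column from occurrence 897089190 (missing).
bad = ("http://www.ncbi.nlm.nih.gov/nuccore/; "
       "http://www.ncbi.nlm.nih.gov/nuccore/; "
       "http://www.ncbi.nlm.nih.gov/nuccore/")

print(extract_accessions(good))  # ['HM992599', 'HM992686', 'HM992763']
print(extract_accessions(bad))   # []
```

With nothing after `/nuccore/`, the old record yields an empty list, so there is no accession to join against the GenBank data.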

johrstrom commented 2 years ago

I think I know where these were being dropped and will submit a pull request for the fix shortly.

@parsons463 I have fixed this. Can you verify at https://phylogatr-dev.osc.edu/?

johrstrom commented 2 years ago

Well, tundrensis seems to have more records now. palustris doesn't seem to have JF* type genes like JF436835, so I'll keep looking into that.

johrstrom commented 2 years ago

OK I've figured this out. https://phylogatr.osc.edu/ currently has the best data set it can at this time. Which is to say I think I can close this ticket.

tundrensis currently returns ~72 occurrences out of Alaska. Total occurrences for tundrensis are 205, but most of them have a positive longitude instead of a negative one, i.e., 177.545556 instead of -177.545556.

Re: "palustris doesn't seem to have JF* type genes like JF436835 so I'll keep looking into that."

Species with JF436835 don't live in Alaska - their longitude is around -65, which seems to be Nova Scotia.
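
To show why both cases fall out of an Alaska search, here is a small sketch of a bounding-box test. The box corners are a rough assumption and not the area the app or the original search actually used. The latitudes (65.0 and 45.0) are also invented for illustration, since only the longitudes are given above.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        """True when the point lies inside the box (edges inclusive)."""
        return (self.south <= lat <= self.north
                and self.west <= lon <= self.east)


# Rough, assumed box around Alaska; it ignores the part of the Aleutians
# that crosses the antimeridian.
ALASKA = BoundingBox(south=51.0, west=-180.0, north=72.0, east=-129.0)

# Correctly signed longitude: inside the box.
print(ALASKA.contains(65.0, -177.545556))  # True
# Same point with the sign flipped: outside the box.
print(ALASKA.contains(65.0, 177.545556))   # False
# Longitude around -65 (assumed latitude 45.0): outside the box.
print(ALASKA.contains(45.0, -65.0))        # False
```

A record whose longitude lost its minus sign lands on the other side of the antimeridian, so a search box drawn over Alaska never sees it.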

johrstrom commented 2 years ago

I'm going to close this as I believe it's as complete as it can be. #65 did fix this.

I'll open a ticket for wrong coordinates.