OSC / phylogatr-web

The web app for the Phylogatr Project - https://phylogatr.org/
MIT License

GenBank and GBIF mismatch #69

Open johrstrom opened 2 years ago

johrstrom commented 2 years ago

(from @parsons463)

Hi all,

I did a bit more digging into the misidentified records that came up in my latest PhylogatR download, and I think I have a better idea of what's going on now. It looks like all of the issues with incorrect species being included in alignments are from records that have different species names listed in GenBank and GBIF.

For example, the record JF430990_897094035 is incorrectly included in the Sorex cinereus alignment, despite being from Sorex ugyunak. It looks like S. ugyunak used to be a subspecies of S. cinereus (i.e., Sorex cinereus ugyunak), but was then elevated to its own species. The taxonomy was updated in GenBank (where it's listed as Sorex ugyunak) but not on GBIF (where it was incorrectly collapsed back to Sorex cinereus).

In other cases, it seems like the records were just incorrectly entered into either GenBank or GBIF, and because the GenBank and GBIF taxonomy disagree, they ended up in the wrong alignment. This seems to be how both the vole (AY305199_897089152) and the mouse (KF949213_897061606) ended up in a shrew alignment for Sorex cinereus.

Maybe we could think about adding a step to the pipeline where we check whether the GBIF and GenBank taxonomy match and flag any mismatches? I've attached a spreadsheet that lists the records I found to have this issue, in case anyone wants to look into it further. Let me know if you have any questions!

phylogatR_names_issue.xlsx
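The check proposed above could look roughly like the following. This is only a minimal sketch, not the pipeline's actual code: the `Record` class, its field names, and the `EXAMPLE_0001` record are hypothetical, and it assumes each record already carries both the GenBank and the GBIF species name as plain strings.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record shape; the real pipeline's data model may differ.
    record_id: str      # "<GenBank accession>_<GBIF id>"
    genbank_name: str   # species name as listed in GenBank
    gbif_name: str      # species name as listed in GBIF


def binomial(name: str) -> str:
    """Reduce a scientific name to lowercase 'genus species', dropping any subspecies."""
    parts = name.strip().split()
    return " ".join(parts[:2]).lower()


def flag_mismatches(records):
    """Return the records whose GenBank and GBIF binomials differ."""
    return [r for r in records if binomial(r.genbank_name) != binomial(r.gbif_name)]


records = [
    # Names for this record are the ones described in the comment above.
    Record("JF430990_897094035", "Sorex ugyunak", "Sorex cinereus"),
    # Made-up record where both sources agree, for contrast.
    Record("EXAMPLE_0001", "Sorex cinereus", "Sorex cinereus"),
]

for r in flag_mismatches(records):
    print(r.record_id)  # prints JF430990_897094035
```

Comparing only genus and species means a name that still carries a subspecies (for example "Sorex cinereus ugyunak") would count as matching "Sorex cinereus", so whether to compare at the subspecies level is a choice the real check would have to make.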

johrstrom commented 2 years ago

from @shastara

Hi all, well, that makes sense, because we did default to using the GBIF taxonomy, but there should already be a flag to indicate that the GBIF and GenBank taxonomy don't match.

I think that there are some issues like this that are going to be impossible to fix in the pipeline and will need to be manually fixed and tracked - which was one of the things we were hoping to incorporate, and eventually get some numbers on. It also probably makes sense for us to put in tickets to GBIF/GenBank about their errors.

We decided to go with the GBIF taxonomy since it's newer, so its species names might be more likely to be correct. If we switch to defaulting to the GenBank taxonomy (because we need to pick one) we will have the same issues, just with different accessions.

The pipeline cleaning steps will not be able to find all these, but hopefully outlier tests will. I wonder if it's worthwhile to see if those alignments stand out and are easy to find - if so we can use those as examples in some of our data checking scripts.

Thanks Danielle!

The first thing we should do is make sure that those discrepancies are being flagged. Right now it looks like there is a column in the genes.txt file that indicates that an alignment has misidentifications, so I think we are good there.
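One way to confirm that the flag is actually set would be to scan genes.txt for flagged rows. The sketch below is an assumption-heavy illustration: the thread does not give the column name, the delimiter, or the values used, so the `flag` column, the tab delimiter, the truthy values, and the sample rows are all hypothetical placeholders to adjust to the real file.

```python
import csv
import io


def flagged_rows(lines, column="flag", delimiter="\t"):
    """Return the rows of a delimited file whose `column` holds a truthy value.

    `lines` is any iterable of text lines, such as an open file handle.
    The column name and delimiter are assumptions, not the real genes.txt layout.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    truthy = {"1", "true", "yes", "y"}  # assumed encodings of "flagged"
    return [row for row in reader if (row[column] or "").strip().lower() in truthy]


# Hypothetical sample standing in for genes.txt.
sample = "gene\tflag\nCOI\t1\nCYTB\t0\n"
for row in flagged_rows(io.StringIO(sample)):
    print(row["gene"])  # prints COI
```

With a real file this would be called as `flagged_rows(open(path, newline=""), column=...)`, using whatever the column is actually called.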