OSC / phylogatr-web

The web app for the Phylogatr Project - https://phylogatr.org/
MIT License

different_genbank_species column in a genes index file doesn't capture all the variations #15

Open johrstrom opened 2 years ago

johrstrom commented 2 years ago

[duplicated the UCR repo]

Right now the pipeline writes the "different_genbank_species" attribute to a Species at the first instance where a difference is observed between an Occurrence and a Genbank record. (A difference means Genbank's taxonomy has a Species name that differs from the GBIF Occurrence taxonomy, which is the taxonomy we use.) This is by design.

However, some Species have multiple variations (one case had 70), though some of those variations could be collapsed with a clever rule.

I left different_genbank_species captured on each Occurrence, so in the rails console you can run something like:

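# List every species whose occurrences carry more than one distinct Genbank name.
Species.where.not(different_genbank_species: nil).each do |s|
  # compact drops occurrences that have no recorded difference (nil)
  x = s.occurrences.pluck(:different_genbank_species).compact.uniq
  puts x if x.count > 1
end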

to see where this is a problem.
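
To make the "clever rule" idea concrete, here is a minimal plain-Ruby sketch of one possible collapsing rule. It is only an assumption about what such a rule might look like: the names `QUALIFIERS`, `normalize_species_name` and `collapse_variations` are hypothetical, the sample names are made up, and the real variations in the data may need something quite different.

```ruby
# Hypothetical rule, not the pipeline's behavior: treat names as the same
# variation when they differ only by case, extra whitespace, or an
# open-nomenclature qualifier such as "cf.", "aff." or "sp.".
QUALIFIERS = /\b(?:cf|aff|sp)\.?(?=\s|\z)/

def normalize_species_name(name)
  name.downcase.gsub(QUALIFIERS, " ").split.join(" ")
end

# Groups the raw names by their normalized form, ignoring nil entries.
def collapse_variations(names)
  names.compact.group_by { |n| normalize_species_name(n) }
end

# Made-up example values, not taken from the database.
variations = ["Homo sapiens", "homo  sapiens", "Homo cf. sapiens",
              "Homo neanderthalensis", nil]

collapse_variations(variations).each do |key, originals|
  puts "#{key}: #{originals.size} variation(s)"
end
```

In the rails console, the array plucked from `s.occurrences` could be passed to `collapse_variations` to see how many variations survive the rule.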

Another issue is that we don't differentiate "different_genbank_species" per gene, only at the species level. It is likely good enough for the user to know that there are differences and to see one example difference, though, since they can go back to the original records and view the details if they are really interested.
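
If per-gene detail were ever wanted, the grouping could look roughly like the sketch below. It is plain Ruby rather than the app's models: the `Record` struct, its field names, and the placeholder species and gene values are all assumptions for illustration, and the real schema may store the gene differently.

```ruby
# Hypothetical record shape, standing in for whatever row ties an
# occurrence to a gene in the real schema.
Record = Struct.new(:species, :gene, :different_genbank_species)

# Returns { [species, gene] => [distinct Genbank names] }, skipping
# records that have no recorded difference.
def differences_by_gene(records)
  records
    .reject { |r| r.different_genbank_species.nil? }
    .group_by { |r| [r.species, r.gene] }
    .transform_values { |rs| rs.map(&:different_genbank_species).uniq }
end

# Made-up placeholder data.
records = [
  Record.new("Aus bus", "COI", "Aus buss"),
  Record.new("Aus bus", "COI", "Aus buss"),
  Record.new("Aus bus", "16S", "Aus cus"),
  Record.new("Aus bus", "16S", nil)
]

differences_by_gene(records).each do |(species, gene), names|
  puts "#{species} / #{gene}: #{names.join(', ')}"
end
```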