PHI-base / data

Archives of PHI-base data releases, and other data.
Creative Commons Attribution 4.0 International

Add lists of PHI-Canto pathogen and host strains #11

Closed: CuzickA closed this issue 2 years ago

CuzickA commented 2 years ago

Required for moving forward with PHI-Canto roll out and manuscript preparation.

CuzickA commented 2 years ago

@jseager7 are the strain lists generated from the latest PHI-base 4.12 release?

jseager7 commented 2 years ago

The PHI-Canto strain lists were last updated in April 2021, so I'm guessing not. I think the strain lists diverged from PHI-base a lot more than the species lists because there were lots of strains that weren't suitable for curation.

It would make sense to repurpose my PHI-base strain cleaning pipeline so that it can generate a PHI-Canto strain list from subsequent PHI-base releases, but that depends on how many more PHI-base v4.x releases are going to be made.

Any option could take some time because I'd have to manually compare the values between PHI-Canto and PHI-base v4.12. I can use scripts to speed up some comparisons but I'd still have to write rules for the strains that should be excluded or renamed, where synonyms should be added, etc.
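
As a rough illustration of the kind of scripted comparison described above, here is a minimal Python sketch. It is not the actual PHI-base strain cleaning pipeline: the function name `compare_strains`, the `exclude` and `rename` rule structures, and the example strain values are all assumptions made for illustration. It applies the rename rules to the PHI-base names, drops excluded names, and reports the differences between the two lists.

```python
def compare_strains(canto, phibase, exclude=frozenset(), rename=None):
    """Compare two strain name lists after applying curation rules.

    canto, phibase: iterables of strain names (hypothetical inputs).
    exclude: names unsuitable for curation, checked after renaming.
    rename: mapping from a PHI-base name to its preferred name.
    Returns (to_add, only_in_canto) as sorted lists.
    """
    rename = rename or {}
    cleaned = set()
    for name in phibase:
        name = name.strip()
        # Apply the rename rule first, then the exclusion rule.
        name = rename.get(name, name)
        if name and name not in exclude:
            cleaned.add(name)
    canto_set = {s.strip() for s in canto if s.strip()}
    to_add = sorted(cleaned - canto_set)         # candidates for PHI-Canto
    only_in_canto = sorted(canto_set - cleaned)  # absent from PHI-base
    return to_add, only_in_canto


# Illustrative values only, not taken from PHI-base or PHI-Canto data.
canto = ["PH-1", "Guy11"]
phibase = ["PH-1", "Guy 11", "unknown strain", "70-15"]
to_add, only_in_canto = compare_strains(
    canto,
    phibase,
    exclude={"unknown strain"},
    rename={"Guy 11": "Guy11"},
)
print(to_add, only_in_canto)
```

The exclusion and rename rules would still have to be written by hand, as noted above; a script like this only speeds up finding the names that need a decision.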

CuzickA commented 2 years ago

Hmmm, there will definitely be a PHI-base 4.13 release in ~May 2022 and possibly a 4.14 release in ~Nov 2022, depending on the status of the PHI4 -> PHI5 data migration. MC-curated data from 4.14 may be able to be loaded directly into PHI5.

For the PHI-Canto manuscript it would be nice to have all the lists up-to-date for the PHI-base 4.12 release.

jseager7 commented 2 years ago

Okay, I'll make sure that the new strains are merged before the manuscript is submitted.

jseager7 commented 2 years ago

I'll also make sure they're available in PHI-Canto by the time they're uploaded here.

jseager7 commented 2 years ago

The strain lists have been added by commit 36100324cea71c2cbae28bd4c7a679166ae21c26.