MDU-PHL / emmtyper

emm Automatic Isolate Labeller
GNU General Public License v3.0

Updating the database #21

Open erinyoung opened 1 year ago

erinyoung commented 1 year ago

Hi! I'd like to use emmtyper on some group A strep, but I'm foggy as to how often the database is updated.

Is there a way to update it on my end?

Daniel-VM commented 3 months ago

Hi,

I have added a Python script that:

Downloads and parses emm sequences from CDC's SFTP server.
Generates a multi-FASTA file containing all emm sequences.
Optionally creates a BLAST database from the multi-FASTA file, which can be used as input for emmtyper.

It can be accessed here: https://github.com/Daniel-VM/cdc-utilities
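
The last two steps in that list (writing a multi-FASTA file and optionally building a BLAST database from it) can be sketched roughly as follows. This is not the code from the repository linked above; the function names and the `(header, sequence)` record format are assumptions made for illustration, and the optional step assumes NCBI BLAST+ (`makeblastdb`) is installed and on the PATH.

```python
import subprocess


def write_multifasta(records, path, width=60):
    """Write (header, sequence) pairs to `path` as one multi-FASTA file."""
    with open(path, "w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n")
            # Wrap sequence lines at `width` characters.
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")


def makeblastdb_command(fasta_path, db_prefix):
    """Return the makeblastdb call for a nucleotide database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_prefix]


def build_database(records, fasta_path, db_prefix=None):
    """Write the multi-FASTA; build the BLAST database only if db_prefix is given."""
    write_multifasta(records, fasta_path)
    if db_prefix is not None:
        # Requires BLAST+ to be installed; raises CalledProcessError on failure.
        subprocess.run(makeblastdb_command(fasta_path, db_prefix), check=True)
```

Leaving `db_prefix` as `None` mirrors the "optionally" in the list: the multi-FASTA is always written, and the database is only built on request.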

erinyoung commented 1 week ago

@Daniel-VM, thank you for your script! Forgive me for taking so long to try it out.

Daniel-VM commented 1 week ago

Hi @erinyoung,

I recently discovered that the CDC has uploaded a multifasta file containing all emm sequences, which simplifies things considerably. Now we just need to periodically download the CDC multifasta and build the BLAST database. I recommend using the blastdb version included in the Singularity image emmtyper:0.2.0--py_0, which is available in the Galaxy repository.
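
A minimal sketch of the "periodically download" part, assuming a simple age check on the local copy is good enough. The function names, the 30-day default, and the `url` parameter are all assumptions for illustration; the download URL is deliberately left for the caller to supply, and the BLAST database would still need to be rebuilt with `makeblastdb` after a refresh.

```python
import os
import time
import urllib.request


def needs_refresh(path, max_age_days=30.0, now=None):
    """Return True if `path` is missing or older than `max_age_days`."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / 86400.0
    return age_days > max_age_days


def refresh_multifasta(url, fasta_path, max_age_days=30.0):
    """Download `url` to `fasta_path` if the local copy is stale.

    Returns True when a new copy was downloaded, which is the signal
    to rebuild the BLAST database.
    """
    if not needs_refresh(fasta_path, max_age_days):
        return False
    tmp_path = fasta_path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    # Swap the file in only after the download has finished.
    os.replace(tmp_path, fasta_path)
    return True
```

Downloading to a temporary name first means an interrupted transfer never leaves a truncated multifasta where the database build expects a complete one.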

I hope this helps!

JamesZlosnik commented 4 days ago

Hi @Daniel-VM. Could I just confirm that the right multifasta to use is alltrimmed.tfa from https://ftp.cdc.gov/pub/infectious_diseases/biotech/tsemm/, rather than the untrimmed version that the CDC also offers?

Thanks in advance!

Daniel-VM commented 1 day ago

Hi @JamesZlosnik,

I recently noticed the file you mentioned. In my opinion, alltrimmed.tfa is the file we should use to build the BLAST database for emmtyper.

I'm planning to update the script I mentioned above with alltrimmed.tfa and run a few tests.
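
One kind of test that could be run on a downloaded multifasta before building the database is a basic sanity check on its records. The sketch below is only an assumption about what such a check might look like, not the planned tests themselves; the function names are hypothetical, and it treats the first whitespace-delimited token of each header as the sequence name.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records):
    """Count records and collect duplicated names and empty sequences."""
    seen, duplicates, empty = set(), [], []
    count = 0
    for header, sequence in records:
        count += 1
        # Assumption: the name is the first token of the header line.
        name = header.split()[0] if header.strip() else ""
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
        if not sequence:
            empty.append(name)
    return {"count": count, "duplicates": duplicates, "empty": empty}
```

For a real file this could be called as `summarize(read_fasta(open("alltrimmed.tfa")))`; a non-empty `duplicates` or `empty` list would be worth looking into before running `makeblastdb`.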