geronimp / enrichM

Toolbox for comparative genomics of MAGs

Scripts to update the database #106

Closed: apcamargo closed this issue 4 years ago

apcamargo commented 4 years ago

Hey @geronimp!

I see that it's been some time since EnrichM's database was last updated. I imagine that you don't have the time to maintain the database these days. It's understandable.

I just updated my database to use the latest versions of the Pfam and KofamKOALA databases. Replacing the HMM and threshold files is easy, but the remaining files require some manual work and I can't guarantee that I'm doing things the same way you did.

Do you think you can provide the scripts to generate the files in the database (e.g., the dictionaries in the pickle files, the KEGG module definition file, etc.)?

geronimp commented 4 years ago

Hi there,

Thank you for your interest in enrichM. I just merged an older version of the script that was used to generate those files, but the way the KEGG module definitions and the remaining pickle files are generated remains unchanged:

https://github.com/geronimp/enrichM/pull/97/commits/7258079bb80e8a861e94e5a60bea885d9bb02f40

You should be able to pull out what you need from there.

Thanks,

Joel

apcamargo commented 4 years ago

Thanks @geronimp !

apcamargo commented 4 years ago

Hey @geronimp!

I've updated the KEGG database (to release 94.0+) and Pfam (to release 33.1). I managed to update all of the KO/Pfam-related files (with the exception of ko00000.tsv). Would you be interested in the files?

SvenTobias-Hunefeldt commented 4 years ago

Hi @apcamargo

I'm running the tool at the moment and was wondering if there was a way for you to share the updated database files?

Best, Sven

apcamargo commented 4 years ago

Hey @SvenTobias-Hunefeldt

The database itself is quite large, but I can send you the updated pickle files and the thresholds for the HMM-based KO assignment (which I found works better than the Diamond approach to assign KOs).

You can download the newest Pfam and KEGG HMMs from the links below:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.1/Pfam-A.hmm.gz
ftp://ftp.genome.jp/pub/db/kofam/archives/2020-05-10/profiles.tar.gz

The KEGG HMMs are in individual files, so you must concatenate them.

You should replace the pfam.hmm and ko.hmm files within the databases directory.
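
In case it helps, here is a minimal sketch of the concatenation step in Python. It assumes the KEGG archive has already been extracted and that the individual profiles sit in one directory as `*.hmm` files; the directory name `profiles` and the function name `concatenate_hmms` are assumptions for illustration, not part of EnrichM.

```python
import shutil
from pathlib import Path


def concatenate_hmms(profile_dir, output_path):
    """Concatenate every *.hmm file in profile_dir into a single file.

    Returns the number of profile files that were written.
    """
    profile_dir = Path(profile_dir)
    output_path = Path(output_path)
    # Sort for a reproducible order, and skip the output file itself in case
    # it is being written into the same directory as the profiles.
    hmm_files = sorted(
        p for p in profile_dir.glob("*.hmm")
        if p.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as out:
        for hmm_file in hmm_files:
            # Copy raw bytes so each profile is appended unchanged.
            with open(hmm_file, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(hmm_files)
```

Calling `concatenate_hmms("profiles", "ko.hmm")` would then produce a single `ko.hmm` that can be copied over the existing one. The Pfam download is already one file, so it only needs to be decompressed and renamed to `pfam.hmm`.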

Let me know if you manage to do that, and then I'll upload the smaller files.

SvenTobias-Hunefeldt commented 4 years ago

Hi @apcamargo,

I've replaced the database files (thanks for the links!).

Also, when you said HMM works better than Diamond, did you mean speed or accuracy? I'm getting the same results from both.

apcamargo commented 4 years ago

Here you go! Just replace the original files.

Actually, I've found that the HMM-based KO assignment works better for me. It is a bit more sensitive and, now that you downloaded a newer database, it is based on a more recent KEGG release.