apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Would there be a smaller database available to use for testing? #29

Closed AnnaSyme closed 10 months ago

AnnaSyme commented 10 months ago

I was wondering whether you know of a smaller database, on the order of megabytes, that could be used to test this tool?

Thanks if possible!

apcamargo commented 10 months ago

Hi @AnnaSyme, are you referring to the marker database or to a FASTA file with sequences that you can use to test the tool? If the latter, wouldn't the genome used in the quickstart guide be enough for a quick test?

AnnaSyme commented 10 months ago

Hi @apcamargo, yes, I'm referring to the marker database. Thanks!

apcamargo commented 10 months ago

Ahh, ok! In this case you can use the --use-minimal-db parameter of the genomad annotate command. It will annotate the proteins with a subset of 42,098 markers. Just keep in mind that the classification performance will be below what you would expect from running geNomad with the full set of markers.

Because --use-minimal-db is not exposed to the end-to-end command, you'll have to run all the modules separately:

genomad annotate --use-minimal-db metagenome.fna genomad_output genomad_db
genomad find-proviruses metagenome.fna genomad_output genomad_db
genomad marker-classification metagenome.fna genomad_output genomad_db
genomad nn-classification metagenome.fna genomad_output
genomad aggregated-classification metagenome.fna genomad_output
# score-calibration is optional and not turned on by default in the end-to-end command
genomad score-calibration metagenome.fna genomad_output
genomad summary metagenome.fna genomad_output
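
For repeated test runs, the module-by-module sequence above can be wrapped in a small script. This is only a sketch: the variable names and the run helper are illustrative and not part of geNomad, and the paths are the placeholder names used above. By default it only prints the commands; setting RUN=1 in the environment executes them.

```shell
#!/bin/sh
# Sketch: run the geNomad modules one by one with the minimal marker set.
# INPUT, OUTPUT, DB and run() are illustrative names, not geNomad features.
set -eu

INPUT=metagenome.fna
OUTPUT=genomad_output
DB=genomad_db

# Print each command; execute it only when RUN=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Only the annotate step takes --use-minimal-db.
run genomad annotate --use-minimal-db "$INPUT" "$OUTPUT" "$DB"

# Modules that take the database directory as a third argument.
for module in find-proviruses marker-classification; do
    run genomad "$module" "$INPUT" "$OUTPUT" "$DB"
done

# Modules that only take the input and output paths.
# The optional score-calibration step would go before summary.
for module in nn-classification aggregated-classification summary; do
    run genomad "$module" "$INPUT" "$OUTPUT"
done
```

Because of set -e, the script stops at the first module that fails, so later steps never run on incomplete output.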

Alternatively, you can reduce the search sensitivity (for instance, by setting --sensitivity 1.4) and then use the end-to-end command to run the whole pipeline with the full set of markers. Again, you can expect the classification performance to take a hit.
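
As a rough sketch of this reduced-sensitivity alternative, assuming the same placeholder file names as in the commands above, the end-to-end call could look like the following. The run helper and the variable names are illustrative only. The script prints the command by default and executes it when RUN=1 is set.

```shell
#!/bin/sh
# Sketch: end-to-end run with the full marker set but a lower search sensitivity.
# SENSITIVITY and run() are illustrative names, not geNomad features.
set -eu

INPUT=metagenome.fna
OUTPUT=genomad_output
DB=genomad_db
SENSITIVITY=1.4   # the example value mentioned above

# Print the command; execute it only when RUN=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

run genomad end-to-end --sensitivity "$SENSITIVITY" "$INPUT" "$OUTPUT" "$DB"
```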

apcamargo commented 10 months ago

If you just want to reduce the size of the database, you can do the following:

cd genomad_db
mmseqs createsubdb mini_set_ids genomad_db genomad_mini_db --subdb-mode 0
rm genomad_db

This will remove the full database file (1.4G) and replace it with a reduced version (348M). You will only be able to run geNomad with the --use-minimal-db parameter if you do that, though.
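
Since the rm step cannot be undone, one cautious variation is to delete the full file only after confirming that the reduced one was actually written. The helper below is a hypothetical sketch, not part of geNomad or MMseqs2. It only checks that the reduced file exists and is non-empty, so it should still be chained after a successful createsubdb call, as shown in the trailing comment.

```shell
#!/bin/sh
# Hypothetical helper (not part of geNomad or MMseqs2): delete the full
# database file only if the reduced one exists and is non-empty.
remove_full_db() {
    mini=$1
    full=$2
    if [ -s "$mini" ]; then
        rm -f -- "$full"
    else
        echo "reduced database '$mini' is missing or empty; keeping '$full'" >&2
        return 1
    fi
}

# Usage, from inside the genomad_db directory:
#   mmseqs createsubdb mini_set_ids genomad_db genomad_mini_db --subdb-mode 0 \
#       && remove_full_db genomad_mini_db genomad_db
```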

AnnaSyme commented 10 months ago

Thanks so much @apcamargo, this will be really useful.

apcamargo commented 10 months ago

Sure thing! :)

I'll close this issue for now. Let me know if you have any problems.