Closed bernt-matthias closed 2 weeks ago
Do you think the Klebsiella pneumoniae that is used in the guide is small enough?
curl -LJO https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_009025895.1/download\?include_annotation_type\=GENOME_FASTA
The fasta should be fine. I guess one could even use a subsequence of this genome to reduce runtime and memory requirements of the test.
But I was wondering more about the reference data that you have on zenodo (i.e. that is downloaded with genomad download-database).
You could use mmseqs createsubdb to create a subset of the database.
Within the database directory, the mini_set_ids file contains the IDs of 42,098 markers (~20%) that comprise a "mini database" with the most informative markers. This can be used as input to mmseqs createsubdb.
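A minimal sketch of the mini-database step, assuming the MMseqs2 database prefix is genomad_db/genomad_db (as the genomad_db/genomad_db.lookup path in this thread suggests). The function name make_mini_db and the output directory are hypothetical, and any other files geNomad expects in the database directory are not handled here.

```shell
# Hypothetical helper: build a "mini" MMseqs2 database from the marker IDs
# listed in mini_set_ids.
#   $1 = directory of the full geNomad database (contains mini_set_ids)
#   $2 = directory to write the reduced database to (created if missing)
make_mini_db() {
    mkdir -p "$2" &&
    # createsubdb takes: <file with IDs> <input DB prefix> <output DB prefix>
    mmseqs createsubdb "$1/mini_set_ids" "$1/genomad_db" "$2/genomad_db"
}

# Example call (paths are assumptions):
#   make_mini_db genomad_db genomad_db_mini
```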
You could create an even smaller database by including only the markers with hits in the test sequence.
You could use mmseqs createsubdb to create a subset of the database.
Wonderful.
You could create an even smaller database by including only the markers with hits in the test sequence.
Could you tell me where in the output I can find the IDs of the markers for the ids file?
genomad_db/genomad_db.lookup has the ID → marker accession mappings (first and second columns, respectively):
0 GENOMAD.070201.VV 0
1 GENOMAD.179093.PC 0
2 GENOMAD.152930.VV 0
3 GENOMAD.102389.VV 0
4 GENOMAD.094353.VV 0
To get a list of the accessions of markers with hits in the test genome:
awk -v FS="\t" 'NR>1 && $9!="NA" {print $9}' genomad_output/GCF_009025895.1_annotate/GCF_009025895.1_genes.tsv | sort -u
After you create the sub-database it's not guaranteed that the matches will be the same, as the database size will change significantly. It should work for test purposes anyway.
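The accessions printed by the awk command still have to be translated into the numeric IDs from the first column of the .lookup file before they can serve as the IDs file. A minimal sketch of that translation follows; the function name accessions_to_ids and the file names hit_accessions.txt, hit_ids.txt, and genomad_db_test are hypothetical.

```shell
# Hypothetical helper: translate marker accessions into numeric database IDs.
#   $1 = .lookup file (column 1 = ID, column 2 = marker accession)
#   $2 = file with one marker accession per line
# Prints the IDs of all lookup entries whose accession is listed in $2.
accessions_to_ids() {
    awk 'NR == FNR { wanted[$1]; next }   # first file: remember the accessions
         ($2 in wanted) { print $1 }      # lookup file: print matching IDs
        ' "$2" "$1"
}

# Example workflow (paths are assumptions):
#   awk -v FS="\t" 'NR>1 && $9!="NA" {print $9}' \
#       genomad_output/GCF_009025895.1_annotate/GCF_009025895.1_genes.tsv \
#       | sort -u > hit_accessions.txt
#   accessions_to_ids genomad_db/genomad_db.lookup hit_accessions.txt > hit_ids.txt
#   mkdir -p genomad_db_test
#   mmseqs createsubdb hit_ids.txt genomad_db/genomad_db genomad_db_test/genomad_db
```

The IDs come out in the order of the lookup file, not the order of the accession list.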
Excellent. Got it down to 23MB (as tar.gz), which is still too large for our repo, but it will help a lot anyway.
I would put this on zenodo or would you be interested in doing it with your account?
Great!
One thing you can do to reduce the size of the database a bit and make the test faster is to reduce the search sensitivity in geNomad (setting -s 1, for example). This will lead to fewer markers with hits and a shorter runtime.
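A minimal sketch of a low-sensitivity annotation run. Only the -s 1 setting comes from this thread; the function name is hypothetical, and the INPUT OUTPUT DATABASE argument order is an assumption that should be checked against genomad annotate --help.

```shell
# Hypothetical helper: run the annotation step with reduced search sensitivity.
#   $1 = input FASTA, $2 = output directory, $3 = geNomad database directory
# Assumption: genomad annotate takes INPUT OUTPUT DATABASE in that order.
annotate_low_sensitivity() {
    genomad annotate -s 1 "$1" "$2" "$3"
}

# Example call (file and directory names are assumptions):
#   annotate_low_sensitivity GCF_009025895.1.fna genomad_output genomad_db
```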
I think it's best if you upload it yourself, since you'll be using it. But please share the link once it's up!
Thanks for the help. Here is the link: https://zenodo.org/records/11945948
Galaxy tool wrappers should be finished soon as well: https://github.com/Helmholtz-UFZ/galaxy-tools/pull/29
Awesome! Thanks!
Is there any small reference data set (and fasta) that could be used for testing?
Background: I'm thinking about creating a tool wrapper for Galaxy and those require tests.