Small (reference) data for testing

apcamargo / genomad

geNomad: Identification of mobile genetic elements

https://portal.nersc.gov/genomad/

Small (reference) data for testing #104

Closed bernt-matthias closed 2 weeks ago

bernt-matthias commented 2 weeks ago

Is there any small reference data set (and FASTA file) that could be used for testing?

Background: I'm thinking about creating a tool wrapper for Galaxy and those require tests.

apcamargo commented 2 weeks ago

Do you think the Klebsiella pneumoniae genome that is used in the guide is small enough?

curl -LJO https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_009025895.1/download\?include_annotation_type\=GENOME_FASTA

bernt-matthias commented 2 weeks ago

The fasta should be fine. I guess one could even use a subsequence of this genome to reduce runtime and memory requirements of the test.
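A minimal sketch of the subsequence idea, assuming it is acceptable to keep only the first FASTA record and truncate it to a fixed length. The file names (`example.fna`, `subset.fna`) and the `MAX_BP` value are hypothetical, and the tiny inline FASTA only stands in for the downloaded genome:

```shell
# Tiny stand-in FASTA; a real test would start from the downloaded genome.
printf '>contig_1 example\nACGTACGTAC\nGGTTAACCGG\n>contig_2\nTTTT\n' > example.fna

# Keep only the first record and truncate its sequence to MAX_BP bases.
MAX_BP=12
awk -v max="$MAX_BP" '
    /^>/ { if (++records > 1) exit; print; next }  # print first header, stop at the second
    { seq = seq $0 }                               # join wrapped sequence lines
    END { print substr(seq, 1, max) }              # emit the truncated sequence on one line
' example.fna > subset.fna
```

Truncating at an arbitrary position can cut genes in half, so the chosen region should be checked to still produce marker hits before it is used as test input.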

But I was wondering more about the reference data that you have on Zenodo (i.e., the data that is downloaded with genomad download-database).

apcamargo commented 2 weeks ago

You could use mmseqs createsubdb to create a subset of the database.

Within the database directory, the mini_set_ids file contains the IDs of 42,098 markers (~20%) that comprise a "mini database" with the most informative markers. This can be used as input to mmseqs createsubdb.
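The mini database step could look roughly like the sketch below. It assumes the database prefix is `genomad_db/genomad_db` (as in the `.lookup` path mentioned later in the thread) and that `mini_set_ids` sits directly in the database directory; the output directory name `genomad_mini_db` is hypothetical. The call is guarded so the snippet does nothing when MMseqs2 or the database is not available:

```shell
# Input database directory and (hypothetical) output directory.
DB_DIR=${DB_DIR:-genomad_db}
OUT_DIR=${OUT_DIR:-genomad_mini_db}

if command -v mmseqs >/dev/null 2>&1 && [ -f "$DB_DIR/mini_set_ids" ]; then
    mkdir -p "$OUT_DIR"
    # createsubdb arguments: <file with entry IDs> <input DB prefix> <output DB prefix>
    mmseqs createsubdb "$DB_DIR/mini_set_ids" "$DB_DIR/genomad_db" "$OUT_DIR/genomad_db"
    status="created"
else
    # MMseqs2 or the database is missing, so nothing is written.
    status="skipped"
fi
echo "sub-database: $status"
```

Whether geNomad also needs the other files from the original database directory next to the new sub-database is not stated in this thread, so that would need to be checked against a real run.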

You could create an even smaller database if you create a database containing only the markers with hits in the test sequence.

bernt-matthias commented 2 weeks ago

> You could use mmseqs createsubdb to create a subset of the database.

Wonderful.

> You could create an even smaller database if you create a database containing only the markers with hits in the test sequence.

Could you tell me where in the output I can find the IDs of the markers for the ids file?

apcamargo commented 2 weeks ago

genomad_db/genomad_db.lookup has the ID → marker accession mappings (first and second columns, respectively):

0   GENOMAD.070201.VV   0
1   GENOMAD.179093.PC   0
2   GENOMAD.152930.VV   0
3   GENOMAD.102389.VV   0
4   GENOMAD.094353.VV   0

To get a list of the accessions of markers with hits in the test genome:

awk -v FS="\t" 'NR>1 && $9!="NA" {print $9}' genomad_output/GCF_009025895.1_annotate/GCF_009025895.1_genes.tsv | sort -u

After you create the sub-database, it's not guaranteed that the matches will be the same, as the database size will change significantly. It should work for test purposes anyway.
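The two pieces above (accession list and `.lookup` file) still have to be joined to obtain the numeric IDs for the ids file. A minimal sketch with inline sample data in the same tab-separated layout as the `.lookup` excerpt; the file names `example.lookup`, `hit_accessions.txt`, and `hit_ids.txt` are hypothetical:

```shell
# Sample lookup rows (ID, accession, file number) and a sample accession list.
printf '0\tGENOMAD.070201.VV\t0\n1\tGENOMAD.179093.PC\t0\n2\tGENOMAD.152930.VV\t0\n' > example.lookup
printf 'GENOMAD.152930.VV\nGENOMAD.070201.VV\n' > hit_accessions.txt

# First file: remember the wanted accessions.
# Second file: print the ID (column 1) of every row whose accession (column 2) is wanted.
awk -F'\t' 'NR==FNR { want[$1]; next } ($2 in want) { print $1 }' \
    hit_accessions.txt example.lookup > hit_ids.txt

# With real data, hit_ids.txt would then be the first argument of createsubdb, e.g.:
#   mmseqs createsubdb hit_ids.txt genomad_db/genomad_db <output prefix>
```

With real data, `hit_accessions.txt` would be the output of the `awk ... | sort -u` command above and `example.lookup` would be `genomad_db/genomad_db.lookup`. Note that the `NR==FNR` idiom assumes the accession list is not empty.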

bernt-matthias commented 2 weeks ago

Excellent. Got it down to 23 MB (as tar.gz), which is still too large for our repo, but it will help a lot anyway.

I would put this on Zenodo, or would you be interested in doing it with your account?

apcamargo commented 2 weeks ago

Great!

One thing you can do to reduce the size of the database a bit and make the test faster is to reduce the search sensitivity in geNomad (setting -s 1, for example). This will lead to fewer markers with hits, and the runtime will be shorter.

I think it's best if you upload it yourself, since you'll be using it. But please share the link once it's up!

bernt-matthias commented 2 weeks ago

Thanks for the help. Here is the link: https://zenodo.org/records/11945948

Galaxy tool wrappers should be finished soon as well: https://github.com/Helmholtz-UFZ/galaxy-tools/pull/29

apcamargo commented 2 weeks ago

Awesome! Thanks!