DessimozLab / OMArk

GNU Lesser General Public License v3.0
53 stars 6 forks source link

OMArk error #13

Closed vkkodali closed 10 months ago

vkkodali commented 1 year ago

Pardon the uninformative title for the issue; I wasn't sure how exactly I should describe the error. I am trying to run OMArk (version 0.2.4) on a set of about 38000 proteins from the species Mercenaria mercenaria. Here's how I went about it:

$ omamer --version
0.2.3
$ omamer search --db ../omamer_databases/Metazoa.h5 --query proteins.fa --nthreads 8 --out proteins.omamer --log_level info
Searching: 0it [00:00, ?it/s]0 second to load query sequences
23 second to cumulate counts of reference HOGs
12 second to load the k-mer table
58 second for the actual search (~ 171 query/second)
Searching: 10000it [01:34, 105.90it/s]0 second to load query sequences
0 second to cumulate counts of reference HOGs
0 second to load the k-mer table
Searching: 10000it [01:50, 105.90it/s]16 second for the actual search (~ 595 query/second)
Searching: 20000it [01:53, 200.19it/s]0 second to load query sequences
0 second to cumulate counts of reference HOGs
0 second to load the k-mer table
15 second for the actual search (~ 642 query/second)
Searching: 30000it [02:09, 290.52it/s]0 second to load query sequences
0 second to cumulate counts of reference HOGs
0 second to load the k-mer table
Searching: 30000it [02:20, 290.52it/s]15 second for the actual search (~ 546 query/second)
Searching: 38330it [02:24, 264.43it/s]
$ omark --file proteins.omamer --database ../omamer_databases/Metazoa.h5 --outputFolder OMark_cli_lpg_results --verbose --taxid 6596
INFO: Starting OMArk
INFO: Input parameters passed validity check
INFO: Extracting data from input file: proteins.omamer
INFO: Determinating species composition from HOG placements
Traceback (most recent call last):
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/kodalivk/.conda/envs/omarkenv/bin/omark", line 45, in <module>
    omark.launcher(arg)
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/kodalivk/.conda/envs/omarkenv/lib/python3.9/site-packages/omark/omark.py", line 185, in launcher
    get_omamer_qscore(omamerfile, dbpath, outdir, taxid, original_FASTA_file = original_fasta, isoform_file=isoform_file)
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/kodalivk/.conda/envs/omarkenv/lib/python3.9/site-packages/omark/omark.py", line 65, in get_omamer_qscore
    placements = spd.get_present_lineages(full_match_data, hog_tab, tax_tab, tax_buff, sp_tab, chog_buff)  
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/kodalivk/.conda/envs/omarkenv/lib/python3.9/site-packages/omark/species_determination.py", line 56, in get_present_lineages
    all_plac  = get_likely_spec(t,filter_all_taxa_perc,filter_all_tax, tax_to_spec, proportion_hog_dup)
  File "/panfs/pan1.be-md.ncbi.nlm.nih.gov/gpipe/home/kodalivk/.conda/envs/omarkenv/lib/python3.9/site-packages/omark/species_determination.py", line 124, in get_likely_spec
    cur_sp = tax_to_spec[t.name.encode()]
KeyError: b'LUCA'
Closing remaining open files:../omamer_databases/Metazoa.h5...done

Can you please help me with the error? Am I supposed to use a different database? It is unclear to me what the difference between OMAmerDB mentioned in the OMArk README and the other databases (such as LUCA.h5, Metazoa.h5, etc) mentioned in the omamer README page is. In this instance, I have used the latest version of Metazoa.h5.

YanNevers commented 1 year ago

Hello @vkkodali,

Thank you very much for reporting this issue.

I updated the README to make it clear how the OMAmerDB available through omark's (Zenodo)[https://zenodo.org/record/7359861#.Y_p493_MJQo] relate to the ones available through the OMA Browser, described in the omamer README. To sum it up, the OMAmerDB we made available through Zenodo is a copy of the LUCA.h5 database from the December2021 release of the OMA database (The current OMA release is from November 2022). I apologize for the confusion and I hope this updates clear it up, we will consider additional steps to make these differences more clear fin the future.

We recommend using omark with an OMAmer database constructed from all the species in the OMA database ( so-called LUCA.h5 database), which allows the species and contamination detection module to function as expected. Other databases (Metazoa.h5, Viridiplantea.h5, Primates.h5) only cover species from specific taxa and will only allow omark to detect species from these taxa.

I believe using either the above mentioned (OMAmerDB)[https://zenodo.org/record/7359861/files/OMAmerDB.gz] or (LUCA.h5)[https://omabrowser.org/All/LUCA.h5] from the OMA Browser should allow you to run the omark pipeline as expected with no more errors.

However, OMArk should in principle be also compatible with more specific databases. I will work on fixing this issue you reported as soon as possible and will keep you updated.

Thank you for your patience and let us know if you need any additional assistance.

vkkodali commented 1 year ago

Thank you @YanNevers, using LUCA.h5 for both omamer search and omark commands worked! Going forward, I will only use the latest version of LUCA.h5 for my work.

Do you know if there's any versioning of the LUCA.h5 database? How frequently is it updated? Is there any way to extract the metadata, especially the version information, from the database file itself? It will help when reporting results to specify which version of LUCA.h5 was used instead of saying something like "downloaded from www.omabrowser.org on 2023-02-24".

YanNevers commented 1 year ago

I am glad to read omark is now working for you!

To answer your questions, we aim to have release new release of the OMA database every 6 to 12 months, although specific time may vary depending on the amount of computation needed.They are generally announced through the OMA Browser main page, our newsletter, RSS feed and social media. You can access those in the bottom right corner of the Browser home page. Corresponding OMAmer databases will be released at the same time.

As for now, there is no explicit versioning of the LUCA.h5 database, but there is still a way to confirm which version of the dataset you are using. In the OMAmer output file, the first letter of the HOG identifiers is release specific since the Apr2021 release of OMA. Thus given the first letter you can infer:

However, we agree we need a more straightforward way to access metadata from the database itself. We plan to soon issue an update to OMAmer where release information and the parameters used to construct the database will be directly available through the command line. Additionally, we will add a feature where the databases issued from a main release of OMA will issue a warning if a more recent database is available online.

YanNevers commented 1 year ago

I just issued a patched omark release (0.2.5) where the initial issue (omark not working with Metazoa.h5 database) is fixed. This should have no impact over the use of LUCA.h5 database but will make sure omark is able to run with any omamer database going forward, as would be expected. Thank you again for reporting this!

YanNevers commented 10 months ago

As of previous comment, detailed OMAmer metadata is now available with the command omamer info --db DB . Hopefully, it will hlep with reporting results.