DessimozLab / OMArk

GNU Lesser General Public License v3.0

Homininae prebuilt database - Problem #30

Open rlibouba opened 9 months ago

rlibouba commented 9 months ago

Hi, I'm working on the French Galaxy instance, and we want to integrate OMArk and add pre-built OMAmer databases. While testing the available databases, I encountered a problem with the Homininae.h5 database.

Here's the error I get:

WARNING: The selected ancestral lineage is from the phylum rank or higher which means the target species' taxonomic division is not well sampled in our database. The results may lack accuracy.
Traceback (most recent call last):
  File "/home/rlibouba/.conda/envs/mamba/envs/omark/bin/omark", line 51, in <module>
    omark.launcher(arg)
  File "/home/rlibouba/.conda/envs/mamba/envs/omark/lib/python3.10/site-packages/omark/omark.py", line 265, in launcher
    get_omamer_qscore(omamerfile, dbpath, outdir, taxid, original_FASTA_file = original_fasta, isoform_file=isoform_file, taxonomic_rank=taxonomic_rank)
  File "/home/rlibouba/.conda/envs/mamba/envs/omark/lib/python3.10/site-packages/omark/omark.py", line 118, in get_omamer_qscore
    LOG.info('Ancestral lineage is '+closest_corr)
TypeError: can only concatenate str (not "NoneType") to str
/home/rlibouba/.conda/envs/mamba/envs/omark/lib/python3.10/site-packages/tables/file.py:113: UnclosedFileWarning: Closing remaining open file: ../db/Homininae.h5
  warnings.warn(UnclosedFileWarning(msg))

Can you help? This error was obtained with this command line:

omark -f file.omamer -d /db/Homininae.h5 -o omark_output

YanNevers commented 9 months ago

Hello @rlibouba ,

Thanks for reporting this error. This is an unexpected issue that seems to happen when using a database covering a clade with too few species in our reference database, which conflicts with OMArk's inner workings. I'll try to make OMArk respond with a more informative error message. Nevertheless, I would recommend only using OMArk with broader databases. This is because OMArk needs broad enough taxonomic coverage to check what taxonomic level proteins are assigned to and to make a confident Consistency assessment. Unless there are compute resource limitations on your infrastructure, LUCA.h5 will always be the best choice. Otherwise, Metazoa.h5 and Viridiplantae.h5 would still enable some of OMArk's features. Beyond those, we recommend against using the other clade databases, which are made available for other OMAmer use cases.
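To illustrate the failure and the kind of guard that could produce a clearer message, here is a minimal sketch. It is not OMArk's actual code: only the name `closest_corr` and the logged string come from the traceback, while `report_lineage`, the logger name, and the error text are hypothetical.

```python
import logging

LOG = logging.getLogger("omark_sketch")  # hypothetical logger name


def report_lineage(closest_corr):
    """Log the selected ancestral lineage, failing clearly if none was found.

    Hypothetical helper; only `closest_corr` is taken from the traceback.
    """
    if closest_corr is None:
        # 'Ancestral lineage is ' + None raises the TypeError seen in the
        # report, so stop early with a message that points to the likely cause.
        raise ValueError(
            "No ancestral lineage could be selected from this database; "
            "try a broader one such as LUCA.h5."
        )
    message = "Ancestral lineage is " + closest_corr
    LOG.info(message)
    return message
```

With a guard like this, a run against a database that is too narrow would stop with an explicit message instead of a `TypeError` raised from the logging call.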

Cheers, Yannis

rlibouba commented 9 months ago

Hello @YanNevers ,

Thank you for your explanations and your help.

Have a nice day, Romane

rlibouba commented 9 months ago

Hello @YanNevers ,

I'm coming back to you after your reply yesterday. Having discussed the problem with my colleagues, I can confirm that we do have resource limits for our tests. Ideally, the database would be no larger than 5 MB, but it would have to work with OMArk. Have you opened an issue on the OMArk GitHub that I could participate in?

Have a nice day, Romane

YanNevers commented 4 months ago

Dear @rlibouba,

I apologize, I missed your latest message and this issue slipped through the cracks. If you still need a small test database, I can explore the available options and try to come up with a quick solution to this issue.