galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License

omamer datamanager broken #6585

Open bernt-matthias opened 1 week ago

bernt-matthias commented 1 week ago

data_manager_omamer tests fail in weekly CI:

Error downloading https://omabrowser.org/All/Primates-v2.0.0.h5: 404 Client Error: Not Found for url: https://omabrowser.org/All/Primates-v2.0.0.h5

ping @rlibouba @SaimMomin12

They seem to have switched the URL schema: https://omabrowser.org/oma/archives/

Could someone contact them and ask which of those corresponds to 2.0?

rlibouba commented 4 days ago

Hello, it's done: https://github.com/DessimozLab/omamer/issues/44

Have a nice day!

sinamajidian commented 4 days ago

Sorry for the inconvenience. As Adrian mentioned in the OMAmer GitHub issue, this is the correct link for Primates: https://omabrowser.org/All/Primates.h5

Note that Primates.h5 contains the gene families at the primate level. For general use of OMAmer, i.e. mapping input proteins to all protein (gene) families, the LUCA database is preferred. It is available at https://omabrowser.org/All/LUCA.h5, but it is much bigger and OMAmer needs more resources to run with it.

Thank you guys for adding it to Galaxy. Best, Sina

bernt-matthias commented 3 days ago

Thanks for the info. I'm not yet 100% satisfied. The link you shared carries no version information, and we should try to keep that if possible.

In the archives (https://omabrowser.org/oma/archives/) we now find two levels of versioning. There are links like https://omabrowser.org/All.Jul2023/Primates-v2.0.0.h5 which carry both a date and the 2.0.0 version. Do you know how this works? Presumably there is a version 2.0.0 for other dates as well?

sinamajidian commented 3 days ago

Good question. Let me explain the history.

From 2021 to 2023 there was only one "version" for this h5 file. For each OMA browser update (every year or so), new genes are added to the gene families. In such cases the data structure inside the h5 was not changed.

In Oct 2023, there was a major update of database formatting in OMAmer. This new version of OMAmer and the corresponding database is called v2 (see release v2.0.0 https://github.com/DessimozLab/omamer/releases).

Each of the two major versions of the OMAmer software (v0 and v2) works only with the corresponding version of the OMAmer database. For a transition period, we provided both versions of the database (v0 and v2) in h5 format on the OMA browser (this is the case for Jul 2023 and Nov 2022). The goal was that people still using the old version of OMAmer (v0) could access the old OMAmer db h5 file with the old data structure.

Now the old OMAmer version (software and database) is obsolete. So, from Jul 2024, which is the current release of the OMA browser, the database is simply named LUCA.h5. It has the most up-to-date data structure (v2) and gene content (Jul 2024).

A user can check the version of an OMAmer db h5 file with omamer info -d, which prints the version:

$ wget https://omabrowser.org/All.Jul2023/Primates-v2.0.0.h5
$ omamer info -d Primates-v2.0.0.h5
================================================================================
  create timestamp       :              2023-10-20T13:39:05.024706
  database hash          :        27645bb5ca4c995ad24f8459dae3c107
  filter logic           :                                      OR
  include younger fams   :                                    True
  min fam completeness   :                                     0.5
  min fam size           :                                       6
  omamer version         :                                   2.0.0
  root level             :                                Primates
  source                 :                       OMA / All.Jul2023
  k-mer length           :                                       6
  alphabet size          :                                      21
  nr species             :                                      24
  hidden taxa            :                                       -
================================================================================
$ wget https://omabrowser.org/All.Jul2024/Primates.h5 
$ omamer info -d Primates.h5 
================================================================================
  create timestamp       :              2024-08-09T09:56:55.358634
  database hash          :        24aef3e3f6cc9015fd743a5cf867eee1
  filter logic           :                                      OR
  include younger fams   :                                    True
  min fam completeness   :                                     0.5
  min fam size           :                                       6
  omamer version         :                                   2.0.3
  root level             :                                Primates
  source                 :                       OMA / All.Jul2024
  k-mer length           :                                       6
  alphabet size          :                                      21
  nr species             :                                      24
  hidden taxa            :                                       -
================================================================================
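
To compare such reports programmatically, the output can be turned into a dictionary. The following is a minimal sketch, assuming the report keeps the "key : value" layout shown above; `parse_omamer_info` is a hypothetical helper, not part of OMAmer, and the sample text is an excerpt of the first report.

```python
def parse_omamer_info(text: str) -> dict:
    """Parse the 'key : value' report printed by `omamer info -d <db>`.

    Hypothetical helper: it only assumes the layout shown in this thread.
    """
    fields = {}
    for line in text.splitlines():
        # Separator lines ("====") and blank lines contain no colon; skip them.
        if ":" not in line:
            continue
        # Split on the first colon only, so timestamps keep their own colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Excerpt of the report for Primates-v2.0.0.h5 quoted above.
SAMPLE = """\
================================================================================
  create timestamp       :              2023-10-20T13:39:05.024706
  omamer version         :                                   2.0.0
  root level             :                                Primates
  source                 :                       OMA / All.Jul2023
================================================================================
"""

info = parse_omamer_info(SAMPLE)
print(info["omamer version"], "|", info["source"])
```

The "source" field is the piece that ties a file back to a dated OMA release, and the "omamer version" field gives the database format version.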

When an OMAmer db with the old data structure is used, omamer info -d says:

RuntimeError: Database major version mismatch: DB 0.2.5 / OMAmer 2.0.0
Closing remaining open files:LUCA_C.h5...done
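
The error suggests that compatibility is decided on the major version number. A minimal sketch of such a check, useful for filtering out v0 databases before offering them to users; this is not OMAmer's own code, `is_compatible` is a hypothetical name, and treating any equal major version as compatible is an assumption.

```python
def major(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.split(".")[0])


def is_compatible(db_version: str, omamer_version: str) -> bool:
    """Assumption: a database is usable when the major versions are equal."""
    return major(db_version) == major(omamer_version)


# The combination from the error message above is rejected.
print(is_compatible("0.2.5", "2.0.0"))
# A v2 database with v2 software passes the check.
print(is_compatible("2.0.0", "2.0.0"))
```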

Anyway, I agree that the naming is a bit confusing, especially during the transition period. So if you want to use the current OMAmer software, you can use either https://omabrowser.org/All.Jul2024/Primates.h5 or https://omabrowser.org/All/Primates.h5 (both are v2 databases). For a long-term link that refers to specific gene content, the first link would be best.
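
The two kinds of link differ only in the folder name. A minimal sketch of building either form, assuming the URL layout seen in the links in this thread; `omamer_db_url` and its parameters are hypothetical and not part of any existing tool.

```python
BASE_URL = "https://omabrowser.org"


def omamer_db_url(file_stem: str, release: str = None) -> str:
    """Build a download URL for an OMAmer database file.

    file_stem: file name without ".h5", e.g. "Primates" or "Primates-v2.0.0".
    release:   dated OMA release such as "Jul2024"; None selects the
               unversioned "All" folder, which follows the current release.
    """
    folder = f"All.{release}" if release else "All"
    return f"{BASE_URL}/{folder}/{file_stem}.h5"


# Long-term link pinned to specific gene content.
print(omamer_db_url("Primates", "Jul2024"))
# Moving link that always points at the current release.
print(omamer_db_url("Primates"))
```

Pinning the release keeps a download reproducible, whereas the unversioned folder changes content with each OMA browser update.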

I hope this clarifies the situation.

bernt-matthias commented 3 days ago

So, our data manager was accessing https://omabrowser.org/All/{dataset}, where dataset is one of

Except for the last two these are also installed on usegalaxy.eu

@rlibouba (@bgruening): Can you run omamer info on these files? Then we will know the source, i.e. the data release (I hope), which we need to determine the current links for this data.

@sinamajidian so we could drop the one dataset with v0.2.5? Any comments on the others?

bernt-matthias commented 3 days ago

For the data manager, we should represent the date (from the link) somewhere.
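
One way to obtain that date is to read it from the dated folder in the link. A minimal sketch, assuming folders are named like "All.Jul2023" as in the links above; `release_from_url` is a hypothetical helper, and how the result is stored in the data table is left open.

```python
import re

# Month abbreviations as they appear in the archive folder names.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches the dated folder, e.g. "/All.Jul2023/".
RELEASE_RE = re.compile(r"/All\.([A-Za-z]{3})(\d{4})/")


def release_from_url(url: str):
    """Return the release as 'YYYY-MM', or None for undated links."""
    match = RELEASE_RE.search(url)
    if match is None or match.group(1) not in MONTHS:
        return None
    month, year = match.groups()
    return f"{year}-{MONTHS[month]:02d}"


print(release_from_url("https://omabrowser.org/All.Jul2023/Primates-v2.0.0.h5"))
print(release_from_url("https://omabrowser.org/All/Primates.h5"))
```

The 'YYYY-MM' form sorts chronologically as plain text, which may be convenient when several releases of the same dataset are installed side by side.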

sinamajidian commented 2 days ago

Assuming the OMAmer v2 databases were downloaded in Nov 2023 (when the code was first created), the files that were then available at https://omabrowser.org/All/LUCA-v2.0.0.h5 should now be at https://omabrowser.org/All.Jul2023/LUCA-v2.0.0.h5

Yes, I think the ones with v0.2.5 need to be removed. (I can also see a similar discussion here.) I noticed you are using OMAmer as part of the OMArk pipeline. For this purpose, Saccharomyceta should not be used: one goal of OMArk is to detect bacterial contamination in eukaryotic proteomes, so OMArk needs to know the LUCA proteomes (from the OMAmer db).

Related to the discussion, I also noticed that LUCA-v0.2.5 is an option on usegalaxy.eu for running OMArk. I tried it out and, as expected, it didn't work with OMArk due to the version mismatch.

Also, note this quote from the OMArk github (@YanNevers).

For all OMArk features to work correctly, it is recommended that this database covers a wide range of species. Thus we recommend using one constructed from the whole OMA database, often called LUCA.h5 . Using a database for a more restricted taxonomic range (Metazoa, Viridiplantae, Primates) would limit the ability of OMArk to detect contamination or to identify sequences of species that belong outside this range.

I'm not sure why a Galaxy user would want to run OMArk with smaller databases, so I suggest keeping only LUCA-v2.0.0.h5.