Open bernt-matthias opened 1 week ago
Hello, it's done: https://github.com/DessimozLab/omamer/issues/44
Have a nice day!
Sorry for the inconvenience. As Adrian mentioned in the OMAmer GitHub issue, this is the correct link for primates: https://omabrowser.org/All/Primates.h5
Note that Primates.h5 includes the gene families at the primate level. For general use of OMAmer, i.e. mapping input proteins to all protein (gene) families, the LUCA database is preferred. It is available at https://omabrowser.org/All/LUCA.h5, but it is much bigger and OMAmer needs more resources to run with it.
Thank you guys for adding it to Galaxy. Best, Sina
Thanks for the info. I'm not yet 100% satisfied. The link that you are sharing has no version information ... and we should try to keep it if possible.
In the archives (https://omabrowser.org/oma/archives/) we now find 2 levels of versioning. There are links like https://omabrowser.org/All.Jul2023/Primates-v2.0.0.h5, which carry both a date and the 2.0.0 version. Do you know how this works? Is there probably a version 2.0.0 for other dates as well?
Good question. Let me explain the history.
From 2021 to 2023 there was only one "version" of this h5 file. For each OMA browser update (every year or so), new genes are added to the gene families. In such cases the data structure inside the h5 file was not changed.
In Oct 2023, there was a major update of database formatting in OMAmer. This new version of OMAmer and the corresponding database is called v2 (see release v2.0.0 https://github.com/DessimozLab/omamer/releases).
Each of the two major versions of OMAmer software (v0 and v2) works only with the corresponding version of OMAmer database.
For a transition period, we provided both versions of the database (v0 and v2) in h5 format on the OMA browser (this is the case for Jul 2023 and Nov 2022). The goal was that people who were still using the old version of OMAmer (v0) could access the old OMAmer db h5 file with the old data structure formatting.
Now, the old OMAmer version (software and database) is obsolete. So, from Jul 2024, which is the current release of the OMA browser, we have the database named LUCA.h5. This is the most up-to-date version of both the data structure (v2) and the gene content (Jul 2024).
A user can check the version of an OMAmer db h5 file with omamer info -d, which prints the version:
$ wget https://omabrowser.org/All.Jul2023/Primates-v2.0.0.h5
$ omamer info -d Primates-v2.0.0.h5
================================================================================
create timestamp : 2023-10-20T13:39:05.024706
database hash : 27645bb5ca4c995ad24f8459dae3c107
filter logic : OR
include younger fams : True
min fam completeness : 0.5
min fam size : 6
omamer version : 2.0.0
root level : Primates
source : OMA / All.Jul2023
k-mer length : 6
alphabet size : 21
nr species : 24
hidden taxa : -
================================================================================
$ wget https://omabrowser.org/All.Jul2024/Primates.h5
$ omamer info -d Primates.h5
================================================================================
create timestamp : 2024-08-09T09:56:55.358634
database hash : 24aef3e3f6cc9015fd743a5cf867eee1
filter logic : OR
include younger fams : True
min fam completeness : 0.5
min fam size : 6
omamer version : 2.0.3
root level : Primates
source : OMA / All.Jul2024
k-mer length : 6
alphabet size : 21
nr species : 24
hidden taxa : -
================================================================================
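For a script that needs the version and source fields programmatically, the listing above can be read as plain "key : value" lines. The following is a minimal Python sketch under that assumption; `parse_omamer_info` is a hypothetical helper, not part of OMAmer, and the sample text is a shortened copy of the output shown above.

```python
def parse_omamer_info(text):
    """Parse 'key : value' lines as printed by `omamer info -d` into a dict.

    Assumption: the layout matches the transcript in this thread.
    """
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '=====' separator rows.
        if not line or set(line) == {"="}:
            continue
        # Split on the first colon only; timestamps contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


sample = """\
================================================================================
create timestamp : 2024-08-09T09:56:55.358634
omamer version : 2.0.3
root level : Primates
source : OMA / All.Jul2024
================================================================================
"""

info = parse_omamer_info(sample)
print(info["omamer version"], "|", info["source"])
```

The `source` field is the part that identifies the gene content (the OMA release), while `omamer version` identifies the data structure.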
When an OMAmer db with the old data structure is used, omamer info -d says:
RuntimeError: Database major version mismatch: DB 0.2.5 / OMAmer 2.0.0
Closing remaining open files:LUCA_C.h5...done
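The rule behind this error (software and database must share a major version) can be sketched in a few lines of Python. This is only an illustration of the comparison implied by the error message, not OMAmer's actual implementation; `major` and `db_matches_software` are hypothetical names.

```python
def major(version):
    """Return the major component of a dotted version string such as '2.0.3'."""
    return int(version.split(".")[0])


def db_matches_software(db_version, software_version):
    """True when the database and the software share the same major version."""
    return major(db_version) == major(software_version)


# The case from the error message above: DB 0.2.5 with OMAmer 2.0.0.
print(db_matches_software("0.2.5", "2.0.0"))  # False
# A v2 database with v2 software: same major version.
print(db_matches_software("2.0.3", "2.0.0"))  # True
```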
Anyway, I agree that the naming is a bit confusing, especially during the transition period. So if you want to use the current OMAmer software, you can use either https://omabrowser.org/All.Jul2024/Primates.h5 or https://omabrowser.org/All/Primates.h5 (both are v2 db). For a long-term link referring to specific gene content, the first link would be best.
I hope this clarifies the situation.
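To make the difference between the two link styles concrete, here is a small Python sketch that builds either the release-pinned or the unpinned URL. The helper `omamer_db_url` and its parameters are hypothetical; the URL layout is inferred only from the links quoted in this thread and may not hold for every release.

```python
BASE = "https://omabrowser.org"


def omamer_db_url(filename, release=None):
    """Build a download URL for an OMAmer database file.

    `release` is an OMA release tag such as 'Jul2024'. Without it, the URL
    points at the unpinned 'All' folder, whose content changes over time.
    """
    folder = f"All.{release}" if release else "All"
    return f"{BASE}/{folder}/{filename}"


# Pinned to a specific release (stable gene content):
print(omamer_db_url("Primates.h5", release="Jul2024"))
# Unpinned (whatever the current release is):
print(omamer_db_url("Primates.h5"))
```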
So, our data manager was accessing https://omabrowser.org/All/{dataset} where dataset is one of
"Primates-v2.0.0.h5"
"Viridiplantae-v2.0.0.h5"
"Metazoa-v2.0.0.h5"
"LUCA-v0.2.5.h5"
"LUCA-v2.0.0.h5"
"Saccharomyceta.h5"
"Homininae.h5"
Except for the last two, these are also installed on usegalaxy.eu
@rlibouba (@bgruening): Can you run omamer info on these files? Then we will know the source, i.e. the data (I hope), which is needed to determine the current links for this data.
@sinamajidian so we could drop the one dataset with v0.2.5? Any comments on the others?
For the datamanager, we should represent the date (from the link) somewhere.
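As a rough idea of how the date could be taken from the link, the Python sketch below splits a database URL into release tag, taxonomic level, and format version. The regular expression and the name `describe_db_url` are assumptions based only on the URLs quoted in this thread, not on the data manager's actual code.

```python
import re

# Matches e.g. /All.Jul2023/Primates-v2.0.0.h5 and /All/LUCA.h5
URL_PATTERN = re.compile(
    r"/All(?:\.(?P<release>[A-Za-z]{3}\d{4}))?/"
    r"(?P<level>[A-Za-z]+)(?:-v(?P<version>\d+\.\d+\.\d+))?\.h5$"
)


def describe_db_url(url):
    """Return release, level and version found in an OMAmer db URL.

    Missing parts (unpinned folder, unversioned file name) come back as None.
    """
    match = URL_PATTERN.search(url)
    if match is None:
        raise ValueError(f"unrecognised OMAmer db URL: {url}")
    return match.groupdict()


print(describe_db_url("https://omabrowser.org/All.Jul2023/Primates-v2.0.0.h5"))
print(describe_db_url("https://omabrowser.org/All/LUCA.h5"))
```

A `None` release signals an unpinned link, which is exactly the case where the gene content cannot be determined from the URL alone.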
Assuming the OMAmer v2 databases were downloaded in Nov 2023 (when the code was first created), the files that were available at the time at https://omabrowser.org/All/LUCA-v2.0.0.h5 should now be here: https://omabrowser.org/All.Jul2023/LUCA-v2.0.0.h5
Yes, I think the ones with v0.2.5 need to be removed. (I can also see a similar discussion here)
I noticed you are using OMAmer as part of the OMArk pipeline. For this purpose, Saccharomyceta should not be used: one goal of OMArk is to detect bacterial contamination in eukaryotic proteomes, so OMArk needs to know the LUCA proteomes (from the OMAmer db).
Related to the discussion, I also noticed LUCA-v0.2.5 is an option on usegalaxy.eu for running OMArk. I tried it out and, as expected, it didn't work with OMArk due to the version mismatch.
Also, note this quote from the OMArk github (@YanNevers).
For all OMArk features to work correctly, it is recommended that this database covers a wide range of species. Thus we recommend using one constructed from the whole OMA database, often called LUCA.h5 . Using a database for a more restricted taxonomic range (Metazoa, Viridiplantae, Primates) would limit the ability of OMArk to detect contamination or to identify sequences of species that belong outside this range.
I'm not sure why a Galaxy user would want to run OMArk with smaller databases. So I'm suggesting keeping only LUCA-v2.0.0.h5.
data_manager_omamer tests fail in weekly CI: ping @rlibouba @SaimMomin12
They seem to have switched the URL schema: https://omabrowser.org/oma/archives/
Could someone contact them and ask which of those corresponds to 2.0?