running OmaStandalone without network access

EricDeveaud commented 2 years ago

Hello, our cluster compute nodes do not have access to the internet, so oma fails when it first tries to download http://purl.obolibrary.org/obo/go.obo

I could execute a run on a machine that has internet access and provide the resulting $HOME/.cache/oma/GOdata.drw to our users.

But I saw that darwinlib/Taxonomy also performs a download from http://www.uniprot.org/taxonomy/?query=*&compress=yes&format=tab. Is there a way I can download and process (ConvertRawFile) this file and provide the resulting UniProtTaxonomy.drw file to our users, so that they are able to run oma without internet access? This way oma will be really Standalone ;-)

regards

Eric

edit typo

EricDeveaud commented 2 years ago

NB: I am running https://omabrowser.org/standalone/OMA.2.5.0.tgz

alpae commented 2 years ago

Hi Eric,

the UniProt taxonomy is only needed in very special config settings, e.g. with DoHierarchicalGroups := 'top-down';, so usually it is not needed. If you want to make sure that OmaStandalone is able to run without access to the internet in any configuration, you can download and convert the taxonomy with the following command:

bin/omadarwin -E << EOF
     datadirname := getenv('HOME').'/.cache/oma2';
     CallSystem('mkdir -p '.datadirname);
     GOdownload();
     TaxonomyDownload();
EOF

This should create all the necessary files in the ~/.cache/oma folder of the current user (Gene Ontology and UniProt Taxonomy).
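On a cluster without network access, the files produced by the command above can then be copied to a location the compute nodes can read. A minimal shell sketch of that step, assuming the source and target directories are passed in explicitly; the function name stage_oma_aux and the paths in the example are hypothetical and not part of OmaStandalone:

```shell
#!/bin/sh
# Hypothetical helper: copy the contents of an OMA cache directory
# (regular files and symlinks) into a shared directory.
# $1: source cache directory, $2: target directory.
stage_oma_aux() {
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    mkdir -p "$dst" || return 1
    # -R: recurse, -P: copy symlinks as symlinks, -p: keep modes and timestamps
    cp -RPp "$src"/. "$dst"/
}

# Example (placeholder paths; use the directory that was set as datadirname):
# stage_oma_aux "$HOME/.cache/oma" /shared/oma/data
```

Copying symlinks as symlinks keeps the layout of the cache directory unchanged in the shared copy.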

Cheers Adrian

EricDeveaud commented 2 years ago

Hi Adrian,

many thanks for the input.

best regards

Eric

EricDeveaud commented 2 years ago

Hi Adrian, it works well: I was able to get the necessary files. Thanks again.

I have a few more questions:

1) from ToyExample/parameters.drw one can read:

# Folder where auxillary data (e.g. GeneOntology definitions, etc)
# will be stored. The folder must be writable by the user. If not set
# or commented, the default will be ~/.cache/oma/
AuxDataPath := 'data/';

If I understand correctly, when AuxDataPath is set in the parameters.drw file it supersedes the datadirname set in $omadir/darwinlib/darwinit. Is this right?

2) Where it says "The folder must be writable by the user": are there any files other than GOdata.drw.gz and UniProtTaxonomy.drw.gz that will be stored in this directory? I ask because the installation on our cluster lives on a read-only shared file system, so I must be sure that I can host the files there. If not, I will have to provide some way for users to store the required files.

best regards

Eric

alpae commented 2 years ago

Hi Eric,

indeed, when you set AuxDataPath in the parameter file, this supersedes the default datadirname. The two files (and two symlinks) are the only files that are used from this folder. So in principle I think it would be ok to set an absolute path for AuxDataPath in the parameters.drw file in the installation folder. When users generate a new parameter file for their project with oma -p, that path will already be set and used.

However, maybe it would be more sensible to have an environment variable that can be set as a default. Then the path to these auxiliary data would be determined in this order:

1) the path set in the parameter file (AuxDataPath parameter)
2) the path from an environment variable, if set
3) ~/.cache/oma as fall-back

Would that make sense from your point of view, and would it simplify setting up the package on an HPC system?
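The lookup order proposed above can be sketched in shell as follows. This is only an illustration of the precedence, not existing OmaStandalone behaviour; the function name resolve_aux_data_path and the environment variable name OMA_AUX_DATA are hypothetical:

```shell
#!/bin/sh
# Sketch of the proposed lookup order (an assumption, not current behaviour).
# $1: value of AuxDataPath from the parameter file, empty if not set.
# OMA_AUX_DATA is a hypothetical environment variable name.
resolve_aux_data_path() {
    if [ -n "$1" ]; then
        # 1) an explicit setting in the parameter file wins
        printf '%s\n' "$1"
    elif [ -n "${OMA_AUX_DATA:-}" ]; then
        # 2) otherwise use the environment variable, if set and non-empty
        printf '%s\n' "$OMA_AUX_DATA"
    else
        # 3) otherwise fall back to the per-user cache
        printf '%s\n' "$HOME/.cache/oma"
    fi
}
```

With this order, a site-wide default only needs one exported variable, while individual projects can still override it in their own parameter file.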

Cheers Adrian

EricDeveaud commented 2 years ago

Thanks Adrian, I ended up with the same scheme that you describe.

The "default" parameters.drw I provide sets AuxDataPath like this:

AuxDataPath := getenv('OMA_DATA');

In the OMA_DATA path we provide the GOdata.drw.gz and UniProtTaxonomy.drw.gz files,

and it seems to work.
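For such a setup, a small check before starting a run can confirm that the directory named by OMA_DATA really holds the two files. A minimal sketch, assuming only the two file names mentioned in this thread; the function name check_oma_data is hypothetical:

```shell
#!/bin/sh
# Check that a directory holds the two auxiliary files named in this thread.
# $1: directory to check (e.g. the value of OMA_DATA).
check_oma_data() {
    dir=$1
    if [ -z "$dir" ]; then
        echo "no data directory given (is OMA_DATA set?)" >&2
        return 1
    fi
    for f in GOdata.drw.gz UniProtTaxonomy.drw.gz; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing or unreadable: $dir/$f" >&2
            return 1
        fi
    done
    return 0
}

# Example: check_oma_data "${OMA_DATA:-}" && echo "auxiliary data found"
```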

Can you give me some information about the ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/docs/speclist.txt URL used in the darwinlib/TaxTools library?

And finally, are you the author of Darwin? I would suggest embedding a private copy of the darwin libs in which GetTmpDir() from Wrappers/Common is used instead of having '/tmp/' hardcoded in multiple places. Many clusters out there set a TMPDIR environment variable that points to a fast scratch location instead of the usual /tmp.
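The pattern being suggested, shown in shell rather than Darwin: honour TMPDIR when it is set and fall back to /tmp otherwise. This is a sketch of the idea only and makes no claim about how GetTmpDir() is implemented; the function name make_scratch_dir and the oma.XXXXXX template are hypothetical:

```shell
#!/bin/sh
# Create a private scratch directory under $TMPDIR, or under /tmp when
# TMPDIR is unset or empty, and print its path.
make_scratch_dir() {
    base=${TMPDIR:-/tmp}
    mktemp -d "$base/oma.XXXXXX"
}

# Example: scratch=$(make_scratch_dir) && echo "working in $scratch"
```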

best regards

Eric

alpae commented 2 years ago

Hi Eric,

yes, that seems like a good setup.

regarding your darwin questions: yes, I am a co-author of that language. The darwinlib/TaxTools functionality isn't needed by OmaStandalone at all, so you won't need to download that data.

About the hardcoded /tmp dir: where did you find that? I don't think it is used anywhere. The GetTmpDir() function actually already uses the TMPDIR environment variable...

EricDeveaud commented 2 years ago

Adrian,

thanks for your feedback

Regarding the use of hardcoded /tmp in darwinlib, you can find it just by doing:

rpm_maker:src/OMA > wget -q https://omabrowser.org/standalone/OMA.2.5.0.tgz 
rpm_maker:src/OMA > tar xf OMA.2.5.0.tgz 
rpm_maker:src/OMA > cd OMA.2.5.0/darwinlib/
rpm_maker:OMA.2.5.0/darwinlib > grep -Rl '/tmp' 
Wrappers/Common
FigPlot
Plot2Gif
ParExecSlave
FileConv
Descriptions
Server/MassDynSearch
Server/TreeGen
Server/MassSearch
Server/TreeConstruction
Server/AllAll
Server/PepPepSearch
Server/TestNewFunction1
Server/MultAlign
Server/cbrg.server
Server/Gendb
Server/AllAllDB
Server/TestNewFunction
Server/mail_handler
Server/PredictGenes
Server/NuclPepSearch
Server/EvolutionaryAnalysis
Ontology
ParExec2
MBA_Toolkit
Taxonomy
MySQL
IPC
DBTools
HelpText.txt

I guess some of these library files are not used by OMA, but some are ;-)

regards

Eric

alpae commented 2 years ago

Hi Eric,

indeed, there are quite a few places in the darwinlib, but in OmaStandalone only the functions in Taxonomy and Ontology are used. I will make an attempt to update these functions before the next OmaStandalone release. Thanks for your valuable feedback!
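To see exactly which lines in those two files mention /tmp, the earlier grep can be narrowed to them. A minimal sketch, assuming it is given the path to the unpacked darwinlib directory; the function name list_tmp_refs is hypothetical:

```shell
#!/bin/sh
# List every line mentioning /tmp in the given library files.
# $1: darwinlib directory; remaining arguments: file names inside it.
list_tmp_refs() {
    libdir=$1
    shift
    for f in "$@"; do
        # -n prints line numbers; the extra /dev/null operand makes grep
        # prefix each match with the file name. No match is not an error here.
        grep -n '/tmp' "$libdir/$f" /dev/null || true
    done
}

# Example: list_tmp_refs OMA.2.5.0/darwinlib Taxonomy Ontology
```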

Best wishes Adrian

DessimozLab / OmaStandalone

running OmaStandalone without network access #5