EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Data Release : Make Solr Indexer handle nextflow resume on failure #1383

Closed sajo-ebi closed 11 hours ago

sajo-ebi commented 1 month ago

Currently, as part of the data release, whenever the Solr indexing fails partway through, we have to trigger the nextflow resume command manually. This is inefficient and costs a lot of productivity, especially when the DR fails outside working hours or at weekends. We need to handle the resume command in the Python caller of nextflow. Nextflow is currently invoked using the command below:

indexer-manager --newInstance spotrel --oldInstance spotpub --solrHost http://gwas-garfield --solrCore gwas --solrPort 8983 --wrapperScript /hps/software/users/parkinso/spot/gwas/prod/sw/solr-indexer/new_solr_wrapper.sh --logFolder /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing --fullIndex
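
A minimal sketch of what handling the resume in the Python caller could look like, assuming the caller launches nextflow through `subprocess`. The function name `run_with_resume` and the `runner` parameter are hypothetical and are not taken from gwas-utils; only the `-resume` option itself is a real `nextflow run` option.

```python
import subprocess


def run_with_resume(cmd, runner=subprocess.run):
    """Run cmd once; if it exits non-zero, run it once more with -resume.

    Returns the number of attempts made (1 or 2). If the second attempt
    also fails, subprocess.CalledProcessError propagates to the caller.
    `runner` is a hypothetical hook so the logic can be tested without
    launching a real process.
    """
    try:
        runner(cmd, check=True)
        return 1
    except subprocess.CalledProcessError as err:
        print(f"First attempt failed with error: {err} Retrying with -resume option.")
        # Append -resume so nextflow reuses the cached results of completed tasks.
        runner(list(cmd) + ["-resume"], check=True)
        return 2
```

It would be called with the nextflow command as a list, for example `run_with_resume(["nextflow", "run", "solr_indexing.nf"])`, where the script path is a placeholder. The sketch retries only once, so a failure of the resumed run still needs manual attention.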

karatugo commented 1 month ago

To discuss: whether the test can be run during this data release or before.

karatugo commented 1 month ago

How the test is conducted with fake data: in the conda env called goci-1383, install the new gwas-utils package locally (git clone and pip install). Then run the following command and inspect the logs.

(goci-1383) [spotbot@codon-dm-01 goci-1383]$ indexer-manager --logFolder /hps/nobackup/parkinso/spot/gwas/scratch/goci-1383
Running nextflow: false
First attempt failed with error: Command '['false']' returned non-zero exit status 1.. Retrying with -resume option.

sprintell commented 1 month ago

Wait for the next data release to know if this works

karatugo commented 1 month ago

This was included in the last data release but as far as I can see there was no failed step at Solr export. So I doubt this was used.

sajo-ebi commented 2 weeks ago

@karatugo this was executed in the last data release. Can you check if it worked correctly?

karatugo commented 1 week ago

According to the logs, I can confirm that -resume was triggered, but it did not behave correctly.

$ cat /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing/nextflow.log
Sep-09 14:01:26.911 [main] DEBUG nextflow.cli.Launcher - Setting http proxy: ProxyConfig[protocol=http; host=www-proxy.ebi.ac.uk; port=3128]
Sep-09 14:01:27.339 [main] DEBUG nextflow.cli.Launcher - Setting https proxy: ProxyConfig[protocol=https; host=www-proxy.ebi.ac.uk; port=3128]
Sep-09 14:01:27.339 [main] DEBUG nextflow.cli.Launcher - $> nextflow -log /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing/nextflow.log run /hps/software/users/parkinso/spot/gwas/anaconda3/envs/gwas-utils/nf/solr_indexing.nf --job_map_file /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing/job_map.csv -resume
Sep-09 14:01:27.616 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 21.10.6

Unfortunately this caused another error in Nextflow.

Sep-09 14:01:29.890 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 2; maxThreads: 1000
Sep-09 14:01:29.955 [main] ERROR nextflow.cli.Launcher - Unable to acquire lock on session with ID 23e44f5f-0ac7-43d8-ba29-5c4db15a6b6a

Common reasons of this error are:
 - You are trying to resume the execution of an already running pipeline
 - A previous execution was abruptly interrupted leaving the session open

ala-ebi commented 1 week ago

The run where the indexing job failed despite resume kicking in was caused by the indexer itself: it was expecting a new field in Solr that was absent at the time of the run, so that failure is expected. In the last run there were failures, and on checking the logs I saw that the resume ran and, after a while, the indexer job finished. So from what I can see, it's working.
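
For checking the logs without reading them by hand, here is a rough sketch that pulls the ERROR lines out of a nextflow.log and flags the session lock error. It assumes the line layout seen in the log excerpts in this thread (timestamp, thread, level, logger, message); the helper names `find_errors` and `hit_session_lock` are made up for illustration.

```python
import re

# Matches lines such as:
# Sep-09 14:01:29.955 [main] ERROR nextflow.cli.Launcher - Unable to acquire lock ...
LOG_LINE = re.compile(
    r"^(?P<ts>\w{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+) - "
    r"(?P<message>.*)$"
)

LOCK_ERROR = "Unable to acquire lock on session"


def find_errors(lines):
    """Return the message part of every ERROR-level log line."""
    errors = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") == "ERROR":
            errors.append(match.group("message"))
    return errors


def hit_session_lock(lines):
    """True if any ERROR line reports the session lock problem."""
    return any(LOCK_ERROR in message for message in find_errors(lines))
```

Opening the log with `open(path)` and passing the file object to `hit_session_lock` is enough, since it only iterates over lines. Lines that do not match the assumed layout, such as continuation lines, are skipped rather than treated as errors.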

karatugo commented 1 day ago

Discussed repeated resuming with Ala. We agreed on the changes described in the ticket below. It does not seem urgent, so I am only creating the ticket for now.

See https://github.com/EBISPOT/goci/issues/1431