Data Release : Make Solr Indexer handle nextflow resume on failure

sajo-ebi commented 1 month ago

Currently, whenever Solr indexing fails partway through a data release, we have to trigger the nextflow resume command manually. This is inefficient and costs us a lot of productivity, especially when the data release fails outside working hours or at weekends. We need to handle the resume in the Python caller of Nextflow. Nextflow is currently invoked with the command below:

indexer-manager --newInstance spotrel --oldInstance spotpub --solrHost http://gwas-garfield --solrCore gwas --solrPort 8983 --wrapperScript /hps/software/users/parkinso/spot/gwas/prod/sw/solr-indexer/new_solr_wrapper.sh --logFolder /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing --fullIndex

karatugo commented 1 month ago

[x] Merge pull request https://github.com/EBISPOT/gwas-utils/pull/175
[x] Release new gwas-utils (github tag and gitlab release) -> https://github.com/EBISPOT/gwas-utils/releases/tag/0.1.27b
[x] Update conda envs with the new version
[x] Test okay with fake data
[x] Use in the next data release

karatugo commented 1 month ago

To discuss: whether the test can be run during this data release or before it.

karatugo commented 1 month ago

How the test is conducted with fake data: in the conda env called goci-1383, install the new gwas-utils package locally (git clone and pip install). Then run the following command and inspect the logs.

(goci-1383) [spotbot@codon-dm-01 goci-1383]$ indexer-manager --logFolder /hps/nobackup/parkinso/spot/gwas/scratch/goci-1383
Running nextflow: false
First attempt failed with error: Command '['false']' returned non-zero exit status 1.. Retrying with -resume option.
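
For readers who want to see the shape of the behaviour tested above, here is a minimal sketch of a retry-once wrapper. It is not the gwas-utils implementation: the function name `run_with_resume` and the injectable `runner` parameter are assumptions made for illustration, and only the two log messages are modelled on the output shown above.

```python
import subprocess


def run_with_resume(command, runner=subprocess.run):
    """Run `command`; if it exits non-zero, run it once more with -resume appended.

    `command` is a list of arguments. `runner` is a hypothetical hook that
    defaults to subprocess.run and exists only so the sketch can be tested.
    """
    print(f"Running nextflow: {' '.join(command)}")
    try:
        runner(command, check=True)
        return "ok"
    except subprocess.CalledProcessError as err:
        print(f"First attempt failed with error: {err}. Retrying with -resume option.")
    # Second and final attempt; a failure here propagates to the caller.
    runner(command + ["-resume"], check=True)
    return "resumed"
```

With `check=True`, a non-zero exit status raises `subprocess.CalledProcessError`, which is what triggers the single retry. A failure of the resumed run is deliberately not caught, so the caller still sees it.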

sprintell commented 1 month ago

Wait for the next data release to know if this works

karatugo commented 1 month ago

This was included in the last data release, but as far as I can see no step failed at Solr export, so I doubt the resume was used.

sajo-ebi commented 2 weeks ago

@karatugo this was executed in the last data release. Can you check whether it worked correctly?

karatugo commented 1 week ago

According to the logs, I can confirm that the retry with -resume was triggered, but the resumed run did not work correctly.

$ cat /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing/nextflow.log
Sep-09 14:01:26.911 [main] DEBUG nextflow.cli.Launcher - Setting http proxy: ProxyConfig[protocol=http; host=www-proxy.ebi.ac.uk; port=3128]
Sep-09 14:01:27.339 [main] DEBUG nextflow.cli.Launcher - Setting https proxy: ProxyConfig[protocol=https; host=www-proxy.ebi.ac.uk; port=3128]
Sep-09 14:01:27.339 [main] DEBUG nextflow.cli.Launcher - $> nextflow -log /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing/nextflow.log run /hps/software/users/parkinso/spot/gwas/anaconda3/envs/gwas-utils/nf/solr_indexing.nf --job_map_file /hps/nobackup/parkinso/spot/gwas/logs/solr_indexing/job_map.csv -resume
Sep-09 14:01:27.616 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 21.10.6

Unfortunately this caused another error in Nextflow.

Sep-09 14:01:29.890 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 2; maxThreads: 1000
Sep-09 14:01:29.955 [main] ERROR nextflow.cli.Launcher - Unable to acquire lock on session with ID 23e44f5f-0ac7-43d8-ba29-5c4db15a6b6a

Common reasons of this error are:
 - You are trying to resume the execution of an already running pipeline
 - A previous execution was abruptly interrupted leaving the session open
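
One way to tell this case apart from an ordinary step failure would be to scan the Nextflow log for the lock message before deciding what to do next. The helper below is only a sketch: `find_lock_error` and the `LOCK_ERROR` constant are hypothetical names, and the matched text is taken from the log line above rather than from any documented Nextflow interface.

```python
from pathlib import Path

# Substring copied from the ERROR line in the log excerpt above (assumed stable).
LOCK_ERROR = "Unable to acquire lock on session"


def find_lock_error(log_path):
    """Return the first log line reporting a session-lock failure, or None."""
    with Path(log_path).open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if LOCK_ERROR in line:
                return line.rstrip("\n")
    return None
```

A caller could use a non-None result as a signal that the previous session is still open, in which case resuming immediately is unlikely to succeed.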

ala-ebi commented 1 week ago

The run where the indexing job failed despite the resume kicking in was caused by the indexer itself: it expected a new field in Solr that was absent at the time of the run, so that failure is expected. In the last run there were failures, and on checking the logs I saw that the resume ran and, after a while, the indexer job finished. So from what I can see, it's working.

karatugo commented 1 day ago

Discussed repeated resuming with Ala. We agreed to make the following changes. They don't seem urgent, so I'm only creating a ticket for now.

Wait for 1h before every resume attempt (so that the job scheduler error gets resolved)
Only resume a few times (to avoid repeated resuming)
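
The two changes above could look roughly like the sketch below. This is an illustration, not the planned implementation: the function name, the `max_resumes` default of 3, and the injectable `runner` and `sleep` parameters are all assumptions. Only the 1h wait (3600 seconds) comes from the discussion.

```python
import subprocess
import time


def run_with_bounded_resume(command, max_resumes=3, wait_seconds=3600,
                            runner=subprocess.run, sleep=time.sleep):
    """Run `command`; on failure, wait and retry with -resume, at most `max_resumes` times.

    Returns the number of resume attempts that were needed (0 if the first
    run succeeded). Re-raises the last error if every attempt fails.
    """
    try:
        runner(command, check=True)
        return 0
    except subprocess.CalledProcessError as err:
        last_error = err
    for attempt in range(1, max_resumes + 1):
        # Wait before each resume so a scheduler problem has time to clear.
        sleep(wait_seconds)
        try:
            runner(command + ["-resume"], check=True)
            return attempt
        except subprocess.CalledProcessError as err:
            last_error = err
    raise last_error
```

Passing `sleep` and `runner` as parameters keeps the sketch testable without waiting an hour or launching Nextflow. In real use the defaults would apply.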

See https://github.com/EBISPOT/goci/issues/1431

EBISPOT / goci

Data Release : Make Solr Indexer handle nextflow resume on failure #1383