@minglibio Thanks for reporting. I don't think the NotImplementedError should be happening. Do you happen to have a traceback from this workflow that you could provide?
Regarding running cactus on specific queues: Cactus uses Toil to run its entire workflow, so any Toil argument should apply to the cactus run itself. Those two environment variables should be the only ones you need to set, although I'm unsure whether passing multiple queues to the batch system is supported (this likely depends on the batch system rather than on Toil). If export TOIL_GRIDENGINE_ARGS='-q queue1,queue2' does not work, you can try a single queue, e.g. export TOIL_GRIDENGINE_ARGS='-q queue1'.
➤ Adam Novak commented:
I think part of fixing this might be trawling the GridEngineBatchSystem for any required methods that aren't implemented. We don't use actual Grid Engine itself in CI, so it's possible we added an abstract method that raises NotImplementedError to AbstractGridEngineBatchSystem and forgot the implementation here.
Hey! Just as a quick follow-up, I got the exact same error. My command was this:
source /.mounts/labs/simpsonlab/users/dsokolowski/projects/annotation_pipeline/external/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/activate
export PATH="/.mounts/labs/simpsonlab/users/dsokolowski/projects/annotation_pipeline/external/cactus-bin-v2.8.4/bin:$PATH"
export TOIL_GRIDENGINE_ARGS='-V -q all.q -P simpsonlab'
cactus jobstore cactus_in.txt target_ref.hal --binariesMode local --maxMemory 24G --realTimeLogging True --batchSystem grid_engine --consCores 4 --workDir /.mounts/labs/simpsonlab/users/dsokolowski/projects/annotation_pipeline/mHetGlaV3_test/sge_cactus_binary
The error is thrown almost immediately. Could it be due to a deprecated Toil method?
The summary of the Toil requirements also seems fine, so I'm not sure why it would include deprecated stuff.
Best, Dustin
I'm not sure if this is helpful or relevant, but I tried re-installing everything with cactus 2.9 in an otherwise empty environment.
The Toil requirements text file is this:
backports.zoneinfo[tzdata];python_version<"3.9"
toil[aws]==7.0.0
While I didn't get any notes or warnings, I did see this line in the installation:
Ignoring backports.zoneinfo: markers 'python_version < "3.9"' don't match your environment
Could this be leading to an issue with running toil on SGE?
Best, Dustin
I think the NotImplementedError may be a bit misleading. The function coalesce_job_exit_codes may not be implemented, but the exception is supposed to be caught; the with_retries function just logs the exception even when it is caught. So even though there is a logger error, the default behavior should still be working.
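Just to illustrate the pattern (a minimal sketch with made-up bodies, not Toil's actual code; only the names with_retries and coalesce_job_exit_codes are taken from the discussion above):
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def with_retries(fn, *args):
    # Illustrative wrapper: it logs any exception before re-raising, which is
    # why a NotImplementedError appears in the log even when the caller
    # handles it correctly afterwards.
    try:
        return fn(*args)
    except Exception:
        logger.error("Operation %s failed", fn.__name__, exc_info=True)
        raise

class GridBatchSystem:
    def coalesce_job_exit_codes(self, job_ids):
        # Optional optimization; a subclass may leave it unimplemented.
        raise NotImplementedError

def collect_exit_codes(batch, job_ids):
    try:
        return with_retries(batch.coalesce_job_exit_codes, job_ids)
    except NotImplementedError:
        # Expected fallback: the exception is caught here and the run
        # continues with default per-job handling.
        return [0] * len(job_ids)

print(collect_exit_codes(GridBatchSystem(), ["job1", "job2"]))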
The backports message should be fine when running Python 3.9 and above, as the backports.zoneinfo library only exists to support the zoneinfo module that was added to the standard library in Python 3.9.
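For reference, the conditional dependency simply selects the backport on older interpreters. A generic sketch (not taken from Toil's code) of how such code typically chooses the module:
import sys

# On Python 3.9+ the standard library provides zoneinfo; the
# backports.zoneinfo package exists only for older interpreters,
# which is what the pip marker python_version < "3.9" encodes.
if sys.version_info >= (3, 9):
    from zoneinfo import ZoneInfo
else:
    from backports.zoneinfo import ZoneInfo

print(ZoneInfo("UTC"))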
A potential fix is on a branch of Toil; you can try installing it with:
pip3 install 'toil @ git+https://github.com/DataBiosphere/toil.git@issues/5022-not-implemented-gridengine'
or
pip3 install git+https://github.com/DataBiosphere/toil.git@issues/5022-not-implemented-gridengine#egg=toil
You may need to uninstall toil first with pip uninstall toil. If you need extras, something like
pip3 install 'toil[aws] @ git+https://github.com/DataBiosphere/toil.git@issues/5022-not-implemented-gridengine'
should work.
@DustinSokolowski @minglibio We don't have an SGE cluster here to test on, so please do tell us if the fix works.
Hey!
Thank you for the quick response. So far so good!
It also looks like cactus was able to run with the "single_machine" option when submitted as a job to the SGE, which is encouraging. I will open a new issue if I run into problems with cactus using "grid_engine". We're assembling a number of species in the same family tree and hoping to do an MSA with cactus on our SGE cluster, so it will be nice to get cactus working in both modes.
best, Dustin
Hey @stxue1
Thanks for your help.
After installing the new Toil, the NotImplementedError was successfully fixed, but there are still some errors preventing me from running cactus.
My command:
#!/bin/bash
#$ -N cactus #task name
#$ -wd /data/scc3/ming.li/software/cactus-bin-v2.8.4/test #work dir
#$ -pe smp 2 #slot
#$ -l h_vmem=10G #memory
#$ -l h_rt=960:00:00 #run time
export TOIL_GRIDENGINE_PE='smp'
export TOIL_GRIDENGINE_ARGS='-q long'
source /data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/activate
mkdir -p ./temp
cactus ./js ../examples/evolverMammals.txt ./evolverMammals.hal --workDir ./temp --maxCores 12 --maxMemory 100G --doubleMem true --realTimeLogging True --batchSystem grid_engine
Attached is the log. cactus.e5297895.txt
Hey @minglibio
I was looking through the error file and I noticed that your SGE is fully rejecting the first job:
Job failed with exit value 2: 'progressive_workflow' kind-progressive_workflow/instance-ilu1jvv0 v1 Exit reason: None
Is there a possibility that there is an incompatibility between your queue commands and Toil's? For example, for us we need to assign a project and a job name or else the cluster rejects us outright (albeit it tells us why). Do you know if you have to set a minimum h_vmem or something like that?
Another question is about export TOIL_GRIDENGINE_ARGS='-q long'. Can your long queue support the huge number of jobs that cactus runs? Given that a mammalian genome submits 10k jobs, I think our -q long would kick us off. That said, since cactus is crashing at your first job, I doubt it has to do with your forcing "long".
Hey @DustinSokolowski
I tested it, and it can run a job without any parameters; in that case, the job name is assigned from the shell file name.
I ran the built-in example of cactus (only a few species and a few parts of chromosome 5), and I don't expect it will generate that many jobs... I never submit 10k jobs at one time, but 1k should be fine in our long queue.
I noticed that in your last run you used cactus v2.9; maybe I need to update my cactus version...
How about your jobs, do they run smoothly with the SGE cluster mode?
Best, Ming
Hey!
Yeah I was sort of trying everything RE: cactus version. I'm not sure it made a difference.
I'm not sure I have a great answer for your question. I haven't made it to the end of a pipeline in grid_engine mode (though I now have with single_machine). Here's a screenshot of the current log. It's able to run a lot of jobs, and some jobs also fail; I think this is the expected behaviour (and Toil retries them).
The line from the log, import: unable to open X server `' @ error/import.c/ImportImageCommand/349., kind of sounds like it's trying to run a Python script as a bash script: the shell resolves the script's import statements to ImageMagick's import command, which then fails because it cannot reach an X server. Since the _toil_worker seems to be invoked on the cactus side as some sort of binary/pointer, this could be more of a cactus issue. @glennhickey @diekhans Does the log look more like a cactus-related problem to you?
=========>
import: unable to open X server `' @ error/import.c/ImportImageCommand/349.
import: unable to open X server `' @ error/import.c/ImportImageCommand/349.
/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: line 5: from: command not found
/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: syntax error near unexpected token `('
/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: ` sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])'
<=========
My guess is that _toil_worker is being executed as a shell script rather than a Python program.
What do the first 10 lines of the file look like on your system?
@diekhans You mean the _toil_worker? Here is the file:
#!/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from toil.worker import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())
Reopening as the issue itself isn't fully resolved.
Looks like _toil_worker is being executed as a bash/shell script instead of Python. The worker file matches the logged error: the from import is on line 5, and sys.argv[0] = re.sub(...) is on line 7:
/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: line 5: from: command not found
/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: syntax error near unexpected token `('
/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/_toil_worker: _toil_worker: line 7: ` sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])'
Should I open an issue in cactus to resolve this one?
At least I don't think there's much we can do on the Toil side for this specific issue, though my hunch is that this is more of a configuration issue than a Toil/cactus issue. The shebang at the top of the script should ensure that the file is run by the referenced interpreter when executed. Does /data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python3 point to a valid Python runtime? Does a basic Python script with the same shebang work? For example, by defining a file test.py:
#!/data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
print("Hello world")
and running chmod +x ./test.py && ./test.py.
I tested, and everything went well. I also tested the _toil_worker, and it works well on the login node.
What I will do is try to figure out with our cluster manager whether this is because of our cluster's settings. I will let you know if we solve this problem.
It's likely also worth testing whether the _toil_worker works from all cluster nodes, to check that the path is executable/accessible (or, if it is a symlink, whether it can be followed). (Though from local testing these cases should return some other error, so I still doubt this is the issue. Perhaps something different happens on the cluster?)
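One way to check that from each node would be a small diagnostic script (hypothetical, written here for illustration, e.g. saved as check_shebang.py) that verifies the worker's shebang interpreter exists and is executable; you could submit it through qsub to each queue and compare results:
#!/usr/bin/env python3
# Hypothetical diagnostic: verify that a script's shebang interpreter
# is reachable and executable from the current node.
import os
import sys

def check_shebang(script_path):
    with open(script_path) as f:
        first_line = f.readline().strip()
    if not first_line.startswith("#!"):
        return f"{script_path}: no shebang line"
    interpreter = first_line[2:].split()[0]
    resolved = os.path.realpath(interpreter)  # follow any symlinks
    if not os.path.exists(resolved):
        return f"{interpreter} -> {resolved}: does not exist"
    if not os.access(resolved, os.X_OK):
        return f"{interpreter} -> {resolved}: not executable"
    return f"{interpreter} -> {resolved}: OK"

if __name__ == "__main__":
    print(check_shebang(sys.argv[1]))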
It is because of a system setting on the cluster: all queues on our cluster have shell_start_mode set to posix_compliant, which ignores the #! line. Adding the -S parameter resolved this issue.
export TOIL_GRIDENGINE_ARGS='-S /data/scc3/ming.li/software/cactus-bin-v2.8.4/venv-cactus-v2.8.4/bin/python'
Hi, I am trying to run cactus v2.8.4 on a grid_engine cluster with the following command:
I keep getting the error NotImplementedError. Should I do anything to avoid this error?
Another question: in this cluster, we have several different queues, and I want to run cactus on some specific queues because they are faster. Which parameter should I add to achieve this? Should I add the following environment variables?
Best, Ming