Process stops in the middle, no suitable queues

Double-O-ren commented 9 years ago

I ran gk using tmux and detached the window during the process, it stopped the process in the middle. Now that I want to restart the process, I get this error:

genomekey -d germline -n 'Test_BRCA' /genomekey/share/test/brca/input_s3.tsv --target_bed /genomekey/share/test/brca/targets.bed /home/genomekey/projects/GenomeKey/ve/local/lib/python2.7/site-packages/flask_sqlalchemy/__init__.py:800: UserWarning: SQLALCHEMY_TRACK_MODIFICATIONS adds significant overhead and will be disabled by default in the future. Set it to True to suppress this warning. warnings.warn('SQLALCHEMY_TRACK_MODIFICATIONS adds significant overhead and will be disabled by default in the future. Set it to True to suppress this warning.') Resuming <Execution[1] Test_BRCA>. All non-successful jobs will be deleted, then any new tasks in the graph will be added and executed. Are you sure? [n]|y:

If I click 'y' I get another error and land in the debugger:

INFO: 2015-10-30 18:15:57: Resuming <Execution[1] Test_BRCA> INFO: 2015-10-30 18:15:57: Deleting 10 unsuccessful task(s) from SQL database, delete_files=False INFO: 2015-10-30 18:15:57: Preparing to run <Execution[1] Test_BRCA> using DRM drmaa:ge, output_dir: /genomekey/analysis/Germline/Test_BRCA INFO: 2015-10-30 18:15:57: <Stage[3] Download_Fastqs_From_S3> Finished successfully INFO: 2015-10-30 18:15:57: <Stage[4] Fastqc> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[1] Copy_Target_Bed> Finished successfully INFO: 2015-10-30 18:15:57: <Stage[2] Filter_Bed_By_Contig> Finished successfully INFO: 2015-10-30 18:15:57: <Stage[5] Cut_Adapt> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[6] Bwa_Mem> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[7] Mark_Duplicates> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[8] Realigner_Target_Creator> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[9] Indel_Realigner> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[10] Merge_Sample_Bams> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[11] Haplotype_Caller> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[12] Combine_Gvcfs> Has not been attempted INFO: 2015-10-30 18:15:57: <Stage[13] Genotype_Gvcfs> Has not been attempted INFO: 2015-10-30 18:15:57: Skipping 4 successful tasks... INFO: 2015-10-30 18:15:57: Committing to SQL db... INFO: 2015-10-30 18:15:57: Executing TaskGraph `DeniedByDrmException('code 17: error: no suitable queues',)

/home/genomekey/projects/GenomeKey/ve/local/lib/python2.7/site-packages/drmaa/errors.py(151)error_check() 150 try: --> 151 raise _ERRORScode - 1 152 except IndexError:

ipdb> ` If I click 'n' then it quits and nothing happens.

I killed the job through the graphical interface and restarted. after completing the download, I get the same error (redacted):

`DeniedByDrmException('code 17: error: no suitable queues',)

/home/genomekey/projects/GenomeKey/ve/local/lib/python2.7/site-packages/drmaa/errors.py(151)error_check() 150 try: --> 151 raise _ERRORScode - 1 152 except IndexError: ipdb>`

What can I do in this case? I'll reboot the cluster as a last resort.

egafni commented 9 years ago

Agh, cosmos is submitting to the wrong queue. This is my fault (i changed cosmos' default). The problem was caused by me not pinning the versions of libraries installed by GenomeKey. When you launch the cluster, StarClusterExtensions does pip install genomekey, which installs all of it's requirements from the PIP cheese shop. So it grabs for example the latest version of cosmos. What I need to do is pin every version of GenomeKey's requirements to the one we know is working (example: install cosmos-wfm==0.5.1).

You have a few options, both are good ways to start learning how to modify Cosmos/GenomeKey code: pip install an older version of cosmos-wfm: look for the commit where i changed the function default_get_submit_args() and pip install the version just before that.

Alter GenomeKey and manually specify the get_submit_args() passed to Execution.run() to submit to the proper queue.

(Rebooting will not help)

Double-O-ren commented 9 years ago

I found that the worker node was not correctly attached to the cluster after running listclusters. I terminated the cluster and started a fresh one, made sure that there was at least one node attached and the demo worked. My assumption is that the error was in restarting the cluster using the -x option starcluster -c etc/starcluster.config start -x gk && starcluster -c etc/starcluster.config addnode gk

yassineS commented 9 years ago

In my experience, restarting starcluster doesn’t reliably restart clusters, something definitely to avoid.

-- Yassine Souilmi

On November 2, 2015 at 6:30:15 PM, Oren Schaedel (notifications@github.com) wrote:

I found that the worker node was not correctly attached to the cluster after running listclusters. I terminated the cluster and restarted, made sure that there was at least one node attached and the demo worked. My assumption is that the error was in restarting the cluster using the -x option starcluster -c etc/starcluster.config start -x gk && starcluster -c etc/starcluster.config addnode gk

— Reply to this email directly or view it on GitHub.

egafni commented 9 years ago

Great!

On Mon, Nov 2, 2015 at 10:33 AM, Yassine Souilmi notifications@github.com wrote:

In my experience, restarting starcluster doesn’t reliably restart clusters, something definitely to avoid.

Yassine Souilmi

On November 2, 2015 at 6:30:15 PM, Oren Schaedel (notifications@github.com) wrote:

I found that the worker node was not correctly attached to the cluster after running listclusters. I terminated the cluster and restarted, made sure that there was at least one node attached and the demo worked. My assumption is that the error was in restarting the cluster using the -x option starcluster -c etc/starcluster.config start -x gk && starcluster -c etc/starcluster.config addnode gk

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub https://github.com/LPM-HMS/COSMOS-2.0/issues/24#issuecomment-153118060.

egafni commented 9 years ago

@Double-O-ren in the future please post GenomeKey issues to the GenomeKey repository

Mizzou-CBMI / COSMOS2

Process stops in the middle, no suitable queues #24