PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

ERROR: Failed to kill job for heartbeat (which might mean it was already gone) #678

Closed Hongchang-Gu closed 5 years ago

Hongchang-Gu commented 5 years ago

Hello, I recently ran into a tricky problem when running FALCON. I checked the issue list and found no similar problem. First, I found the following errors in the all.log file:

[ERROR]Task Node(0-rawreads/tan-runs/tan_045) failed with exit-code=256
[ERROR]Some tasks are recently_done but not satisfied: set([Node(0-rawreads/tan-runs/tan_045)])
[ERROR]ready: set([Node(0-rawreads/tan-runs/tan_032), Node(0-rawreads/tan-runs/tan_029), Node(0-rawreads/tan-runs/tan_023), Node(0-rawreads/tan-runs/tan_044), Node(0-rawreads/tan-runs/tan_010), Node(0-rawreads/tan-runs/tan_000), Node(0-rawreads/tan-runs/tan_015), Node(0-rawreads/tan-runs/tan_038), Node(0-rawreads/tan-runs/tan_037), Node(0-rawreads/tan-runs/tan_009), Node(0-rawreads/tan-runs/tan_035), Node(0-rawreads/tan-runs/tan_008), Node(0-rawreads/tan-runs/tan_016), Node(0-rawreads/tan-runs/tan_026), Node(0-rawreads/tan-runs/tan_034), Node(0-rawreads/tan-runs/tan_011), Node(0-rawreads/tan-runs/tan_018), Node(0-rawreads/tan-runs/tan_014), Node(0-rawreads/tan-runs/tan_006), Node(0-rawreads/tan-runs/tan_012), Node(0-rawreads/tan-runs/tan_021), Node(0-rawreads/tan-runs/tan_028), Node(0-rawreads/tan-runs/tan_042), Node(0-rawreads/tan-runs/tan_005), Node(0-rawreads/tan-runs/tan_041), Node(0-rawreads/tan-runs/tan_017), Node(0-rawreads/tan-runs/tan_031), Node(0-rawreads/tan-runs/tan_040), Node(0-rawreads/tan-runs/tan_020), Node(0-rawreads/tan-runs/tan_036), Node(0-rawreads/tan-runs/tan_043), Node(0-rawreads/tan-runs/tan_024), Node(0-rawreads/tan-runs/tan_007), Node(0-rawreads/tan-runs/tan_039), Node(0-rawreads/tan-runs/tan_004), Node(0-rawreads/tan-runs/tan_025)])
    submitted: set([Node(0-rawreads/tan-runs/tan_003), Node(0-rawreads/tan-runs/tan_046), Node(0-rawreads/tan-runs/tan_001), Node(0-rawreads/tan-runs/tan_022), Node(0-rawreads/tan-runs/tan_002), Node(0-rawreads/tan-runs/tan_013), Node(0-rawreads/tan-runs/tan_027)])
[ERROR]Failed to kill job for heartbeat 'heartbeat-Pc41371aabe9093' (which might mean it was already gone): IOError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/vol6/home/quluj/Software/falcon-2018.08.08/lib/python2.7/site-packages/pwatcher/fs_based.py", line 597, in delete_heartbeat
    bjob.kill(state, heartbeat)
  File "/vol6/home/quluj/Software/falcon-2018.08.08/lib/python2.7/site-packages/pwatcher/fs_based.py", line 272, in kill
    with open(heartbeat_fn) as ifs:
IOError: [Errno 2] No such file or directory: '/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/heartbeats/heartbeat-Pc41371aabe9093'
[ERROR]Failed to kill job for heartbeat 'heartbeat-P3dc087e038236a' (which might mean it was already gone): IOError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/vol6/home/quluj/Software/falcon-2018.08.08/lib/python2.7/site-packages/pwatcher/fs_based.py", line 597, in delete_heartbeat
    bjob.kill(state, heartbeat)
  File "/vol6/home/quluj/Software/falcon-2018.08.08/lib/python2.7/site-packages/pwatcher/fs_based.py", line 272, in kill
    with open(heartbeat_fn) as ifs:
IOError: [Errno 2] No such file or directory: '/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/heartbeats/heartbeat-P3dc087e038236a'
[ERROR]Failed to kill job for heartbeat 'heartbeat-P71fb5af287f7d4' (which might mean it was already gone): IOError(2, 'No such file or directory')

These errors were repeated many times.
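When all.log is this noisy, it can help to pull out only the task-failure lines, since the heartbeat messages appear to be follow-on noise from cleanup. The snippet below is a minimal sketch, assuming the failure lines keep the exact `Task Node(...) failed with exit-code=N` shape shown above; the function name `failed_tasks` is made up for illustration and is not part of pypeFLOW. Note that the task reported as failed here is 0-rawreads/tan-runs/tan_045, so that is the directory whose stderr is worth reading.

```python
import re

# Matches lines such as:
# [ERROR]Task Node(0-rawreads/tan-runs/tan_045) failed with exit-code=256
FAILED_TASK = re.compile(
    r"Task Node\((?P<node>[^)]+)\) failed with exit-code=(?P<code>\d+)"
)


def failed_tasks(log_lines):
    """Return a list of (node, exit_code) for every task-failure line."""
    found = []
    for line in log_lines:
        match = FAILED_TASK.search(line)
        if match:
            found.append((match.group("node"), int(match.group("code"))))
    return found


# Two lines copied from the log above: one real failure, one heartbeat message.
sample = [
    "[ERROR]Task Node(0-rawreads/tan-runs/tan_045) failed with exit-code=256",
    "[ERROR]Failed to kill job for heartbeat 'heartbeat-Pc41371aabe9093' "
    "(which might mean it was already gone): "
    "IOError(2, 'No such file or directory')",
]
print(failed_tasks(sample))
```

On a real run, the same function could be fed `open("all.log")` instead of the sample list.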

Then I tried to find the cause in the stderr output, which contains the following:

+ python2.7 -m pwatcher.mains.fs_heartbeat --directory=/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042 --heartbeat-file=/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/heartbeats/heartbeat-P3b2f2a1654d2fb --exit-file=/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/exits/exit-P3b2f2a1654d2fb --rate=10.0 /bin/bash run.sh
Namespace(command=['/bin/bash', 'run.sh'], directory='/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042', exit_file='/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/exits/exit-P3b2f2a1654d2fb', heartbeat_file='/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/heartbeats/heartbeat-P3b2f2a1654d2fb', rate=10.0)

cwd:'/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042'
hostname=cn4495
heartbeat_fn='/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/heartbeats/heartbeat-P3b2f2a1654d2fb'
exit_fn='/vol6/home/quluj/Software/falcon-2018.08.08/mypwatcher/exits/exit-P3b2f2a1654d2fb'
sleep_s=10.0
before setpgid: pid=11844 pgid=24232
 after setpgid: pid=11844 pgid=11844
In cwd: /vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042, Blocking call: '/bin/bash run.sh'
export PATH=$PATH:/bin
+ export PATH=/vol6/home/quluj/Software/falcon-2018.08.08/bin:/vol6/software/anaconda2-5.2.0/bin:/vol6/home/quluj/Software/jre1.8.0_111/bin:/vol6/home/quluj/anaconda3/bin:/vol6/home/quluj/qixin/Soft/circos-0.69-6/bin/:/bin:/vol6/home/quluj/anaconda3/bin:/usr/local/mpi-intel2013/bin:/vol6/home/quluj/Software/falcon-2018.08.08/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/vol6/home/quluj/bin:/bin
+ PATH=/vol6/home/quluj/Software/falcon-2018.08.08/bin:/vol6/software/anaconda2-5.2.0/bin:/vol6/home/quluj/Software/jre1.8.0_111/bin:/vol6/home/quluj/anaconda3/bin:/vol6/home/quluj/qixin/Soft/circos-0.69-6/bin/:/bin:/vol6/home/quluj/anaconda3/bin:/usr/local/mpi-intel2013/bin:/vol6/home/quluj/Software/falcon-2018.08.08/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/vol6/home/quluj/bin:/bin
cd /vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042
+ cd /vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042
/bin/bash task.sh
+ /bin/bash task.sh
pypeflow 2.0.4+git.005acb16689c18c09cf552b42911e69629ffeceb
2018-12-05 00:27:43,299 - root - DEBUG - Running "/vol6/home/quluj/Software/falcon-2018.08.08/lib/python2.7/site-packages/pypeflow/do_task.py /vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042/task.json"
2018-12-05 00:27:43,301 - root - DEBUG - Checking existence of '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042/task.json' with timeout=30
2018-12-05 00:27:43,301 - root - DEBUG - Loading JSON from '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042/task.json'
2018-12-05 00:27:43,304 - root - DEBUG - {u'bash_template_fn': u'template.sh',
 u'inputs': {u'all': u'/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-split/tan-uows.json'},
 u'outputs': {u'one': u'some-units-of-work.json'},
 u'parameters': {u'pypeflow_mb': 4000, u'pypeflow_nproc': 1, u'split_idx': 42}}
2018-12-05 00:27:43,304 - root - WARNING - CD: '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042' <- '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042'
2018-12-05 00:27:43,304 - root - DEBUG - Checking existence of u'/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-split/tan-uows.json' with timeout=30
2018-12-05 00:27:43,305 - root - DEBUG - Checking existence of u'template.sh' with timeout=30
2018-12-05 00:27:43,307 - root - WARNING - CD: '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042' <- '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042'
2018-12-05 00:27:43,312 - root - INFO - $('/bin/bash user_script.sh')
hostname
+ hostname
pwd
+ pwd
date
+ date
# Substitution will be similar to snakemake "shell".
python -m falcon_kit.mains.generic_scatter_one_uow --all-uow-list-fn=/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-split/tan-uows.json --one-uow-list-fn=some-units-of-work.json --split-idx=42
+ python -m falcon_kit.mains.generic_scatter_one_uow --all-uow-list-fn=/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-split/tan-uows.json --one-uow-list-fn=some-units-of-work.json --split-idx=42
falcon-kit 1.2.2+git.00e8272b663d32a0962ae92ab92324a3b3eb4b46
pypeflow 2.0.4+git.005acb16689c18c09cf552b42911e69629ffeceb

date
+ date
2018-12-05 00:27:52,683 - root - DEBUG - Call '/bin/bash user_script.sh' returned 0.
2018-12-05 00:27:52,683 - root - WARNING - CD: '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042' -> '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042'
2018-12-05 00:27:52,683 - root - DEBUG - Checking existence of u'some-units-of-work.json' with timeout=30
2018-12-05 00:27:52,684 - root - WARNING - CD: '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042' -> '/vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042'

real    0m18.620s
user    0m1.180s
sys 0m0.793s
touch /vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042/run.sh.done
+ touch /vol6/home/quluj/Software/falcon-2018.08.08/0-rawreads/tan-chunks/tan_042/run.sh.done
 returned: 0

I don't know how to solve this. I hope someone can help. Thanks a lot!

Hongchang-Gu commented 5 years ago

I would appreciate it if someone could help me solve this problem!

gconcepcion commented 5 years ago

Hi,

Can you please share the contents of your fc_run.cfg?

This is the line in particular that I'm interested in: pa_HPCTANmask_option =

Hongchang-Gu commented 5 years ago

Sorry for replying two days after your response.

I did not set pa_HPCTANmask_option in my fc_run.cfg.

This is my fc_run.cfg:

[General]
input_fofn = input.fofn

input_type = raw

job_type=local

length_cutoff = 10000
genome_size = 1200000000
length_cutoff_pr = 10000

sge_option_da = -pe smp 4 -q bigmem
sge_option_la = -pe smp 20 -q bigmem
sge_option_pda = -pe smp 6 -q bigmem
sge_option_pla = -pe smp 16 -q bigmem
sge_option_fc = -pe smp 24 -q bigmem
sge_option_cns = -pe smp 8 -q bigmem

pa_concurrent_jobs = 64
cns_concurrent_jobs = 64
ovlp_concurrent_jobs = 64

pa_HPCdaligner_option =  -v -B128 -M24
ovlp_HPCdaligner_option = -v -B128 -M24
pa_daligner_option   = -e.75 -l3200 -k18 -h480  -w8 -s100
ovlp_daligner_option = -e.96 -l2500 -k24 -h1024 -w6 -s100

pa_DBsplit_option = -a -x500 -s200
ovlp_DBsplit_option = -s200

falcon_sense_option = --output_multi --output_dformat --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 8
falcon_sense_skip_contained = True
overlap_filtering_setting = --max_diff 120 --max_cov 120 --min_cov 2 --n_core 6

Thank you!

Hongchang-Gu commented 5 years ago

Hello, the contents of my fc_run.cfg are already posted above. I hope you can take a look at this issue at your convenience. Thanks!


suvi93 commented 5 years ago

Any update as to how to go about this error? I'm getting the same error as well.

2019-02-19 08:11:58,284 - pwatcher.fs_based:600 - ERROR - Failed to kill job for heartbeat 'heartbeat-Pd905dffe513485' (which might mean it was already gone): IOError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/home/ssubha/software/falcon/falcon-2018.31.08-03.06/lib/python2.7/site-packages/pwatcher/fs_based.py", line 597, in delete_heartbeat
    bjob.kill(state, heartbeat)
  File "/home/ssubha/software/falcon/falcon-2018.31.08-03.06/lib/python2.7/site-packages/pwatcher/fs_based.py", line 272, in kill
    with open(heartbeat_fn) as ifs:
IOError: [Errno 2] No such file or directory: '/home/ssubha/Suvratha/mosquito/pacbiodata/falcon/mypwatcher/heartbeats/heartbeat-Pd905dffe513485'
2019-02-19 08:11:58,285 - pwatcher.fs_based:607 - DEBUG - Cannot remove heartbeat '/home/ssubha/Suvratha/mosquito/pacbiodata/falcon/mypwatcher/heartbeats/heartbeat-Pd905dffe513485': OSError(2, 'No such file or directory')

Hongchang-Gu commented 5 years ago

Sorry, I have not solved this problem yet. I have put it on hold for now because of other data processing work. If you manage to solve it, please let me know. Thanks a lot!


pb-cdunn commented 5 years ago

First, DEBUG is not an error. Might be fine. If it crashes for a specific task, look for a stderr file in the run-dir for that task.
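A rough way to do that lookup, as a minimal sketch: it assumes the task's run-dir sits under the directory that contains 0-rawreads/ and that the stderr file has "stderr" somewhere in its name. Both are guesses based on the paths in this thread, and `stderr_files` and `tail` are hypothetical helper names, not pypeFLOW functions.

```python
import tempfile
from pathlib import Path


def stderr_files(run_root, node):
    """Return sorted file paths under run_root/node whose name contains 'stderr'."""
    task_dir = Path(run_root) / node
    if not task_dir.is_dir():
        return []
    return sorted(p for p in task_dir.rglob("*stderr*") if p.is_file())


def tail(path, n=20):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as fh:
        return fh.readlines()[-n:]


# Demo on a throwaway directory tree. On a real run, pass the directory
# that contains 0-rawreads/ and the node name reported as failed in all.log.
with tempfile.TemporaryDirectory() as root:
    task = Path(root) / "0-rawreads/tan-runs/tan_045"
    task.mkdir(parents=True)
    (task / "stderr").write_text("line 1\nline 2\nline 3\n")
    found = stderr_files(root, "0-rawreads/tan-runs/tan_045")
    names = [p.name for p in found]
    last_lines = tail(found[0], n=2)

print(names)
print(last_lines)
```

The end of the stderr file is usually the interesting part, which is why the sketch only prints the tail.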

Also, please switch to pwatcher_type = block. It is very difficult to support fs_based.

https://github.com/PacificBiosciences/pypeFLOW/wiki/configuration
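To see which watcher settings a config actually contains before changing anything, something like the following may help. It is a minimal sketch that reads the config with Python's standard `configparser`; the helper name `pwatcher_settings` is made up for illustration. The fc_run.cfg posted earlier in this thread sets job_type but no pwatcher_type, and the tracebacks show fs_based in use, which suggests fs_based is what you get when nothing is set. The exact spelling of the value to use should be taken from the configuration page linked above.

```python
import configparser


def pwatcher_settings(cfg_text):
    """Collect job_type and pwatcher_type from every section of the config text."""
    # interpolation=None keeps characters such as '%' or '$' in option values
    # from being treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cfg_text)
    found = {}
    for section in parser.sections():
        for key in ("job_type", "pwatcher_type"):
            if parser.has_option(section, key):
                found[(section, key)] = parser.get(section, key)
    return found


# Excerpt of the fc_run.cfg posted earlier in this thread.
sample_cfg = """
[General]
input_fofn = input.fofn
job_type=local
"""

settings = pwatcher_settings(sample_cfg)
print(settings)
if not any(key == "pwatcher_type" for _, key in settings):
    print("pwatcher_type is not set explicitly")
```

For a real file, `pwatcher_settings(open("fc_run.cfg").read())` would do the same check.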