PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

PBS w/ blocking pwatcher #502

Open hodgett opened 7 years ago

hodgett commented 7 years ago

was: Process getting stuck now at fc_consensus

We have a large cluster (250+ nodes) that we are running FALCON on, and we keep running into new problems, some due to the configuration of the various clusters we have tried, some due to the software. Two weeks ago we ran some mid-size samples successfully and so scaled up, but now no run reaches completion.

Last week we were getting the EXIT 256 problem during the first daligner step, so after reading the issue tracker I upgraded to the current version. Now it won't get past fc_consensus. The fc_consensus jobs nearly always hang on the node at 0% CPU, usually with no stderr or stdout logs; sometimes stdout says it spawned the workers for fc_consensus and that is all. Sometimes I get an EXIT 99, and the following is all that is in stderr:

+ python2.7 -m pwatcher.mains.fs_heartbeat --directory=/lustre/work-lustre/banana_team/Falcon_Test-20161220c/0-rawreads/preads/cns_00006 --heartbeat-file=/lustre/work-lustre/banana_team/Falcon_Test-20161220c/mypwatcher/heartbeats/heartbeat-P1d481d812852fa --exit-file=/lustre/work-lustre/banana_team/Falcon_Test-20161220c/mypwatcher/exits/exit-P1d481d812852fa --rate=10.0 /bin/bash run.sh
/var/spool/PBS/mom_priv/jobs/1378550.pbs.SC: line 6: 11445 Illegal instruction     python2.7 -m pwatcher.mains.fs_heartbeat --directory=/lustre/work-lustre/banana_team/Falcon_Test-20161220c/0-rawreads/preads/cns_00006 --heartbeat-file=/lustre/work-lustre/banana_team/Falcon_Test-20161220c/mypwatcher/heartbeats/heartbeat-P1d481d812852fa --exit-file=/lustre/work-lustre/banana_team/Falcon_Test-20161220c/mypwatcher/exits/exit-P1d481d812852fa --rate=10.0 /bin/bash run.sh
+ echo 99

Our system runs SUSE 12 and PBS Pro, with Python 2.7.11.

I have attached the fc_run and pbs script in configs.zip, and a successful fc_consensus job that has the environment variables in Pd7d3d71c1e5923.zip.

I would appreciate some guidance as to what could be wrong, and how we could figure out what is going wrong. Let me know if I am missing some information.

configs.zip Pd7d3d71c1e5923.zip

pb-cdunn commented 7 years ago

the node with 0 CPU

What does that mean? The node where all CPUs have been fully utilized already?

I highly recommend switching to blocking calls for job-submission. What job-submission system are you using? What are your sge_option lines?

hodgett commented 7 years ago

0 CPU means 0% activity: nothing happening, as opposed to 100% CPU. The job submission system is PBS Pro; the sge_option lines are in the attached configs.

How would I switch to blocking calls for job submission with PBS pro?

lijingjing1 commented 7 years ago

In the fc_consensus step, CPU usage near 0% may be an IO problem. I see you set length_cutoff to 1500 and cns_concurrent_jobs to 200. A length_cutoff that is too short can cost a lot of computation time in daligner and consensus calling, and too many concurrent cns jobs can cause an IO jam. You could also try setting the block size smaller.

hodgett commented 7 years ago

lijingjing1, I have checked IO and it's not a problem at all. I do think the length_cutoff could be tuned, but it is definitely not causing this problem either. The block size is only going to get bigger as we throw bigger datasets at the cluster, reducing it is not an option although I have already tried that for testing purposes.

Since the last successful job was on 10th December and none since, I can only conclude that a bug introduced in the last update has broken fc_consensus. I did notice that the last code update changes the way job_queue is selected. That doesn't explain why a couple of fc_consensus jobs finish, and yet the rest hang with no output.

hodgett commented 7 years ago

Ok, I now believe that all the problems started at the time FALCON_unzip was added. How can I roll back everything to 1.8.2 or 1.8.3?

pb-cdunn commented 7 years ago

FALCON_unzip is not actually installed, so it cannot be causing an issue by simply existing.

You might need the latest FALCON, which we might not have created a "release" for yet in FALCON-integrate. We noticed that cns_concurrent_jobs was not actually being used for the consensus jobs, and that is fixed on tip of master in FALCON.

I'll get back to you on the blocking PBS calls...

hodgett commented 7 years ago

Thanks. I'm just installing 1.8.2 to see if I can get the group going with that. The later versions have a couple of fc_run options which are useful on our system, so I will try the latest after I get something functional. I'll let you know how I go.

hodgett commented 7 years ago

I have performed some preliminary tests. 1.8.6 still has the problem of fc_consensus hanging. 1.8.2 completes this step, albeit with EXIT 99 at 2-asm-falcon.

pb-cdunn commented 7 years ago

I have performed some prelim tests. 1.8.6 still has the problem of fc_consensus hanging.

If I were on your system, I think I could debug this. There must be an explanation, but only you can find it. We would certainly appreciate it if you do, as we have no way of testing on PBS ourselves.

Unfortunately, I do not see a way to do a blocking qsub call in PBS, but I might be looking at the wrong man-page.

msc51 commented 7 years ago

I also have similar problems at the cns jobs. Some jobs went to completion very fast (producing the sentinel "done" files in a couple of hours), but most of my jobs seem to be waiting for something. The cns_xxxxx.fasta file was last written more than a day ago. It is not normal, is it?

My sysadmin told me that the call stack the code is stuck in is as follows:

#0  0x00000032128daff4 in _fxstat () from /lib64/libc.so.6
#1  0x00000032128711cc in mmap_remap_check () from /lib64/libc.so.6
#2  0x00000032128713cc in _IO_file_xsgetn_mmap () from /lib64/libc.so.6
#3  0x0000003212867422 in fread () from /lib64/libc.so.6
#4  0x0000000000405698 in Read_Trace ()
#5  0x0000000000401fb2 in main ()

I am submitting via slurm. I allowed only 5 jobs running at a time. And it isn't an IO problem.

pb-jchin commented 7 years ago

@msc51 even for large genome work done here, I have never seen the consensus run longer than one day. (We do have a quite reasonable file system here.)

Can you dump your 0-rawreads/*/cns_*.sh files (depending on the version, the file location can be different)?

For example...

[fab@pod-alpha2 data]$ cat ~/yeast_example2/0-rawreads/preads/cns_00001/c_00001.sh
#!/bin/bash
set -vex

set -o pipefail
CUTOFF=10000
LA4Falcon -H$CUTOFF -fo  /localdisk/home/fab/yeast_example/0-rawreads/raw_reads.db /localdisk/home/fab/yeast_example/0-rawreads/m_00001/raw_reads.1.las | fc_consensus --output_multi --min_cov_aln 4 --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 8 >| cns_00001.fasta

In some very repetitive genomes, if you don't limit --max_n_read, you can run into an I/O problem that makes the job hang.

msc51 commented 7 years ago

Here is one of my c_xxxxx.sh:

#!/bin/bash
set -vex

set -o pipefail
CUTOFF=$(python2.7 -m falcon_kit.mains.calc_cutoff --coverage 30.0 3000000000 <(DBstats -b1 /bscratch/nc_genome_assembly_hybrid/2_falcon_assembly/1_len_cutoff_pr_9000_try3/0-rawreads/raw_reads))
LA4Falcon -H$CUTOFF -fo /bscratch/nc_genome_assembly_hybrid/2_falcon_assembly/1_len_cutoff_pr_9000_try3/0-rawreads/raw_reads /bscratch/nc_genome_assembly_hybrid/2_falcon_assembly/1_len_cutoff_pr_9000_try3/0-rawreads/m_00016/raw_reads.16.las | fc_consensus --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 64 >| cns_00016.fasta

touch out.done

pb-jchin commented 7 years ago

Can you try --n_core 8 rather than --n_core 64?

msc51 commented 7 years ago

Should I interrupt my current Falcon run, change the config file (falcon_sense_option: --n_core 64 to 8), remove mypwatcher, and then re-run it?

pb-jchin commented 7 years ago

Well, I don't know how your system really works. If you want to try a different parameter, yes, you stop it and make sure all remaining running jobs are actually terminated. (Again, how to do that properly depends on your specific system.) Then remove or rename mypwatcher (I prefer renaming to removing, as sometimes the log files are useful). Then restart.

Or, you can do a side experiment to see how --n_core 8 goes. If it is good, then you restart afterward.
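
[A sketch of such a side experiment, not taken from this thread: copy one consensus task directory, lower --n_core, and run the task script by hand. All paths and names are placeholders; adjust them to your run layout.]

# Copy one cns task directory so the side experiment does not disturb the main run.
cp -r /path/to/run/0-rawreads/preads/cns_00016 /tmp/cns_n8_test
cd /tmp/cns_n8_test
# Drop the consensus worker count from 64 to 8 in the generated script.
sed -i 's/--n_core 64/--n_core 8/' c_00016.sh
# Run it by hand and compare CPU usage and wall time against the 64-core job.
time bash c_00016.sh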

msc51 commented 7 years ago

Thanks @pb-jchin. I am running two jobs as side experiments: one with 8 cores, one with 64 cores. I will report back once I have the results.

pb-cdunn commented 7 years ago

That stack-trace is interesting. Thanks.

Here is the line, in DALIGNER/LA4Falcon.c:

554     for (j = 0; j < novl; j++)
555        //  Read it in
556
557       { Read_Overlap(input,ovl);
558         if (ovl->path.tlen > tmax)
559           { tmax = ((int) 1.2*ovl->path.tlen) + 100;
560             trace = (uint16 *) Realloc(trace,sizeof(uint16)*tmax,"Allocating trace vector");
561             if (trace == NULL)
562               exit (1);
563           }
564         ovl->path.trace = (void *) trace;
565         Read_Trace(input,ovl,tbytes);
...

And in DALIGNER/align.c:

3039 int Read_Trace(FILE *input, Overlap *ovl, int tbytes)
3040 { if (tbytes > 0 && ovl->path.tlen > 0)
3041     { if (fread(ovl->path.trace, tbytes*ovl->path.tlen, 1, input) != 1)
3042         return (1);
3043     }
3044   return (0);
3045 }

It attempts to read a specific number of bytes. I don't think that call can block, so it must be reading an extremely large "item". Or maybe that call itself happens repeatedly, a large number of times. That might be a result of a bad .las file.

I would run LA4Falcon on that data directly, after recompiling with -g, and I'd trap this in a debugger.

Or, I would print the fread() arguments (or run strace with display-arguments flag).
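
[A rough sketch of those two suggestions, assuming a DALIGNER source checkout and placeholder paths; this is not a recipe from the thread.]

# Rebuild LA4Falcon with debug symbols and no optimization.
cd DALIGNER && make clean && make CFLAGS="-g -O0"

# Reproduce outside the workflow under a debugger, with CUTOFF and the
# database/.las paths taken from the generated c_*.sh for the stuck block.
gdb --args ./LA4Falcon -H"$CUTOFF" -fo /path/to/0-rawreads/raw_reads /path/to/0-rawreads/m_00016/raw_reads.16.las

# Or watch the read()/fstat() calls of an already-hung LA4Falcon process.
strace -f -e trace=read,fstat -p <pid-of-hung-LA4Falcon>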

Or, you can give me login access to your system.

Or, if small enough, you can provide us with your test-data. If it's not a PBS issue, we could then repro and debug.

msc51 commented 7 years ago

You are right; I haven't checked the raw_reads.x.las files.

I did an LAcheck on the two L2 las files (L2.16.1.las and L2.16.2.las); they are OK. However, after merging the two L2 las files into raw_reads.16.las, LAcheck on raw_reads.16.las failed with the error: raw_reads.16: Too few alignment records

I tried to re-do the LAmerge step (LAmerge -v raw_reads.16 L2.16.1 L2.16.2) and the resulting raw_reads.16.las file still does not pass LAcheck.

Can you give me some advice on this? Most of my raw_reads.x.las files have the same problem, and re-doing LAmerge did not help.
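
[For reference, the sequence of checks described above expressed as commands; a sketch with placeholder paths, assuming standard DALIGNER usage (database first, then .las files).]

cd /path/to/run/0-rawreads/m_00016
LAcheck -v ../raw_reads L2.16.1 L2.16.2     # block-level inputs pass
LAmerge -v raw_reads.16 L2.16.1 L2.16.2     # re-merge the blocks
LAcheck -v ../raw_reads raw_reads.16        # merged file still fails: "Too few alignment records"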

hodgett commented 7 years ago

Unfortunately it will be difficult to get a stack trace; I don't have the access, and we are severely short on HPC sysadmin staff at the moment, so turn-around for this type of assistance is long.

I found my test run failed with v1.8.0 again. With --n_core 8, the logs showed worker threads for fc_consensus continuously re-spawning until the walltime was hit. I re-ran the process, this time setting --n_core 0 in falcon_sense_option, and it completed flawlessly. I am now trying the same test with v1.8.6 to see if it is even possible for a run to complete with this version. I will post the result.

hodgett commented 7 years ago

Well, that's interesting. Allocating 1 cpu to sge_option_cns and setting n_core 0 seems to be effective. I have had the run using v1.8.6 successfully complete for the first time.

pb-cdunn commented 7 years ago

I don't think there are any race conditions for these tasks, so I suspect you are having filesystem latency problems, though I'm not sure. Have you tried use_tmpdir = /tmp?
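
[For reference, that suggestion is a single line in the [General] section of fc_run.cfg, e.g.:]

[General]
# Stage each task's working directory on node-local /tmp instead of the shared filesystem.
use_tmpdir = /tmp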

hodgett commented 7 years ago

I would be more than a little surprised if the problem turns out to be latency; I have tried this on our Lustre and our SAN storage at what is a quiet time of year for the system (in comparison). And I have now tried again, using n_core 8 and use_tmpdir=/data1, where /data1 is disk local to the node. It failed with no logs created in mypwatcher/log for these processes; they are hung at 0% CPU. I am including as a sample the script in the tmpdir /data1/hodgett/pypetmp/lustre/work-lustre/Falcon_Test-20161227-1.8.6/0-rawreads/preads/cns_00002/c_00002.sh

#!/bin/bash
set -vex

set -o pipefail
CUTOFF=4000
LA4Falcon -H$CUTOFF -fo  /lustre/work-lustre/Falcon_Test-20161227-1.8.6/0-rawreads/raw_reads.db /lustre/work-lustre/Falcon_Test-20161227-1.8.6/0-rawreads/m_00002/raw_reads.2.las | fc_consensus --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 8 >| cns_00002.fasta

touch out.done

I also note that when I used n_core 0 (which did work), the script c_00002.sh has a sub-script, c_00002.sub.sh. c_00002.sh:

set -vex
cd /lustre/work-lustre/banana_team/Falcon_Test-20161223b/0-rawreads/preads/cns_00002
trap 'touch /lustre/work-lustre/banana_team/Falcon_Test-20161223b/0-rawreads/preads/cns_00002/cns_00002_done.exit' EXIT
ls -il c_00002.sub.sh
hostname
ls -il c_00002.sub.sh
time /bin/bash ./c_00002.sub.sh
touch /lustre/work-lustre/banana_team/Falcon_Test-20161223b/0-rawreads/preads/cns_00002/cns_00002_done

c_00002.sub.sh:

#!/bin/bash
set -vex

set -o pipefail
CUTOFF=4000
LA4Falcon -H$CUTOFF -fo  /lustre/work-lustre/banana_team/Falcon_Test-20161223b/0-rawreads/raw_reads /lustre/work-lustre/banana_team/Falcon_Test-20161223b/0-rawreads/m_00002/raw_reads.2.las | fc_consensus --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 1 >| cns_00002.fasta

So, obviously there is a difference between the two: n_core 0 vs. >0. I am convinced there is a bug in there somewhere; I don't know if it is only on our system. Unfortunately I don't have the tools, capability, or system-level access to debug it. I will instruct the guys to use n_core 0 until we have a chance to work it out in the coming months.
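
[For anyone hitting the same hang: the workaround described above amounts to one line in fc_run.cfg. A sketch; the other falcon_sense_option values are copied from the generated scripts shown in this thread.]

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 0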

lijingjing1 commented 7 years ago

What is your raw_reads.2.las file size, and the sizes of the other .las files? I guess they may be 2 GB or bigger. Most likely your length cutoff and daligner argument settings make the .las files too big and full of redundant alignments, which can cause this situation at the consensus step. You could also have a look at your machine's memory and cache.

pb-cdunn commented 7 years ago

they are hung with 0% cpu

There must be a reason for that. We've never seen that behavior ourselves.

Maybe the filesystem has told your process that the file-size is Y, but only X bytes are available, so the call to fread() has blocked. And maybe such blocking prevents the file-system from updating the file to the latest version. That's why I say it could be "latency". Tricky issue. I'd love to investigate, but there is only so much I can do remotely.

mictadlo commented 7 years ago

Hi, on what kind of environment does FALCON run successfully? By any chance, do you provide a Docker container with FALCON and FALCON_unzip?

Thank you in advance.

Mic

pb-cdunn commented 7 years ago

I would love to provide a Docker container. In fact, a sharp coder next to me on a NYE flight was extolling the virtues of Docker. But we cannot use that as our main solution because too many users/customers have an operating system too old for Docker. Sad but true!

Internally, we use SGE on NFS, often Lustre. We would like it to work everywhere, but we always need help from users on how to run their job-distribution systems.

mictadlo commented 7 years ago

@pb-cdunn and @pb-jchin: I found out that my University is able to provide you with a Visitor Account, which would allow you to access our HPC cluster where FALCON is installed. This would give you the opportunity to debug FALCON on our cluster. In order to provide you with this access, my University asks me to give them the following information about you:

Could you please send me the above information to m.lorenc at qut.edu.au?

We are looking forward to assembling our in-house sequenced (PacBio Sequel) genome with Falcon.

Thank you in advance.

Michal

pb-cdunn commented 7 years ago

This is what we use for blocking qsub calls:

[General]
use_tmpdir = true
pwatcher_type = blocking
job_type = string
job_queue =  qsub -S /bin/bash -sync y -V -q myqueue -N ${JOB_ID}
  -o "${STDOUT_FILE}"
  -e "${STDERR_FILE}"
  -pe smp ${NPROC}
  "${CMD}"

For PBS, you might need to change parts of that. E.g., apparently -sync y should become

-W umask=0007,block=true

Please let us know if you find something that works. Note that the sge_option strings are ignored when using "blocking" mode. (And note that, for now, all jobs use the same NPROC, but we will address that soon if people are actually using blocking mode successfully.) You should still adjust XX_concurrent_jobs as needed.

It is wise to work this out using the tiny synth5k from FALCON-examples, or using test-pypeflow. You should not run a large genome until all job-distribution issues are completely solved first.

pb-cdunn commented 7 years ago

Also, you can first test on your local machine (i.e. not using qsub/PBS) with job_queue=bash -C ${CMD}. In that case, be sure that your machine has at least as many processors as your XX_concurrent_jobs settings.
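
[A minimal sketch of such a local-only setup; everything else stays as in your fc_run.cfg.]

[General]
pwatcher_type = blocking
job_type = string
# Run each task directly in a local bash instead of submitting to a scheduler.
job_queue = bash -C ${CMD}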

pb-cdunn commented 7 years ago

I've updated the wiki for help with the "blocking" process-watcher:

hodgett commented 7 years ago

Ok, I have done some testing with blocking on our system, which uses PBS Pro. It looked promising to start with, but then failed with something I've not seen before: exit 15. The following is the fc_run.cfg used:

job_type = pbs
input_fofn = input.fofn
input_type = raw
length_cutoff = 500
length_cutoff_pr = 500
use_tmpdir = true
pwatcher_type = blocking
job_queue = qsub -S /bin/bash -V -q lyra -N ${JOB_ID} -o "${STDOUT_FILE}" -e "${STDERR_FILE}" -l nodes=1:ppn=${NPROC},mem=25gb -W umask=0007,block=true "${CMD}"

sge_option = -l nodes=1:ppn=2,walltime=128:00:00,mem=20gb -W umask=0007,block=true
sge_option_da = -l nodes=1:ppn=4,walltime=96:00:00,mem=25gb -W umask=0007,block=true
sge_option_la = -l nodes=1:ppn=2,walltime=96:00:00,mem=24gb -W umask=0007,block=true
sge_option_pda = -l nodes=1:ppn=8,walltime=96:00:00,mem=25gb -W umask=0007,block=true
sge_option_pla = -l nodes=1:ppn=2,walltime=96:00:00,mem=26gb -W umask=0007,block=true
sge_option_fc = -l nodes=1:ppn=8,walltime=96:00:00,mem=25gb -W umask=0007,block=true
sge_option_cns = -l nodes=1:ppn=12,walltime=96:00:00,mem=24gb -W umask=0007,block=true

pa_concurrent_jobs = 32
cns_concurrent_jobs = 32
ovlp_concurrent_jobs = 32

pa_HPCdaligner_option =  -v -B128 -e.70 -l1000 -s1000 -M24
ovlp_HPCdaligner_option = -v -B128 -h60 -e.96 -l500 -s1000 -M24

pa_DBsplit_option = -x500 -s200
ovlp_DBsplit_option = -x500 -s200

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 8
overlap_filtering_setting = --max_diff 80 --max_cov 80 --min_cov 2 --bestn 10 --n_core 12

This is a snippet from all.log containing the error:

2017-02-14 03:25:18,535 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 8 (max_jobs=32)
2017-02-14 03:25:18,536 - pwatcher.blocking - DEBUG - query(which='list', jobids=<8>)
2017-02-14 03:25:18,536 - pypeflow.simple_pwatcher_bridge - INFO - sleep 1.2s
2017-02-14 03:25:19,595 - pwatcher.blocking - DEBUG - rc: 15
2017-02-14 03:25:19,737 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 8 (max_jobs=32)
2017-02-14 03:25:19,738 - pwatcher.blocking - DEBUG - query(which='list', jobids=<8>)
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - ERROR - Task Node(0-rawreads/job_0007) failed with exit-code=15
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - DEBUG - recently_done: [(Node(0-rawreads/job_0007), False)]
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - DEBUG - Num done in this iteration: 1
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - INFO - recently_satisfied: set([])
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - INFO - Num satisfied in this iteration: 0
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - INFO - Num still unsatisfied: 14
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - ERROR - Some tasks are recently_done but not satisfied: set([Node(0-rawreads/job_0007)])
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - ERROR - ready: set([])
submitted: set([Node(0-rawreads/job_0003), Node(0-rawreads/job_0005), Node(0-rawreads/job_0009), Node(0-rawreads/job_0008), Node(0-rawreads/job_0000), Node(0-rawreads/job_000c), Node(0-rawreads/job_0001)])
2017-02-14 03:25:19,738 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 7 (max_jobs=32)

hodgett commented 7 years ago

Hmm, might be my bad. Walltime exceeded, will need to specify a longer one in job-queue. I will try again.
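
[For example, the walltime can go into the resource list of the job_queue line itself. A sketch based on the config above; the 96-hour value is an assumption matching the sge_option lines.]

job_queue = qsub -S /bin/bash -V -q lyra -N ${JOB_ID} -o "${STDOUT_FILE}" -e "${STDERR_FILE}" -l nodes=1:ppn=${NPROC},mem=25gb,walltime=96:00:00 -W umask=0007,block=true "${CMD}"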

pb-cdunn commented 7 years ago

How did it go? (Note that you can usually get more info by looking into the run-dir for a failing job.)

job_type = pbs
pwatcher_type = blocking
job_queue = qsub -S /bin/bash -V -q lyra -N ${JOB_ID} -o "${STDOUT_FILE}" -e "${STDERR_FILE}" -l nodes=1:ppn=${NPROC},mem=25gb -W umask=0007,block=true "${CMD}"

Actually, with a "blocking" pwatcher, the job_type should be string, but I think it's ignored... Yes, you should see this message in your log:

    if job_type != 'string':
        log.warning("In blocking pwatcher, job_type={!r}, should be 'string'".format(job_type))

If blocking mode works for PBS (at least on your system), then we will switch to that by default. Way easier configuration.

hodgett commented 7 years ago

Hi, I'm largely off the grid at the moment; I am in Java with no keyboard or decent connectivity, so I can't do anything productive. What testing I have done so far suggests blocking mode doesn't work very well with PBS Pro. I haven't had time to delve too deeply into why. I'll know more next month when I'm back and have had time to investigate.

Matthew
