dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Stray processes on compute nodes after runs finish #268

Closed. joaks1 closed this issue 6 years ago.

joaks1 commented 7 years ago

I am having an issue where several stray ipcluster, python, and cat processes continue to run on compute nodes after assemblies finish. Our cluster uses the TORQUE job scheduler, and one of my example job submission scripts is below. The analyses run fine and exit without any warnings or errors, but a bunch of processes are still running and using a fair bit of resources (see the attached screenshot of 'top' on a compute node long after an assembly finished). Either I or the HPC admin have to ssh in, kill these processes on the compute nodes, and convince the scheduler that they are no longer using resources.
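For now the workaround is manual; what we run on each affected node is roughly the following (a sketch rather than the exact commands, and note that blindly killing every python or cat owned by the user would be too aggressive if anything else is running):

pgrep -u "$USER" -fl 'ipcluster|ipengine'   # list the stray controller/engine processes
pkill -u "$USER" -f 'ipcluster|ipengine'    # kill them once we're sure the job has finished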

I am using: conda 4.3.25, Python 2.7.13 :: Continuum Analytics, Inc., and ipyrad 0.7.13.

######################################################################

#! /bin/sh

#PBS -l nodes=1:ppn=20
#PBS -l walltime=60:00:00
#PBS -j oe
#PBS -l jobflags=ADVRES:jro0014_lab.56281

if [ -n "$PBS_JOBNAME" ]; then
    source "${PBS_O_HOME}/.bash_profile"
    cd "$PBS_O_WORKDIR"
    condaenv on condaenv
fi

ipyrad -p params-plates123-ct80.txt -s 345 -c 20 1>ipyrad-plates123-steps345-ct80.sh.out 2>&1
######################################################################

Here's the content of the 'params' file mentioned in the job submission script:

######################################################################
------- ipyrad params file (v.0.7.13)-------------------------------------------
plates123-ct80                 ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/gpfs01/scratch/jro0014/gekgo/data/genomes/msg/ipyrad ## [1] [project_dir]: Project dir (made in curdir if not present)
Merged: plate1, plate2, plate3 ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
Merged: plate1, plate2, plate3 ## [3] [barcodes_path]: Location of barcodes file
Merged: plate1, plate2, plate3 ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
rad                            ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TATG,                          ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
4                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.80                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
1                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
4                              ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
5                              ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
20                             ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8                              ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
0, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
n                              ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file

######################################################################

Many Thanks!

Jamie

[Attached screenshot from 2017-09-18 12-39-52: 'top' output on a compute node after the assembly finished]

isaacovercast commented 7 years ago

Hi Jamie,
That is a super annoying problem, thanks for reporting it. Considering that the assembly actually completed successfully, I'm sure this was a difficult problem to even identify!

I see from your params file that you've merged 3 plates after step 2. I am going to guess that there were samples replicated across the plates? There are only a couple of places in the code that use cat for the steps you ran, and I'm guessing replicated samples because I haven't seen this problem before and replicated samples are a bit unusual. Can you attach the ipyrad_log.txt to this issue? I will try to reproduce this behavior on my cluster.

isaacovercast commented 7 years ago

Hey Jamie,
If you can post the *.json file in your project directory, that would be helpful. Thx, -isaac

joaks1 commented 7 years ago

Thanks for the prompt reply, Isaac.

I merged three plates after step 2, but there are no replicated samples.

Yes, it's a difficult one to diagnose, and it seems to be inconsistent. Some jobs don't have this problem while others do, and I haven't been able to pin down any differences between the jobs that clean up after themselves and those that don't. However, this is mostly anecdotal at this point. I have been running lots of similar assemblies of the same data at the same time, and by the time jobs are queued and run and I notice they've left processes behind, I haven't been able to do a great job of tracing the problem back to the exact run responsible.

Sorry, I realize this is a terrible issue report, because I've been unable to pin down any details. Now that I know to look out for it, I will try to find a way to consistently replicate the problem.

I'm attaching the ipyrad_log.txt and an example json file, but with a couple of caveats: (1) I have been playing around with a lot of different parallel assemblies in the same directory, so the log file is going to be a mess. (2) I cannot guarantee that the json file is directly associated with a run that didn't clean up, but I can guarantee that it's nearly identical to one that did. Also, I changed the json extension to txt to make github happy.

Attachments: ipyrad_log.txt, plates123-ct80.txt

isaacovercast commented 7 years ago

Hi Jamie,

Mmmm, the log file is useful; I think I see the problem. I would bet a fair amount that this is involved:

2017-09-17 11:18:14,041     pid=7586    [cluster_within.py] ERROR   sample [RMB_13902_Cyrtodactylus_philippinicus_Luzon] failed in step [build_clusters]; error: IOError([Errno 2] No such file or directory: '/gpfs01/scratch/jro0014/gekgo/data/genomes/msg/ipyrad/plate1_edits/RMB_13902_Cyrtodactylus_philippinicus_Luzon_derep.fastq')  

I see tons of these for the run that started 2017-09-15 23:03. This is very typically (almost exclusively, in my experience) a disk space issue. Dereplication during step 3 creates temporary files, and if the cluster runs out of disk (or hits a user quota) you get this IOError: it can't create more files if the disk is full. The code doesn't handle the error well at this point, which is why it isn't reported accurately. The crucial insight is that this also explains the non-deterministic nature of the behavior. Depending on how much disk you have, how many parallel assemblies are running, and the timing of those assemblies (we clean up temporary files pretty aggressively), there may be more or less resource consumption. Sometimes everything runs fine, and sometimes disk usage piles up and you hit the limit. When that happens, the internal calls to cat are left as zombies because we aren't handling the error properly, and you get the behavior you're seeing. Slam dunk! 👍 🥇

I am almost certain this is a resource-allocation limit issue, either disk usage or a maximum number of files. Can you verify that you aren't hitting either of these limits? Each cluster seems to have its own unique quota system, some of which are more arcane than others.
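Something like the following should show whether you're close to either limit. This is just a sketch: the exact quota command varies by site and filesystem (GPFS vs Lustre, etc.), and the paths below are taken from your params file, so adjust them to your scratch area.

df -h /gpfs01/scratch                                      # free space on the scratch filesystem
quota -s                                                   # per-user quota, if one is enforced and reported this way
find /gpfs01/scratch/jro0014 -type f 2>/dev/null | wc -l   # rough count of files you own under scratch
ulimit -n                                                  # per-process open-file limit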

I will add some simple error checking during this step to trap resource-allocation issues, and I'll let you know when the checks are implemented and pushed.

Thanks for reporting! This is a very nasty bug. Let me know how this sounds from your end after you check out quota limits, etc, etc.

-isaac

isaacovercast commented 7 years ago

It's really hard to test these kinds of bugs because it's hard to replicate a system with pegged disk without causing a whole bunch of other headaches, which is kind of an upside-down excuse for why these bugs still exist in our codebase 😉 😜

joaks1 commented 7 years ago

Thanks, Isaac! This sounds possible. I will run it by my HPC admin and report back. We are not limited on available drive space (our scratch drive is enormous), but I wouldn't be surprised if there are (undocumented) per-job limits on space and/or file numbers. Hopefully this is the issue, and the HPC admin will simply lift the limits. I'll let you know what I learn.

joaks1 commented 7 years ago

Quick question: Do temp files get written to /tmp? Or are they all written within the assembly directory? Thanks!

isaacovercast commented 7 years ago

Temp files get written to the assembly directory in all cases.
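If you want to watch it while a step 3 job is running, counting the files under the project dir (the path from your params file) every minute or so should show whether temp files are piling up. Something like:

watch -n 60 'find /gpfs01/scratch/jro0014/gekgo/data/genomes/msg/ipyrad -type f | wc -l'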

joaks1 commented 7 years ago

Info from the HPC admin: we have a 10-million-file limit on /scratch. Any chance that ipyrad would produce that many temp files? Of course, I have some files on /scratch too, but ipyrad would have to create close to 10 million files to hit the limit.

isaacovercast commented 7 years ago

Hey Jamie,
That seems like an improbably high number of files for a denovo assembly; ipyrad just doesn't create that many temp files, even if you were running many assemblies simultaneously. What about the number of open files? I know that is sometimes an issue on some systems.

joaks1 commented 7 years ago

'ulimit -n' reports 32768 as the limit on the number of open files. I'm guessing Python itself will throw an 'IOError: Too many open files' long before that (on most systems I've worked on, Python doesn't allow much more than 1,000 files to be open).
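For what it's worth, a quick way to check the limit the Python process itself actually sees (it inherits the shell's soft limit) is:

python -c "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"   # prints (soft limit, hard limit)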

isaacovercast commented 7 years ago

Ah, yes, that's probably true. Good call.


dereneaton commented 6 years ago

Hey guys, just catching up on this. We've tried to wrap the code so that any errors encountered will always shut down the ipcluster instance, and therefore shut down any remote jobs, but it seems to be failing in this case. It's not clear to me from the error why that is. Leaving behind zombie cat jobs is totally weird. Hopefully we can figure it out.
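In the meantime, a belt-and-braces workaround on the job-script side (just a sketch, untested on TORQUE) would be to have the submission script clean up anything it leaves behind when it exits, e.g. by adding this near the top of the PBS script:

trap 'pkill -u "$USER" -f "ipcluster|ipengine"' EXIT   # kill any leftover ipcluster/engine processes when the job script exits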

joaks1 commented 6 years ago

After failing to replicate the zombie processes in the same HPC environment with the same version, I'm giving up and chalking up the previous occurrences to HPC "voodoo". If my HPC admin or I notice them again, I will re-open the issue, but for now, things are working smoothly. Thanks, and sorry for the trouble!

isaacovercast commented 6 years ago

Yeah, "HPC voodoo" is a well documented phenomenon. Seems like eye of newt and crystals aren't enough. Sometimes, you gotta get the eye of the exact species of newt, and sometimes obsidian works better than quartz.... Generally I consider HPC configurations to display non-deterministic behavior. If only we could harness this for our research! (a la quantum computing) ... Glad it's working for you now. Thanks for reporting...