broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License

ExtractIlluminaBarcodes and IlluminaBasecallsToSam crash with `too many open files` error #1714

Closed matthdsm closed 3 years ago

matthdsm commented 3 years ago

Bug Report

Affected tool(s)

ExtractIlluminaBarcodes IlluminaBasecallsToSam

Affected version(s)

Description

When running ExtractIlluminaBarcodes and IlluminaBasecallsToSam, the tools fail intermittently with a `too many open files` error. This issue existed in v2.25.6, but seems to have become consistent in v2.26.0. Raising the user ulimit does not help, nor does setting MAX_RECORDS_IN_RAM to a higher value.

Steps to reproduce

Run either tool with mostly default params

        picard -Xmx~{memory_mb}m ExtractIlluminaBarcodes \
        --BASECALLS_DIR ~{basecalls_dir} \
        --OUTPUT_DIR . \
        --BARCODE_FILE ~{barcode_file} \
        --LANE ~{lane} \
        --METRICS_FILE "barcode_metrics_L~{lane}.txt" \
        --READ_STRUCTURE ~{read_structure} \
        --MAX_MISMATCHES ~{max_mismatches} \
        --MAX_RECORDS_IN_RAM 2000000 \
        --NUM_PROCESSORS ~{num_cpu}

        picard -Xmx~{memory_mb}m IlluminaBasecallsToSam \
        --BASECALLS_DIR ~{basecalls_dir} \
        --BARCODES_DIR . \
        --LANE ~{lane} \
        --READ_STRUCTURE ~{read_structure} \
        --RUN_BARCODE ~{run_barcode} \
        --SEQUENCING_CENTER CMGG \
        --LIBRARY_PARAMS ~{library_params} \
        --IGNORE_UNEXPECTED_BARCODES true \
        --INCLUDE_NON_PF_READS false \
        --MAX_RECORDS_IN_RAM 2000000 \
        --NUM_PROCESSORS ~{num_cpu}

Expected behavior

Tools should finish without error

Actual behavior

Tools fail with a `too many open files` error

First mentioned in https://github.com/broadinstitute/picard/issues/1699#issuecomment-899542458 ping @caherzee, this is the one we discussed at the meeting last week.

I'll provide a more detailed stacktrace shortly.

M

matthdsm commented 3 years ago

Full stacktrace with a bunch of different errors: stderr.txt. This includes the logs for both tools.

gbggrant commented 3 years ago

Hi @matthdsm what is your ulimit set at?

matthdsm commented 3 years ago

Hi George,

Thanks for your reply. Currently, the ulimit is set to 32000, but we've gone as high as 64000 without any changes.

M

gbggrant commented 3 years ago

Hi @matthdsm from the key developer on this:

```shell
echo "* soft nofile 64000" >> /etc/security/limits.conf
echo "* hard nofile 128000" >> /etc/security/limits.conf
sysctl -w fs.file-max=128000
sysctl -p
```

(64k soft, 128k hard)

Also, if you are running in a container you need to make sure the docker daemon includes the same options:

```shell
echo "OPTIONS=\"--default-ulimit nofile=64000:128000\"" > /etc/sysconfig/docker
```
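If changing the daemon-wide default is not an option, the same limits can also be set for a single container at run time. A sketch (the image name and command are placeholders, not from this thread):

```shell
# Per-container override of the open-file limit (soft:hard),
# without touching the docker daemon configuration.
docker run --ulimit nofile=64000:128000 broadinstitute/picard \
    ExtractIlluminaBarcodes --help
```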

matthdsm commented 3 years ago

Hi George, sadly that won't work: both the hard and soft limit are already set to ten times what you suggest.

Matthias

gbggrant commented 3 years ago

@matthdsm this is strange. A couple more questions / suggestions:

1) What sequencer are you using?
2) How many libraries?

Could you try running something like `watch -n 10 ls -l /proc/$$/fd` to watch the open file handle count as you are running these tools.

Also, just before running each, can you `cat /proc/sys/fs/file-max` - just to verify that the limit is set properly.
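The monitoring suggestion above boils down to counting entries under `/proc/<pid>/fd`. A minimal sketch, using the current shell's PID (`$$`) as a stand-in for the picard java process:

```shell
# Count the open file descriptors of a process via /proc (Linux only).
# $$ is the current shell's PID; substitute the picard PID when monitoring,
# e.g. from `pgrep -f ExtractIlluminaBarcodes`.
fd_count=$(ls /proc/$$/fd | wc -l)
echo "$fd_count"
```

Wrapping the count in `watch -n 10 '...'` gives the live 10-second refresh suggested above.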

matthdsm commented 3 years ago

We're using a NovaSeq 6000 with S1 flowcells (for now). Typically there are 2-3 libraries on both lanes. We would like to expand this to S4 flowcells with multiple libraries containing several hundred samples.

I'll go ahead and try to monitor the open filehandles on our next run.

Thanks M

matthdsm commented 3 years ago
```shell
$ sudo cat /proc/sys/fs/file-max
6550835
```

ExtractIlluminaBarcodes starts throwing errors at 16377 open fh's. The runtime also seems to have worsened compared to v2.25.6: ExtractIlluminaBarcodes used to run in 10 min with 30 threads on v2.25.6, while it took 43 min with 70(!) threads on v2.26.1.

I've added the output here for reference

Thanks for looking into this. M

matthdsm commented 3 years ago

When localising the cbcl files, the open fh's hover at around 2K files and the process finishes without errors in about 3 min (v2.26.1). Same thing for v2.25.6, but then it finishes in about 2 min.

IlluminaBasecallsToSam (v2.26.1) also finishes successfully, although it took forever (1h50m runtime on 70 threads). The open fh's hover around 22K for the entire process.

Keeping all this in mind, I can surmise that the `too many open files` error occurred because the basecalls were read from a network share instead of local storage.

PS: all of the numbers above are for only one lane, so double them for the complete flowcell.
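The localisation step described above can be sketched as follows (the run path and scratch directory are hypothetical placeholders, not from this thread):

```shell
# Hypothetical paths: adjust to the actual run folder and local scratch disk.
RUN_DIR=/mnt/nfs/runs/my_run/Data/Intensities/BaseCalls
LOCAL_DIR=/scratch/basecalls

# Copy the basecalls from the network share to fast local storage before
# running ExtractIlluminaBarcodes / IlluminaBasecallsToSam against LOCAL_DIR.
rsync -a "$RUN_DIR/" "$LOCAL_DIR/"
```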

gbggrant commented 3 years ago

Yes, that makes a lot of sense. Maybe NFS has its own file limit, or has more overhead with regard to keeping track of file usage. It's definitely going to be a lot more performant doing EIB and IlluminaBasecallsToSam on local storage. You might also want to look at using IlluminaBasecallsToSam's new MATCH_BARCODES_INLINE parameter - this adds inline demultiplexing, so you should be able to get around using the EIB call.
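A sketch of what the inline-demultiplexing invocation might look like, reusing the placeholder variables from the commands in the issue body; the exact set of companion options is an assumption, so check `picard IlluminaBasecallsToSam --help` before relying on it:

```shell
# Hypothetical single-step invocation: no prior ExtractIlluminaBarcodes run
# and no --BARCODES_DIR, since barcodes are matched inline.
picard -Xmx~{memory_mb}m IlluminaBasecallsToSam \
    --BASECALLS_DIR ~{basecalls_dir} \
    --LANE ~{lane} \
    --READ_STRUCTURE ~{read_structure} \
    --RUN_BARCODE ~{run_barcode} \
    --LIBRARY_PARAMS ~{library_params} \
    --MATCH_BARCODES_INLINE true \
    --NUM_PROCESSORS ~{num_cpu}
```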

matthdsm commented 3 years ago

I didn't know the MATCH_BARCODES_INLINE option even existed! This seems like a very useful addition, since it eliminates the need for using both a barcodes and a library_params file. Are you open to feature requests (in a separate issue)? Currently we use the barcodes file from EIB to get the most frequent undetermined barcodes. It would be useful if these were added to the barcode metrics file output (cf. the bcl2fastq stats).

M

matthdsm commented 3 years ago

I've opened a new issue (#1722) with my question. This issue seems resolved to me, since it was caused mostly by FS limitations.

Thanks for all the help! Matthias