broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License

ExtractIlluminaBarcodes opening a lot of files, error "Too many open files" #1801

Open GATKSupportTeam opened 2 years ago

GATKSupportTeam commented 2 years ago

This request was created from a contribution made by Robert Altwasser on April 19, 2022 10:09 UTC.

Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/5461192217627-Picard-Too-many-open-files-

--

I am demultiplexing an S4 sequencing run, and Picard ExtractIlluminaBarcodes opens too many files, which crashes the run. It's dual-index data with UMIs, and I need unmapped BAM files that include the UMI sequence. I checked the MD5 sums of the raw data several times and also ran a check on the BaseCalls directory.

I monitored the open files of the process with 'lsof'; the count quickly exceeds 120000, which is the maximum I can set with 'ulimit -n'.
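For anyone reproducing this check, the comparison between open descriptors and the limit can be sketched as a small shell helper. This is a minimal sketch, assuming a Linux system with /proc; the function name `count_open_fds` is hypothetical and not part of Picard. Counting the entries in /proc/<pid>/fd typically gives a smaller number than 'lsof', which also lists memory-mapped files and similar entries.

```shell
#!/usr/bin/env bash
# Count the file descriptors a process currently holds (Linux only).
# count_open_fds is a hypothetical helper name, not part of Picard.
count_open_fds() {
    local pid="${1:-$$}"
    # Each entry in /proc/<pid>/fd is one open descriptor.
    # A missing or unreadable directory is reported as 0.
    ls "/proc/${pid}/fd" 2>/dev/null | wc -l
}

# Soft and hard limits for the current shell; processes started from
# this shell inherit them.
echo "soft limit: $(ulimit -Sn)"
echo "hard limit: $(ulimit -Hn)"
echo "open descriptors in this shell: $(count_open_fds $$)"
```

To watch a running Picard job, pass the PID of its Java process instead of `$$`.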

Here is the RunInfo:

<Read Number="1" NumCycles="148" IsIndexedRead="N"/>
<Read Number="2" NumCycles="17" IsIndexedRead="Y"/>
<Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
<Read Number="4" NumCycles="148" IsIndexedRead="N"/>

a) Versions:

The Genome Analysis Toolkit (GATK) v4.2.5.0

HTSJDK Version: 2.24.1

Picard Version: 2.25.4

Java: openjdk version "1.8.0_312"

b) Exact command used:

(bash) $ ulimit -n 100000
picard -Xmx110g -Djava.io.tmpdir=/data/gpfs-1/users/altwassr_c/scratch/tmp/ -Xms110g \
    ExtractIlluminaBarcodes \
    -B /data/gpfs-1/users/altwassr_c/scratch/data/220325_A00643/Data/Intensities/BaseCalls/ \
    -L 1 \
    --NUM_PROCESSORS 1 \
    -M metrices/barcode_metrices1.txt \
    -BARCODE_FILE /data/gpfs-1/users/altwassr_c/work/projekte/barcode1.csv \
    -RS 148T8B9M8B148T \
    --MAX_RECORDS_IN_RAM 1000000000 \
    --TMP_DIR /data/gpfs-1/users/altwassr_c/scratch/tmp/
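As a sanity check on the read structure passed with -RS, the segment lengths can be summed and compared with the cycle counts in the RunInfo above. The following is a minimal sketch; the helper name `read_structure_cycles` is hypothetical, and it assumes the read structure uses only the operators T, B, M, and S with explicit lengths.

```shell
#!/usr/bin/env bash
# Sum the cycle counts in a read structure such as 148T8B9M8B148T.
# read_structure_cycles is a hypothetical helper, not a Picard command.
read_structure_cycles() {
    local total=0 segment
    # Split into <length><operator> segments, e.g. 148T, 8B, 9M.
    for segment in $(printf '%s\n' "$1" | grep -oE '[0-9]+[TBMS]'); do
        # Strip the trailing operator letter and add the length.
        total=$(( total + ${segment%?} ))
    done
    echo "$total"
}

# Cycles per read from the RunInfo above: 148 + 17 + 8 + 148.
runinfo_cycles=$(( 148 + 17 + 8 + 148 ))
structure_cycles=$(read_structure_cycles "148T8B9M8B148T")

echo "RunInfo cycles:        ${runinfo_cycles}"
echo "Read structure cycles: ${structure_cycles}"
```

Both sums come to 321 here, with the 17-cycle index read split into an 8-base sample barcode and a 9-base UMI, so the read structure is consistent with the RunInfo.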

c) Log:

ERROR   2022-04-19 04:41:06     ExtractIlluminaBarcodes Error processing tile 2140
picard.PicardException: File not found: (/data/gpfs-1/users/altwassr_c/scratch/data/220325_A00643_0438_BH22YTDSX2/Data/Intensities/BaseCalls/L002/C237.1/L002_1.cbcl)
        at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:93)
        at picard.illumina.parser.readers.CbclReader.readHeader(CbclReader.java:127)
        at picard.illumina.parser.readers.CbclReader.readTileData(CbclReader.java:200)
        at picard.illumina.parser.readers.CbclReader.advance(CbclReader.java:275)
        at picard.illumina.parser.readers.CbclReader.hasNext(CbclReader.java:252)
        at picard.illumina.parser.NewIlluminaDataProvider.hasNext(NewIlluminaDataProvider.java:125)
        at picard.illumina.ExtractIlluminaBarcodes$PerTileBarcodeExtractor.run(ExtractIlluminaBarcodes.java:363)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: /data/gpfs-1/users/altwassr_c/scratch/data/220325_A00643_0438_BH22YTDSX2/Data/Intensities/BaseCalls/L002/C237.1/L002_1.cbcl (Too many open files)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at picard.illumina.parser.readers.BaseBclReader.open(BaseBclReader.java:90)
        ... 11 more
INFO    2022-04-19 04:41:06     ExtractIlluminaBarcodes Extracting barcodes for tile 2141
ERROR   2022-04-19 04:41:06     ExtractIlluminaBarcodes Error processing tile 2141
picard.PicardException: Unrecognized data type(Cbcl) found by IlluminaDataProviderFactory!
        at picard.illumina.parser.IlluminaDataProviderFactory.makeParser(IlluminaDataProviderFactory.java:400)
        at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:249)
        at picard.illumina.parser.IlluminaDataProviderFactory.makeDataProvider(IlluminaDataProviderFactory.java:228)
        at picard.illumina.ExtractIlluminaBarcodes$PerTileBarcodeExtractor.run(ExtractIlluminaBarcodes.java:355)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

(created from Zendesk ticket #281653)
gz#281653

gbrandt6 commented 2 years ago

@gbggrant this error came up on the GATK Forum. Is something going wrong in ExtractIlluminaBarcodes that causes it to open more than 120000 files? The command shown sets a limit of 100000 with 'ulimit -n', and the user has already tried increasing --MAX_RECORDS_IN_RAM.

gbggrant commented 2 years ago

We've seen some reports of this. I believe Fulcrum Genomics (who submitted some recent changes to this code) is looking into it.