broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
976 stars 369 forks source link

CrosscheckFingerprints NIO streaming of BAMs hangs tool indefinitely #1175

Closed sooheelee closed 5 years ago

sooheelee commented 6 years ago

Bug Report

DavidR says to bring this to your attention @yfarjoun.

Affected tool(s)

CrosscheckFingerprints called from GATK4.0.4.0

Affected version(s)

Description

Steps to reproduce

Leaving this to Picard folks to try to reproduce.

Expected behavior

Looks like either (i) the tool isn't compatible with NIO and should produce an error or (ii) there is a bug in the NIO process for this tool that should be fixed.

sooheelee commented 6 years ago

Should reply back to the researcher about whether the tool cannot use NIO or that this is a bug that is being fixed.

ruslanafr commented 6 years ago

Any estimate on when this patch might come out? Our team really needs it as soon as possible. Thanks!

yfarjoun commented 6 years ago

I'm a little confused about this, as we most certainly use CrosscheckFingerprints in production using the NIO capabilities. Could you try using the Picard cloudJar for this just to minimized one additional variable?

Also, please make sure that the index for the bam is available and readable.

sooheelee commented 6 years ago

Looks like the Picard cloudjar has been a feature starting with release v2.18.1 (March 26, 2018). The standalone jars are downloadable from the normal Picard release page at https://github.com/broadinstitute/picard/releases. If this works for you @ruslanafr, then I would recommend continuing to use this jar until we figure out what is going on with calling Picard from GATK.

sooheelee commented 6 years ago

@yfarjoun, @ruslanafr is using the tool with two bams as input:

-I gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-TTP1-A-1-1-D-A553-36/REBC-AC8T-TTP1-A-1-1-D-A553-36.bam \
-I gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-NB1-A-1-0-D-A525-36/REBC-AC8T-NB1-A-1-0-D-A525-36.bam \

Would this make a difference?

yfarjoun commented 6 years ago

That shouldn't make any difference.

lbergelson commented 6 years ago

It could be their hitting some connection problem and going into a retry loop or some sort. We have pretty insanely long retry attempts configured. 20 hours seems long, but it's possible something isn't working right in the retry code. If they could get a few live stack traces of the running program to see what it's doing it might help? If they can access the machine it runs on they can do so using jstack.

On Sat, May 26, 2018, 1:27 PM Yossi Farjoun notifications@github.com wrote:

That shouldn't make any difference.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/picard/issues/1175#issuecomment-392275868, or mute the thread https://github.com/notifications/unsubscribe-auth/AEe4rFERazCK0NTMxrmYgDuSuxcQ34hNks5t2ZCCgaJpZM4UHMzl .

sooheelee commented 6 years ago

@lbergelson, @ruslanafr is a researcher at the Broad. From the forum post:

/usr/local/jre1.8.0_73/bin/java -Xmx14g -jar /gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar CrosscheckFingerprints \
-I gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-TTP1-A-1-1-D-A553-36/REBC-AC8T-TTP1-A-1-1-D-A553-36.bam \
-I gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-NB1-A-1-0-D-A525-36/REBC-AC8T-NB1-A-1-0-D-A525-36.bam \
-H /cromwell_root/firecloud-tcga-open-access/tutorial/reference/Homo_sapiens_assembly19.haplotype_database.txt \
--QUIET false --EXIT_CODE_WHEN_MISMATCH 0 \
--OUTPUT crosscheck.stats.txt \
--VALIDATION_STRINGENCY LENIENT
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/fc-4aacc2d6-4017-4fc2-b95a-fb892a3562b9/70dc8591-32c2-4de1-b2f8-2af887f2a8e3/Clinical_Workflow/6794a873-5bbb-46ef-ae2c-a10e77dab94e/call-CrossCheckLaneFingerprints_Task/attempt-2/tmp.197aeb50
17:12:16.387 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu May 17 17:12:16 UTC 2018] CrosscheckFingerprints  --INPUT gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-TTP1-A-1-1-D-A553-36/REBC-AC8T-TTP1-A-1-1-D-A553-36.bam --INPUT gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-NB1-A-1-0-D-A525-36/REBC-AC8T-NB1-A-1-0-D-A525-36.bam --OUTPUT crosscheck.stats.txt --HAPLOTYPE_MAP /cromwell_root/firecloud-tcga-open-access/tutorial/reference/Homo_sapiens_assembly19.haplotype_database.txt --EXIT_CODE_WHEN_MISMATCH 0 --QUIET false --VALIDATION_STRINGENCY LENIENT  --CROSSCHECK_MODE CHECK_SAME_SAMPLE --LOD_THRESHOLD 0.0 --CROSSCHECK_BY READGROUP --NUM_THREADS 1 --CALCULATE_TUMOR_AWARE_RESULTS true --ALLOW_DUPLICATE_READS false --GENOTYPING_ERROR_RATE 0.01 --OUTPUT_ERRORS_ONLY false --LOSS_OF_HET_RATE 0.5 --EXPECT_ALL_GROUPS_TO_MATCH false --VERBOSITY INFO --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu May 17 17:12:16 UTC 2018] Executing as root@45dc4fe55a48 on Linux 4.9.0-0.bpo.6-amd64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.0.4.0
INFO    2018-05-17 17:12:18 CrosscheckFingerprints  Fingerprinting 2 INPUT files.

But then the task just gets stuck without printing out anything else.

@ruslanafr, is it possible for you to access the machine you are running this on and try what @lbergelson suggests above with jstack?

ruslanafr commented 6 years ago

This specific submission was run on FireCloud, but I tried it on my Mac and got the same results. I can try using jstack on Tuesday.

Thanks! Ruslana

On May 26, 2018, at 18:47, sooheelee notifications@github.com<mailto:notifications@github.com> wrote:

@lbergelsonhttps://github.com/lbergelson, @ruslanafrhttps://github.com/ruslanafr is a researcher at the Broad. From the forum post:

/usr/local/jre1.8.0_73/bin/java -Xmx14g -jar /gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar CrosscheckFingerprints \ -I gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-TTP1-A-1-1-D-A553-36/REBC-AC8T-TTP1-A-1-1-D-A553-36.bam \ -I gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-NB1-A-1-0-D-A525-36/REBC-AC8T-NB1-A-1-0-D-A525-36.bam \ -H /cromwell_root/firecloud-tcga-open-access/tutorial/reference/Homo_sapiens_assembly19.haplotype_database.txt \ --QUIET false --EXIT_CODE_WHEN_MISMATCH 0 \ --OUTPUT crosscheck.stats.txt \ --VALIDATION_STRINGENCY LENIENT Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/fc-4aacc2d6-4017-4fc2-b95a-fb892a3562b9/70dc8591-32c2-4de1-b2f8-2af887f2a8e3/Clinical_Workflow/6794a873-5bbb-46ef-ae2c-a10e77dab94e/call-CrossCheckLaneFingerprints_Task/attempt-2/tmp.197aeb50 17:12:16.387 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.0.4.0/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so [Thu May 17 17:12:16 UTC 2018] CrosscheckFingerprints --INPUT gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-TTP1-A-1-1-D-A553-36/REBC-AC8T-TTP1-A-1-1-D-A553-36.bam --INPUT gs://fc-66b68450-9d98-490f-934f-a9d824aac4be/REBC-AC8T-NB1-A-1-0-D-A525-36/REBC-AC8T-NB1-A-1-0-D-A525-36.bam --OUTPUT crosscheck.stats.txt --HAPLOTYPE_MAP /cromwell_root/firecloud-tcga-open-access/tutorial/reference/Homo_sapiens_assembly19.haplotype_database.txt --EXIT_CODE_WHEN_MISMATCH 0 --QUIET false --VALIDATION_STRINGENCY LENIENT --CROSSCHECK_MODE CHECK_SAME_SAMPLE --LOD_THRESHOLD 0.0 --CROSSCHECK_BY READGROUP --NUM_THREADS 1 --CALCULATE_TUMOR_AWARE_RESULTS true --ALLOW_DUPLICATE_READS false --GENOTYPING_ERROR_RATE 0.01 --OUTPUT_ERRORS_ONLY false --LOSS_OF_HET_RATE 0.5 --EXPECT_ALL_GROUPS_TO_MATCH false --VERBOSITY INFO --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false [Thu May 17 17:12:16 UTC 2018] Executing as root@45dc4fe55a48 on Linux 4.9.0-0.bpo.6-amd64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.0.4.0 INFO 2018-05-17 17:12:18 CrosscheckFingerprints Fingerprinting 2 INPUT files.

But then the task just gets stuck without printing out anything else.

@ruslanafrhttps://github.com/ruslanafr, is it possible for you to access the machine you are running this on and try what @lbergelsonhttps://github.com/lbergelson suggests above with jstack?

- You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/broadinstitute/picard/issues/1175#issuecomment-392292752, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AZj7KLJF42SL6rq28XtiBix1pQ0KpKrfks5t2duQgaJpZM4UHMzl.

kcibul commented 6 years ago

I ran into this same problem myself today running from both local as well as a GCE node. Credentials/ADC is fine, and I can gsutil the files and run them locally no problem so I don't think it's a credentials issue.

I also looked at the production/best practice pipeline, and it doesn't look like it's using NIO. Here's the snipped from the task which is localizing files

task CrossCheckFingerprints {
  Array[File] input_bams
  Array[File] input_bam_indexes

Finally, in case it helps, I took a threaddump while it was running (30+ minutes).

"pool-2-thread-1" #14 prio=5 os_prio=0 tid=0x00007f5d58ee9800 nid=0x2c7 runnable [0x00007f5d23b80000] java.lang.Thread.State: RUNNABLE at com.sun.crypto.provider.GaloisCounterMode.doLastBlock(GaloisCounterMode.java:394) at com.sun.crypto.provider.GaloisCounterMode.decryptFinal(GaloisCounterMode.java:552) at com.sun.crypto.provider.CipherCore.finalNoPadding(CipherCore.java:1046) at com.sun.crypto.provider.CipherCore.doFinal(CipherCore.java:983) at com.sun.crypto.provider.AESCipher.engineDoFinal(AESCipher.java:491) at javax.crypto.Cipher.doFinal(Cipher.java:2377) at sun.security.ssl.CipherBox.decrypt(CipherBox.java:461) at sun.security.ssl.InputRecord.decrypt(InputRecord.java:172) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1015)

yfarjoun commented 6 years ago

interesting.

The place I know that we use crosscheck with NIO is in the joint-calling pipeline, but there we are comparing vcfs against vcfs, not bams against bams...so I'm wondering if there's a problem with the BAM implementation somehow....anyhow, good information will attempt to reproduce.

ruslanafr commented 6 years ago

I downloaded picardcloud.jar and ran it locally. This didn't work either. The printout was exactly the same as before. My credentials are ok since I can use gsutil to copy the files to my machine.

This is the threaddump from jstack for this run:

"pool-2-thread-1" #15 prio=5 os_prio=31 tid=0x00007fe445249000 nid=0x1407 runnable [0x000070000d6e0000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) at sun.security.ssl.InputRecord.read(InputRecord.java:503) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)

This looks very similar to what @kcibul printed, so I think this is probably the same issue.

ruslanafr commented 6 years ago

I'm not sure if this issue is connected in any way, but I also tried to run the CleanSam tool (using both GATK4.0.4.0 and picardcloud.jar) with no success. The error message I got in both cases is:

Exception in thread "main" htsjdk.samtools.SAMException: Cannot read non-existent file: file:///Users/rfrazer/Projects/moap/testing_staging_area/gs:/5aa919de-0aa0-43ec-9ec3-288481102b6d/tcga/OV/WGA_RepliG/WXS/BI/ILLUMINA/TCGA_MC3.TCGA-04-1335-11A-01W-0489-09.bam at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:426) at htsjdk.samtools.util.IOUtil.assertFileIsReadable(IOUtil.java:413) at picard.sam.CleanSam.doWork(CleanSam.java:76) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

It's clear that the tool tries to access the given path from my current directory and not GCS.

The above stacktrace does indicate that CrosscheckFingerprints tries to access GCS, but I thought this is worth mentioning in case the two issues are somehow connected.

sooheelee commented 6 years ago

Fyi, I can use NIO on CRAMs and BAMs with GATKv4.0.4.0 tools (PrintReads and CNV tool), so the engine itself does support successful NIO.

lbergelson commented 6 years ago

@ruslanafr That error makes sense most picard tools don't support GCS at all.

sooheelee commented 6 years ago

@lbergelson, can you clarify further please? When a Picard tool is hooked up to the GATK engine, and is a walker or other such type of tool that should theoretically be able to use NIO, then can such tools perform NIO via the GATK engine? Or, do the Picard tools individually have to support NIO to work within GATK?

lbergelson commented 6 years ago

Most picard tools are imported directly into GATK without modification. There isn't any way to automatically upgrade them to use NIO. Each one needs to be individually modified and as far as I know most have not been.

ruslanafr commented 6 years ago

Any progress on this bug?

yfarjoun commented 6 years ago

Hi @ruslanafr

Thanks for reporting this. It seems that there has been a regression and that in HTSJDK reading many small intervals of a bam file over NIO is now unacceptably slow.

I've submitted an htsjdk issue: https://github.com/samtools/htsjdk/issues/1161 and hope that it will get resolved soon.