broadinstitute / long-read-pipelines

Long read production pipelines
https://broadinstitute.github.io/long-read-pipelines/
BSD 3-Clause "New" or "Revised" License
140 stars 24 forks source link

Too few reads in bam for Sniffles #111

Open brigranger opened 4 years ago

brigranger commented 4 years ago

In the PBCCSWholeGenomeSingleFlowcell workflow, in CallSVs, Sniffles task I got an error where it sounds like there's too few reads in the bam to estimate some parameter it needs? Log file excerpt follows:

++ samtools view -H /cromwell_root/fc-408ac13e-b06c-4835-b747-4258321b9a9b/8aaaf651-b226-426c-9580-40be7525723e/PBCCSWholeGenomeSingleFlowcell/8a3ed254-64b3-4266-8f57-08e5e1e0766b/call-MergeRuns/SM-JOTZQ_RW.m64020_200118_025318.bam
++ grep -m1 '^@RG'
++ sed 's/\t/\n/g'
++ grep '^SM:'
++ sed s/SM://g
+ SM=SM-JOTZQ_RW
+ sniffles -t 8 -m /cromwell_root/fc-408ac13e-b06c-4835-b747-4258321b9a9b/8aaaf651-b226-426c-9580-40be7525723e/PBCCSWholeGenomeSingleFlowcell/8a3ed254-64b3-4266-8f57-08e5e1e0766b/call-MergeRuns/SM-JOTZQ_RW.m64020_200118_025318.bam -v SM-JOTZQ_RW.m64020_200118_025318.sniffles.pre.vcf -s 3 -r 1000 -q 20 --genotype --report_seq --report_read_strands
Estimating parameter...
Too few reads detected in /cromwell_root/fc-408ac13e-b06c-4835-b747-4258321b9a9b/8aaaf651-b226-426c-9580-40be7525723e/PBCCSWholeGenomeSingleFlowcell/8a3ed254-64b3-4266-8f57-08e5e1e0766b/call-MergeRuns/SM-JOTZQ_RW.m64020_200118_025318.bam

This is using Terra with the method imported from dockstore: github.com/broadinstitute/long-read-pipelines/PBCCSWholeGenomeSingleFlowcellVersion: 2.0-dockstore-test-2

Let me know if any other details are needed.

SHuang-Broad commented 4 years ago

Hi Brian, thanks for the report.

Do you know the coverage of this input bam?

I have a suspicion that this bam might have a low coverage, hence Sniffles cannot estimate error rate and other parameters from the few reads.

brigranger commented 4 years ago

@SHuang-Broad I think you're right.

I wasn't able to find any specific coverage report for this bam, so I ran samtools idxstats on it and it appears it has one read aligned on chr2, and that's it. So it's not terribly surprising it failed.

I guess it would be nice if the pipeline failed a little more gracefully and maybe was able to continue on? But there's really nothing here to do much with...

kvg commented 4 years ago

Wow, one read? Is this a test BAM file or real data?

SHuang-Broad commented 4 years ago

@brigranger hmm.... Typically variants is the last step for the pipelines now, so if this one particular task fails, no downstream tasks are severely impacted practically speaking.

But I do think there could be more QC done at the beginning (or maybe middle) of the workflow, i.e. quit early when things are really suspicious.

What do you think, @kvg ?

kvg commented 4 years ago

This is an interesting one. Thinking ahead to amplicon sequencing, there will certainly be cases where we parallelize over chromosomes and some will have no data. So it looks like we'll have to figure out a way to protect ourselves from those kind of failures when the tools we're running don't have those protections built-in themselves.

How to do that robustly will require some thought. @brigranger can you send us a link to the original input BAM?

brigranger commented 4 years ago

@kvg You should be able to find it here: gs://broad-gp-pacbio/r64020_20200116_203442/2_B01/

SHuang-Broad commented 4 years ago

I looked at the subreads bam, it seems to be a low yield CCS flowcell.

gsutil du -sh gs://broad-gp-pacbio/r64020_20200116_203442/2_B01/m64020_200118_025318.subreads.bam
151.93 GiB   gs://broad-gp-pacbio/r64020_20200116_203442/2_B01/m64020_200118_025318.subreads.bam

So it looks like something in CCS that produced ultralow depth.

SHuang-Broad commented 4 years ago

@kvg , yes, I believe this is ultimately tied to what QC we put in each step of the pipeline, as by nature has to be an implement-as-we-encounter issue, considering that there could be many places where things go wrong.

SHuang-Broad commented 4 years ago

@brigranger is this the only occasion where you see the error? If so, I'm going to close this and deal with #113 instead as a solution.

brigranger commented 4 years ago

I've only tried one other sample, but that seemed to get past this. Fine with me to switch priorities.

On Sun, May 17, 2020, 4:28 PM Steve Huang notifications@github.com wrote:

@brigranger https://github.com/brigranger is this the only occasion where you see the error? If so, I'm going to close this and deal with #113 https://github.com/broadinstitute/long-read-pipelines/issues/113 instead as a solution.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/long-read-pipelines/issues/111#issuecomment-629855935, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIXJGTQ3CHFEZ6H4SJJC6LRSBCGLANCNFSM4MHBMOEQ .