bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

fgbio failing silently on generation of UMI consensus reads #2167

Closed ameynert closed 5 years ago

ameynert commented 6 years ago

New batch of samples, identical formatting, set-up, config to previous batch, which ran successfully. The UMI config is fastq_name.

Successful log file snippet looks like this:

[2017-10-25T16:55Z] INFO:umis.umis:Processed 16459742 alignments.
[2017-10-25T16:55Z] Index BAM file: 5628-sort.bam
[2017-10-25T16:57Z] UMI consensus fastq generation
[2017-10-25T16:57Z] [2017/10/25 17:57:13 | FgBioMain | Info] Executing GroupReadsByUmi from fgbio version 0.2.1-SNAPSHOT as ameyner2@node2c17.ecdf.ed.ac.uk on JRE 1.8.0_121-b15 with snappy
[2017-10-25T16:57Z] [2017/10/25 17:57:13 | FgBioMain | Info] Executing CallMolecularConsensusReads from fgbio version 0.2.1-SNAPSHOT as ameyner2@node2c17.ecdf.ed.ac.uk on JRE 1.8.0_121-b15 with snappy

Unsuccessful run log output is truncated:

[2017-11-21T11:44Z] INFO:umis.umis:Processed 19514399 alignments.
[2017-11-21T11:44Z] Index BAM file: 2935-sort.bam
[2017-11-21T12:13Z] UMI consensus fastq generation

fgbio was called (from command log):

[2017-11-21T11:44Z] /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/bcbio/galaxy/../anaconda/bin/samtools index -@ 4 /gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/align/2935/2935-sort.bam /gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/bcbiotx/tmpSpFTXE/2935-sort.bam.bai
[2017-11-21T12:13Z] unset JAVA_HOME && fgbio -Xms2g -Xmx8g -XX:+UseSerialGC --tmp-dir /gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/bcbiotx/tmpXGBn7V --async-io=true --compression=0 GroupReadsByUmi --edits=1 --min-map-q=1 -t RX -s adjacency -i /gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/align/2935/2935-sort.bam | fgbio -Xms2g -Xmx8g -XX:+UseSerialGC --tmp-dir /gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/bcbiotx/tmpXGBn7V --async-io=true --compression=0 CallMolecularConsensusReads --min-input-base-quality=2 --min-reads=1 --max-reads=100000 --output-per-base-tags=false --sort-order=:none: -i /dev/stdin -o /dev/stdout | fgbio -Xms2g -Xmx8g -XX:+UseSerialGC --tmp-dir /gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/bcbiotx/tmpXGBn7V --async-io=true --compression=0 FilterConsensusReads --min-reads=1 --min-base-quality=13 -r /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa -i /dev/stdin -o /dev/stdout | bamtofastq collate=1 T=/gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/align/2935/2935-sort-cumi-1-bamtofastq-tmp F=/gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/bcbiotx/tmpXGBn7V/2935-sort-cumi-1.fq.gz F2=/gpfs/igmmfs01/eddie/HGS-OvarianCancerB/work/2017-11-17.100ng-2935/bcbiotx/tmpXGBn7V/2935-sort-cumi-2.fq.gz tags=cD,cM,cE gz=1
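One way a long pipe like this can fail without surfacing an error is that, unless the wrapping shell sets pipefail, the exit status of the whole pipeline is just that of the last command (bamtofastq here), so a crash in an earlier fgbio stage can be invisible to the caller. A minimal demonstration of the general shell behaviour (not bcbio's actual wrapper):

```shell
# By default a pipeline's exit status is that of its LAST command,
# so a failure in an earlier stage is masked.
sh -c 'false | cat'
echo "default: exit=$?"    # exit=0 even though the first stage failed

# With bash's pipefail, the failing stage's status propagates.
bash -c 'set -o pipefail; false | cat'
echo "pipefail: exit=$?"   # exit=1
```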

I see there was a commit on 7 Oct to use a new version of fgbio; possibly this is the cause.

chapmanb commented 6 years ago

Alison; Thanks much for this report. I've been trying to reproduce and haven't been able to replicate this. It's running cleanly for me with the recent fgbio release (0.4.0) and the latest code. What behavior do you see in the problematic run? Does it hang at this point? What does the output directory align/2935/ look like? Are any consensus files being generated? Thanks for any additional information you can provide to help track down what is going on.

ameynert commented 6 years ago

It doesn't hang at all - the job exits with no errors. No consensus FASTQ is generated. Work directory contents look like this:

.:
total 0
drwx--S--- 3 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 10:24 align
drwx--S--- 2 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 10:24 align_prep
drwx--S--- 4 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 12:13 bcbiotx
drwx--S--- 2 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 10:23 checkpoints_parallel
drwx--S--- 2 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 10:23 log
drwx--S--- 2 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 10:23 provenance

./align:
total 0
drwx--S--- 2 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 12:12 2935

./align/2935:
total 1.8G
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 1.8G Nov 21 11:44 2935-sort.bam
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 4.6M Nov 21 12:12 2935-sort.bam.bai

./align_prep:
total 256K
lrwxrwxrwx 1 ameyner2 igmm_datastore_HGS-OvarianCancerB  98 Nov 21 10:24 2935_2935_R1.fastq.gz -> ../../../../../../../exports/igmm/eddie/HGS-OvarianCancerB/reads/2017-11-17.100ng/2935_R1.fastq.gz
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 13K Nov 21 10:24 2935_2935_R1.fastq.gz.gbi
lrwxrwxrwx 1 ameyner2 igmm_datastore_HGS-OvarianCancerB  98 Nov 21 10:24 2935_2935_R2.fastq.gz -> ../../../../../../../exports/igmm/eddie/HGS-OvarianCancerB/reads/2017-11-17.100ng/2935_R2.fastq.gz
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 13K Nov 21 10:24 2935_2935_R2.fastq.gz.gbi

./bcbiotx:
total 0
drwx--S--- 2 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 10:23 tmpGWWFKP
drwx--S--- 2 ameyner2 igmm_datastore_HGS-OvarianCancerB 512 Nov 21 12:13 tmpXGBn7V

./bcbiotx/tmpGWWFKP:
total 0

./bcbiotx/tmpXGBn7V:
total 0

./checkpoints_parallel:
total 0

./log:
total 512K
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 3.0K Nov 21 12:13 bcbio-nextgen-commands.log
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 134K Nov 21 12:13 bcbio-nextgen-debug.log
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB  937 Nov 21 10:24 bcbio-nextgen.log

./provenance:
total 256K
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 895 Nov 21 10:23 data_versions.csv
-rw------- 1 ameyner2 igmm_datastore_HGS-OvarianCancerB 832 Nov 21 10:23 programs.txt
chapmanb commented 6 years ago

Alison -- thanks for the details. I'm not sure what is going on or why bcbio doesn't catch the error. Is it possible to share the BAM file with me off list (https://github.com/chapmanb/) so I can try to reproduce? I'd guess either the fgbio changes or the addition of the separate FilterConsensusReads step is causing some kind of issue.

Practically, bcbio should fail after this step given the missing consensus fastq reads, is that right? I want to be sure it's not continuing on with processing; silent failures and skips are very bad.

Thanks again for helping to debug.

ameynert commented 6 years ago

Sorry Brad, I'm not allowed to share any of these outside the EU. Could we ask Miika in Cambridge?

Yes, an error message of some sort would be appropriate.

mjafin commented 6 years ago

If I can be of help we can download the data on our UK cluster where Brad also has access. Would that help Alison?

ameynert commented 6 years ago

Thank you, Miika, that would be very helpful, since you're on this project anyway. Can I send you the FASTQ for one of the samples that worked under 1.0.5 and for a sample that didn't work under 1.0.6? The raw FASTQ was produced identically, as far as I know, and I have processed it into bcbio-ready FASTQ with the same scripts, and the files look the same to me in terms of structure.

mjafin commented 6 years ago

Hi Alison, That would work for me. How big are the files? I can provide a Box folder if necessary.

ameynert commented 6 years ago

About 3.5 GB total for 4 FASTQ files with the UMIs in the headers (as generated by the bcbio UMI pre-processing scripts).


chapmanb commented 6 years ago

Alison and Miika; Thanks so much for the help making the data available. I ran these and have, I guess, good and bad news all rolled into one: these worked for me with the latest development version (1.0.7a-05fc3c9, but I don't think the specific revision matters) and fgbio (0.4.0). So it's bad I can't reproduce and offer any suggestions but good that a potential path to getting running cleanly is to upgrade either bcbio or fgbio and then run. I wish I had a better clue about the underlying behavior but hope this provides a useful path forward to getting your analysis finished. Thanks again for all the help debugging and please let me know if upgrading does not fix the problem.

ameynert commented 6 years ago

I was still seeing the issue on the development version, but now think I've solved it by bumping the fgbio default memory request to -Xms4g and -Xmx8g explicitly in postalign.py. That's what I had specified in my galaxy YAML config as default jvm_opts, so I'm not sure why they weren't feeding through to fgbio.
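For reference, the stanza in my system YAML is along these lines (values illustrative rather than copied verbatim from my config; this follows the documented `resources` layout):

```yaml
# bcbio_system.yaml -- per-program resource overrides.
# The fgbio entry is what I would expect to feed jvm_opts
# through to the UMI consensus step.
resources:
  default:
    cores: 8
    memory: 2G
  fgbio:
    jvm_opts: ["-Xms4g", "-Xmx8g"]
```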

chapmanb commented 6 years ago

Alison; Thanks for continuing to debug this and for the additional details. I've been trying to replicate this here by running with both lower and higher memory and all I can manage to do is trigger memory errors and failures from fgbio, not any type of silent skipping. I'm totally baffled as to what I'm missing to reproduce the problem.

What memory specification do you get if you don't change it directly in postalign.py? What are you specifying in bcbio_system.yaml? Maybe if I can debug what is going on there that will give some more clues. Thanks again for helping with testing and debugging.

ameynert commented 6 years ago

Brad, it appears to be something peculiar to our cluster. I have it working interactively on a standalone node with an unedited postalign.py from release 1.0.6, using 8 cores and Xms2g Xmx4g. When I submit the same job to the cluster (standard SGE Linux, using a bash script wrapping a call to bcbio_nextgen.py -t local) it's failing. I am totally puzzled.

mjafin commented 6 years ago

I wonder if your SGE is somehow configured to kill processes that eat more resources than allocated? I'm not sure SGE does this (it doesn't on our system) - from experience, Slurm eagerly does.

You could also ask the system administrators to monitor the nodes for memory use during the failure?
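On SGE, an h_vmem request typically shows up inside the job as a per-process virtual-memory ulimit, and a JVM asked for more heap than that cap can die at startup without a clear Java error. A quick check worth running from inside a submitted job (assuming a POSIX shell):

```shell
# Print the per-process virtual memory cap in KB ("unlimited" if none).
# Under an SGE h_vmem limit this is finite; a JVM whose -Xms/-Xmx
# exceed it can be killed before it prints any error message.
ulimit -v
```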

chapmanb commented 6 years ago

Alison; Thanks for following up and sorry about this inconsistent behavior. I'd definitely like to squash this if we can figure out the underlying cause. bcbio should error out if the jobs get killed by the scheduler, and I'm not sure how they'd continue on without the consensus output. If you can understand the behavior on your system I'd be keen to add in checks to try and avoid continuing and at least raise the correct error if things fail. Thanks much for the help debugging this.

ameynert commented 6 years ago

Brad, I'll check into that. I know we have previously had issues with Java memory requirements and the system mis-interpreting how much memory is actually available.

roryk commented 5 years ago

Thanks, closing this as it is stale and there has been a bunch of fgbio work since then, so it is likely to not be an issue. Please re-open if I am mistaken.