bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

AssertionError multisample joint calling #1340

Closed DiyaVaka closed 8 years ago

DiyaVaka commented 8 years ago

Hi Brad and group,

Recently we ran bcbio with multisample joint calling and it is failing. When we do single-sample joint calling it works. Attached are the sample YAML and bcbio error files. It is complaining about coordinates, but I am not sure how it worked for the single sample.

Attached are the bcbio err file and sample file.

Thanks,

bcbio.stderr.txt sample.yaml.txt

chapmanb commented 8 years ago

Thanks for the report and sorry about the issue. I believe the parallelization code is getting a bit confused because you're using a mix of joint calls (gatk-haplotype and platypus) and pooled calls (gatk):

    variantcaller: [gatk, gatk-haplotype, platypus]
    jointcaller: [gatk-haplotype-joint, platypus-joint]

Practically, you don't need joint calling here. It's typically used only when memory requirements for pooled calling are too high because of a large number of samples (250+). I'd suggest changing to:

    variantcaller: [gatk, gatk-haplotype, platypus]

and removing the joint caller line, and things should work cleanly. It will call your three samples together based on the shared batch name in the metadata section.
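As a sketch, a per-sample entry for this setup would look something like the following (sample and batch names here are placeholders, not taken from the attached YAML):

```yaml
# Hypothetical excerpt from a bcbio sample YAML: three samples sharing
# a batch name in metadata get called together as a pool.
details:
  - files: [proband_R1.fastq.gz, proband_R2.fastq.gz]
    description: proband
    metadata:
      batch: fam01            # same batch value on all three samples
    algorithm:
      aligner: bwa
      variantcaller: [gatk, gatk-haplotype, platypus]
      # no jointcaller line needed for a pool of this size
```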

Hope this helps get things running. If you run into any issues please do re-open and we can investigate more.

DiyaVaka commented 8 years ago

The reason we wanted to do joint calling is that we are not 100% sure that some of the rare calls on the proband can be called. So we want to use 48 normal samples so that we can catch these rare variants.

How can we achieve our goal without using joint calling?

chapmanb commented 8 years ago

Having the same metadata/batch for the samples will ensure they are called together and take advantage of having multiple samples. Batches of 48 samples can be called together without large memory requirements. Joint calling is only needed with bigger sample sizes and is a different approach (call samples individually, then combine together into one call set):

http://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#population-calling

Hope this helps.

DiyaVaka commented 8 years ago

Hi Brad,

I started our 2nd joint calling test run with 56 control samples. I am getting an error in the prep_data method of multi.py. Attached is the bcbio.stderr for it. I can send you the YAML file too if that helps.

Thanks. bcbio.stderr.txt

chapmanb commented 8 years ago

Diya; This error looks like you have jointcaller set in your configuration but not variantcaller. Is that possible? If so, setting both (http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#population-calling) and re-running in place should hopefully get things working.
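Concretely, that means having both keys together in the algorithm section, roughly like this sketch (caller names follow the ones used earlier in this thread):

```yaml
# Sketch: joint calling needs both keys set together.
algorithm:
  variantcaller: gatk-haplotype       # per-sample gVCF generation
  jointcaller: gatk-haplotype-joint   # combined genotyping across the batch
```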

DiyaVaka commented 8 years ago

Hi Brad,

As we use the control samples for every flowcell, is there a way we can run these control samples only once, store the data in a temp location, and just run the new samples every time? Can we have a setup in the YAML where we list BAM files in place of FASTQs and set bwa to false for those?

chapmanb commented 8 years ago

Diya; Thanks for the question. I'm a bit confused as to what you mean. If you're running controls on flowcells, don't you want to align and process those reads as well? If you don't want to process them you can either leave them off the YAML or turn alignment off. I'm not sure I understand why you'd want to sub in BAMs (from previous runs?) for the controls.

Practically, any sort of specification you want is possible; this is a bit outside of bcbio and up to you, as long as you give it the right YAML. Hope this helps.

DiyaVaka commented 8 years ago

Brad,

Sorry for not writing it in detail. We sequence control samples only once and sequence clinical samples every week. We want to run controls + clinical samples for every set. We don't want to use resources rerunning the same controls every time through the bcbio pipeline; instead I want to somehow turn them off after the initial run and use their alignment results for the next set, so that I can just use joint calling on controls + clinical samples for every new set.

Please let me know if it's not clear.

chapmanb commented 8 years ago

Diya; Thanks for the explanation. You can do this in bcbio by passing the BAM file of the control as files and setting aligner: false in the algorithm configuration:

http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#alignment

Hope that provides what you need to do this. If you want to use a single template for the whole thing you can adjust the parameter for just the control in the template CSV to set aligner as false for those control samples. Hope this helps.
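A control entry re-using an existing BAM might look like this sketch (paths, sample names, and the batch value are placeholders):

```yaml
# Sketch: feed a previously aligned control BAM in as input and skip alignment.
- files: [/archive/controls/control01.bam]
  description: control01
  metadata:
    batch: weekly_run        # same batch as the new clinical samples
  algorithm:
    aligner: false           # use the BAM as-is instead of re-running bwa
```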

DiyaVaka commented 8 years ago

Awesome, Brad. Great that it takes BAM files as input and we can set alignment to false for those. Will try it and let you know if any issue arises.

DiyaVaka commented 8 years ago

Hi Brad,

Recently we are experiencing time lags for gatk-haplotype. It took almost 2 weeks to run gatk-haplotype on 56 control samples. For joint calling, can we provide the gVCFs from the control files so that we run it only on our experiment samples and save time there?

chapmanb commented 8 years ago

Diya; We don't have a way in bcbio to take in pre-created gVCFs and feed them into joint calling. It's a great idea to work on; we just haven't had the opportunity to implement it. Sorry to not have anything you can use right now.

DiyaVaka commented 8 years ago

thank you..

DiyaVaka commented 7 years ago

Hi Brad,

We have a researcher who has 204 samples run at our core facility and she is interested in joint calling. They anticipate more samples in the near future and think running joint calling in N mode makes more sense. I would like to know how many resources this kind of setup needs, and can you also give a rough estimate of how long this run can take? We will be using only gatk and gatk-haplotype. Attached is my test yaml for this project. sample.yaml.txt

chapmanb commented 7 years ago

Diya; Thanks for the question. Below is the configuration I'd suggest for doing gVCF based calling. There is no reason to use GATK UnifiedGenotyper as it's not nearly as good as HaplotypeCaller. You can also skip realignment with HaplotypeCaller and can generally skip recalibration unless you have read quality issues:

    algorithm:
      aligner: bwa
      variantcaller: gatk-haplotype
      jointcaller: gatk-haplotype-joint
      tools_on: [qualimap_full]
      coverage: /mnt/speed/qc/sequencing/biodata/capture_regions/ExV3_151111_HG19_HIVresevoirV1_EZ_HX1_primary_targets_nochr.bed
      coverage_interval: regional
      coverage_depth: high
      variant_regions: /mnt/speed/qc/sequencing/biodata/capture_regions/ExV3_151111_HG19_HIVresevoirV1_EZ_HX1_primary_targets_nochr.bed
      clinical_reporting: true

It's a bit hard to estimate run times, but exomes are normally not excessively long. Hope this helps.

DiyaVaka commented 7 years ago

Thank you Brad. Can you please tell us the pros and cons of N calling vs N+1 calling? We are interested in doing joint calling for 202 clinical cases and would like to know which procedure you think we should proceed with.

chapmanb commented 7 years ago

Diya; Generally we prefer joint calling (N+1 calling) since it gives more flexibility than pooled calling and scales well. Our validation work showed that both work well:

http://bcb.io/2014/10/07/joint-calling/

For 202 samples you're starting to get into larger size for batch (N) calling, so I'd use a joint calling approach. Hope this helps.

DiyaVaka commented 7 years ago

Thank you. We are running this model for the first time. How could we make sure what ever we do is actually correct? Are there any checks at each point to ensure this?

chapmanb commented 7 years ago

Diya -- I'm not sure what to suggest other than sanity checking outputs. bcbio tries to do the right thing and warn you with errors if something is wrong in general. Hopefully things work out cleanly for you.

DiyaVaka commented 7 years ago

Hi Brad,

Our joint calling run is complete. It stopped at the QC stage, but as we already have QC complete from single-sample calling we did not continue the run.

There is a gatk-haplotype folder and a gatk-haplotype-joint folder which have the VCF files. Will this method also produce a single joint VCF file which has all the samples in it?

chapmanb commented 7 years ago

Diya -- the gatk-haplotype-joint folder should have the joint-called multisample VCF (joint/gatk-haplotype-joint/batchname/batchname-joint.vcf.gz). If you can figure out the QC errors, the final outputs should also have this file along with the individual gVCF files from each sample. Hope this helps.

DiyaVaka commented 7 years ago

Thanks for the quick response. Based on the template you mentioned above, I did not use batch. Without this, we won't have the batch file? Also, we used the N+1 model. Will this change things? Also, is there any way I can turn off QC? As of now I have tools_on: [qualimap_full]. What if I move it to tools_off?

chapmanb commented 7 years ago

Diya -- sorry for any confusion. You do need the samples specified in the same batch to get a multiple sample VCF output. The batch parameter is how bcbio knows which samples to analyze together. Without this you'll get single sample gVCF and VCF outputs.

If you don't want most of the QC in bcbio you can do:

qc: [samtools]

which will only run the more minimal samtools QC step. Hope this helps for getting your data analyzed.
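In context, that setting sits in the algorithm section alongside the callers, roughly like this sketch (based on the configuration discussed earlier in the thread):

```yaml
algorithm:
  variantcaller: gatk-haplotype
  jointcaller: gatk-haplotype-joint
  qc: [samtools]   # restrict QC to the minimal samtools stats step
```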

DiyaVaka commented 7 years ago

Thank you Brad. Just to understand better: does running it in batch mode perform calls differently than the single gVCFs, or does batch just concatenate all the gVCFs into one big file?

Is joint calling considered incomplete without batch mode?

chapmanb commented 7 years ago

That's right: in batch mode GenotypeGVCFs gets called with all samples in the batch simultaneously, so it can take advantage of information from the other samples -- this is really joint calling. Without a batch it's calling only on the individual samples, so it's essentially single-sample calling with a gVCF intermediate. How much joint calling helps versus single-sample calling is not an easy thing to answer generally and will depend on the depth; joint calling helps more the lower the depth is. Hope this helps.

DiyaVaka commented 7 years ago

Hi Brad,

We spent one month running the above setup on 200+ samples. I did not provide the batch information. When I look at the outputs I see a gatk-haplotype folder with the gatk-haplotype calls for each of the 202 samples. I also see a gatk-haplotype-joint folder which has 202 sample folders, each named joint (e.g. 1010-NGv3-joint), containing the following files:

1010-NGv3-joint-joint.vcf.gz
1010-NGv3-joint-joint.vcf.gz.tbi
1010-NGv3-joint-joint-effects.vcf.gz
1010-NGv3-joint-joint-effects.vcf.gz.tbi
1010-NGv3-joint-joint-effects-filterSNP.vcf.gz
1010-NGv3-joint-joint-effects-filterSNP.vcf.gz.tbi
1010-NGv3-joint-joint-effects-filterSNP-filterINDEL.vcf.gz
1010-NGv3-joint-joint-effects-filterSNP-filterINDEL.vcf.gz.tbi
1010-NGv3-joint-joint-effects-filterSNP-filterINDEL-gatkclean.vcf.gz
1010-NGv3-joint-joint-effects-filterSNP-filterINDEL-gatkclean.vcf.gz.tbi
1010-NGv3-joint-joint-effects-stats.genes.txt
1010-NGv3-joint-joint-effects-stats.html
1010-NGv3-joint-joint-files.list

So will the batch files be different than the above? Does it mean the joint files did not use the pooled samples' information to make these calls? Your comment above says that providing batch will make the calls again, and this would be another month of work.

chapmanb commented 7 years ago

Diya -- sorry about the confusion and problems. Without batch information they did not use pooled information to make the call, you have 202 samples with individual genotyping. The good news is that you can edit your sample YAML file to add batch information now, remove the checkpoints_parallel directory and re-run. That will re-use the existing single sample gVCF files and only re-do the final joint genotyping, resulting in a joint called file which should be much faster than re-running from scratch. Hope this works to get you the analysis you wanted.

DiyaVaka commented 7 years ago

Hi Brad,

It's been 2 weeks and it's still running after updating the script with the batch information. The last 2 times it failed with the following error. Can you please help?

/mnt/speed/qc/sequencing/ihg-pipeline/sulggi/joint/gatk-haplotype-joint/Batch1/Y/Batch1-Y_0_15523052-b97.vcf.gz --variant /mnt/speed/qc/sequencing/ihg-pipeline/sulggi/joint/gatk-haplotype-joint/Batch1/Y/Batch1-Y_0_15523052-b98.vcf.gz --variant /mnt/speed/qc/sequencing/ihg-pipeline/sulggi/joint/gatk-haplotype-joint/Batch1/Y/Batch1-Y_0_15523052-b99.vcf.gz --variant /mnt/speed/qc/sequencing/ihg-pipeline/sulggi/joint/gatk-haplotype-joint/Batch1/Y/Batch1-Y_0_15523052-b100.vcf.gz --dbsnp /mnt/speed/qc/sequencing/bcbio/genomes/Hsapiens/GRCh37/variation/dbsnp-147.vcf.gz -nt 8 -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment ' returned non-zero exit status 134

chapmanb commented 7 years ago

Diya; Sorry about the issue. Would you be able to share more of the log file and traceback as a gist (https://gist.github.com/)? This tells me the command failed but doesn't give any information about what happened so I'm not sure what to suggest. The useful error message is usually at the start of the error traceback. It might be running into memory or other limitations but I can be more specific with additional information. Thanks much for the help debugging.

DiyaVaka commented 7 years ago

Thanks Brad. Here is the start of the crash report:

# A fatal error has been detected by the Java Runtime Environment:

#
#  SIGSEGV (0xb) at pc=0x00002b31359800ed, pid=7544, tid=47774041732864
#
# JRE version: OpenJDK Runtime Environment (8.0_45-b14) (build 1.8.0_45-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.45-b02 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x9490ed]  ContiguousSpace::prepare_for_compaction(CompactPoint*)+0x15d
#
# Core dump written. Default location: /mnt/speed/qc/sequencing/ihg-pipeline/sulggi/core or core.7544
#
# An error report file with more information is saved as:
# /mnt/speed/qc/sequencing/ihg-pipeline/sulggi/hs_err_pid7544.log
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/

This is the relevant structure from my system.yaml:

  gatk:
    dir: /mnt/speed/qc/sequencing/bcbio/toolplus/gatk/2015.1.1-3.4.46-0-ga8e1d99
    jvm_opts:
    - -Xms500m
    - -Xmx40g
  gatk-haplotype:
    jvm_opts:
    - -Xms500m
    - -Xmx40g
  gatk-vqsr:
    jvm_opts:
    - -Xms500m
    - -Xmx40g
  hisat2:
    cores: 16
    memory: 2G
  kraken:
    cores: 4
    memory: 2g
  macs2:
    cores: 1
    memory: 8g
  miraligner:
    jvm_opts:
    - -Xms750m
    - -Xmx4500m

Should I give more memory? It's running on 8 cores.

chapmanb commented 7 years ago

Diya; Thanks for the additional details and the segfault. I'm a bit confused where this call is coming from. I read through the code and am not sure where we allow use of -nt 8. This option often causes problems, so we've turned it off in a lot of places. What version of bcbio are you running? Would you be able to share a fuller error log so I can see the full command line and know where the call comes from?

I checked in some speculative fixes based on other segfaults we've seen but am not sure if they'll fix the issue based on not being able to trace it back. You could test these or provide more information and we can continue to dig.

For the memory specifications, the jvm_opts should be specified per-core, so yours look much too high. If you've got 8 cores and 40Gb of memory per machine you'd want jvm_opts: [-Xms500m, -Xmx5g]; bcbio automatically scales up in relation to cores when running.
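Applied to the resources file above, the GATK entries would become something like this sketch:

```yaml
# Sketch: per-core JVM memory; bcbio multiplies this by the core count
# at runtime, so 8 cores x 5g fits within a 40g machine.
gatk:
  jvm_opts: [-Xms500m, -Xmx5g]
gatk-haplotype:
  jvm_opts: [-Xms500m, -Xmx5g]
gatk-vqsr:
  jvm_opts: [-Xms500m, -Xmx5g]
```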

Hope this helps.

DiyaVaka commented 7 years ago

Attached are the bcbio.sh and stderr files: bcbio_stderr.txt, bcbio_sh.txt. bcbio version 0.9.4.

chapmanb commented 7 years ago

Diya -- thanks for the details. I'm still a bit confused as we should have disabled -nt usage for GenotypeGVCFs with version 0.9.3 (https://github.com/chapmanb/bcbio-nextgen/commit/c55b02341388f5c89c3583a3c3fa7346d13c3057) due to some GATK3 bugs with GenotypeGVCFs. My best guess is that you're hitting something like that with the version of GATK3 you're running and might want to edit your local version of the install to not use -nt and see if that resolves the issue. Hope this helps get it running for you.

DiyaVaka commented 7 years ago

So you want me to comment out this line below in the gatkjoint.py script?

    params += ["-nt", str(min(6, cores))]

chapmanb commented 7 years ago

That's right, although I suspect yours may be different somehow since it's allowing more than 6 cores. Hopefully you can find the point in your code base and adjust. Fingers crossed this fixes it.

DiyaVaka commented 7 years ago

Thank you. One thing that crossed my mind: the whole reason we eliminated batch pooling was that we were interested in the N+1 mode, but currently, using the batch information, what it is doing looks similar to the pooling approach. Just making sure we are indeed doing this correctly.

chapmanb commented 7 years ago

Diya -- all of the grouped strategies require analyzing all samples in a batch together. In pooled calling this needs all BAMs present and calls together from those BAMs. In joint calling we first call to generate gVCFs, then combine those gVCFs for the final calls. The advantage is that if you add a new sample you only need to repeat the final gVCF genotyping, not the initial calls. More details are here: https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#population-calling

The step that's failing here is the combined gVCF joint calling step for your batch. Hope this provides context and matches what you're looking for.

DiyaVaka commented 7 years ago

Hi Brad,

As per your suggestion I commented out the line in the code and re-ran it, but I'm still getting the same error. Attached hs_err_pid75099_log.txt is the error log. Can you please help?

chapmanb commented 7 years ago

Diya; Thanks for debugging and sorry that you're still hitting the same issue. Unfortunately the java core dump file doesn't give me any extra information to go on, so some thoughts to try and help:

Hope this helps.

DiyaVaka commented 7 years ago

Thanks Brad. I talked to the investigator and she says that maybe it's chr Y which is having the issue. In your experience, did you see any issues with the Y chromosome?

chapmanb commented 7 years ago

Diya -- we don't have any problems specifically with chrY. You're right that sometimes samples can have region-specific issues due to depth or other problems, but there is nothing general I can suggest for your specific problem. Hope some of the debugging ideas above help.

DiyaVaka commented 7 years ago

Hi Brad,

I cannot upgrade the version of the bcbio pipeline because it's our clinically licensed version. I tried running the command that failed outside of the pipeline, but it still failed. At this point what I have is a Batch folder with each chr split into 100 sub vcf.gz files. Can you please help me understand what happens in the background, as we are failing to run this step through the version of bcbio that we are using?

Here is what I am planning to do; please correct me if this is not what happens through the pipeline. Merge the 100 sub-VCF files for each chr into one VCF file:

    java -jar GenomeAnalysisTK.jar \
        -T GenotypeGVCFs \
        -R reference.fasta \
        --variant sample1.g.vcf \
        --variant sample2.g.vcf \
        -o output.vcf

and then concatenate these single VCF files from each chr into one big VCF. Will this procedure do the same thing bcbio does?

Thanks,

chapmanb commented 7 years ago

Diya; Thanks for following up on this. It's useful that you can replicate outside of bcbio. Would it be possible to use an updated version of GATK just for this merge step to determine if it's a bug in your version or something fundamentally problematic about the data? If it worked then you could plug that into the right place in bcbio and use it to do the rest of the calling and merging.

You're right that those are the commands bcbio is automating, but I'm not sure how running them outside of bcbio with the same versions will help. Won't you hit the same underlying issue? Is there anything bcbio is doing that you can tweak to help resolve the issue?

Thanks much for the patience debugging this.

DiyaVaka commented 7 years ago

Hi,

For each chr I see there are multiple files ending in .vcf.gz; I expected only 1 file (ending in .vcf.gz) to be there after merging. There are multiple (b0-b100 copies for 8 or 9 of them). Can you please tell me why this is so? If so, can I merge (run GenotypeGVCFs) on each of the files which do not have "b0-b100"? For example:

Batch1-1_0_15541937.vcf.gz
Batch1-1_108853084_142653859.vcf.gz
Batch1-1_142667225_158225186.vcf.gz
Batch1-1_15545758_31186553.vcf.gz
Batch1-1_158225750_173735476.vcf.gz
Batch1-1_173743388_190068305.vcf.gz
Batch1-1_190129786_205687627.vcf.gz
Batch1-1_205688607_221507352.vcf.gz
Batch1-1_221509203_237023277-b9.vcf.gz
Batch1-1_237024414_249250621.vcf.gz
Batch1-1_31187031_46714286.vcf.gz
Batch1-1_46715618_62228874.vcf.gz
Batch1-1_93298481_108815952.vcf.gz

Based on your suggestion -- "If so, can you tweak by removing -XX:+UseSerialGC from the command line to see if that fixes it. We've used this to fix java core dumps on some multicore processes in recent versions of bcbio, but not sure if it's the same error." --

I was able to get past the earlier failed java command it was complaining about. I need to disable the SerialGC usage. How can I do this through bcbio? Is there a place where I can disable it?

chapmanb commented 7 years ago

Diya; Glad you're making some progress with this. In bcbio the SerialGC command gets added here:

https://github.com/chapmanb/bcbio-nextgen/blob/38301f71f35f64725184a5c6306ca828e2191442/bcbio/broad/__init__.py#L34

You'll have to trace back some to where it is in your older version but hopefully commenting it out there will let you proceed without core dumps.

For the batching, bcbio batches joint calling with more than 200 samples into chunks based on GATK3 recommendations. You can adjust how bcbio does this using the joint_group_size parameter, which defaults to 200:

http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html?highlight=joint_group_size#variant-calling
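For example, to halve the chunk size (a sketch; the jointcaller name follows the one used in this thread):

```yaml
algorithm:
  jointcaller: gatk-haplotype-joint
  joint_group_size: 100   # combine gVCFs in chunks of 100 instead of the default 200
```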

Hope this helps.

DiyaVaka commented 7 years ago

Thank you. I will do that. For research pipelines I want to go with new version of bcbio but I am getting these errors

[localhost] local: /home/sequencing/bcbio/anaconda/bin/conda info --envs --json
[localhost] local: /home/sequencing/bcbio/anaconda/bin/conda create -y --name samtools0
Error: too few arguments, must supply command line package specs or --file

You can specify one or more default packages to install when creating an environment. Doing so allows you to call conda create without explicitly providing any package names.

To set the provided packages, call conda config like this:

conda config --add create_default_packages PACKAGE_NAME

Fatal error: local() encountered an error (return code 1) while executing '/home/sequencing/bcbio/anaconda/bin/conda create -y --name samtools0'

Aborting.

    Traceback (most recent call last):
      File "./bcbio-nextgen/scripts/bcbio_nextgen_install.py", line 251, in <module>
        main(parser.parse_args(), sys.argv[1:])
      File "./bcbio-nextgen/scripts/bcbio_nextgen_install.py", line 44, in main
        subprocess.check_call([bcbio, "upgrade"] + _clean_args(sys_argv, args))
      File "/mnt/speed/qc/sequencing/bcbio/anaconda/envs/sequencing/lib/python2.7/subprocess.py", line 186, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/home/sequencing/bcbio/anaconda/bin/bcbio_nextgen.py', 'upgrade', '--tooldir=/home/sequencing/bcbio', '--genomes=GRCh37', '--aligners', 'bwa', '--aligners', 'bowtie2', '--data']' returned non-zero exit status 1

chapmanb commented 7 years ago

Diya; Sorry about the install problem, I pushed a fix that should resolve this if you remove any temporary CloudBioLinux download directories (rm -rf tmpbcbio-install) and re-run the install/upgrade command you previously ran. It should pick off where it left off and provide an initial package for this environment. Thanks for catching this and hope it gets things cleanly installed for you.

DiyaVaka commented 7 years ago

hi Brad,

Thank you. Now I am having one more issue

[localhost] local: /home/sequencing/bcbio/anaconda/bin/conda info --envs --json
[localhost] local: /home/sequencing/bcbio/anaconda/bin/conda create -y --name python3 python=3
Fetching package metadata ...........

PackageNotFoundError: Packages missing in current channels:

We have searched for the packages in the following channels:

Fatal error: local() encountered an error (return code 1) while executing '/home/sequencing/bcbio/anaconda/bin/conda create -y --name python3 python=3'

Aborting.

    Traceback (most recent call last):
      File "bcbio_nextgen_install.py", line 251, in <module>
        main(parser.parse_args(), sys.argv[1:])
      File "bcbio_nextgen_install.py", line 44, in main
        subprocess.check_call([bcbio, "upgrade"] + _clean_args(sys_argv, args))
      File "/mnt/speed/qc/sequencing/bcbio/anaconda/envs/sequencing/lib/python2.7/subprocess.py", line 186, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/home/sequencing/bcbio/anaconda/bin/bcbio_nextgen.py', 'upgrade', '--tooldir=/home/sequencing/bcbio', '--genomes', 'GRCh37', '--aligners', 'bwa', '--aligners', 'bowtie2', '--data']' returned non-zero exit status 1

chapmanb commented 7 years ago

Diya; Thanks for the report and sorry about the problems. I'm confused as to what is going on here: samtools0 is the name of the next environment to create, and I don't see it in the command line that fails, so I'm not sure how it causes this failure. I can't reproduce this here so am totally confused.

At this point, I'd try removing the existing half finished install and re-start to see if that clears up the problem. Sorry to not have a better clue but hope this avoids the issue and gets your install finished.

DiyaVaka commented 7 years ago

As per your suggestion I wiped all the directories, downloaded the install script, and restarted things from scratch. I still got the same error. I can try doing it again, but I am pretty sure it will end with the same error message.