bigdatagenomics / cannoli

Distributed execution of bioinformatics tools on Apache Spark. Apache 2 licensed.

Interest in multisample freebayes calling #352

Closed colindaven closed 2 years ago

colindaven commented 2 years ago

Hi,

I'm interested in looking at multisample freebayes calling (10+ BAMs) using cannoli.

Is there any reason why this shouldn't work? For example, I see an ominous parameter -single in this nice example. Does this mean only one BAM can be supplied at a time?

https://gist.github.com/heuermh/ba33a061957e99b6719ed9712a7ace67

Thanks, Colin

heuermh commented 2 years ago

Hello @colindaven!

The -single parameter is for output only.

$ cannoli-submit freebayes --help
 INPUT      : Location to pipe alignment records from (e.g. .bam, .cram, .sam).
              If extension is not detected, Parquet is assumed.
 OUTPUT     : Location to pipe variant contexts to (e.g. .vcf, .vcf.gz, .vcf.bgz).
              If extension is not detected, Parquet is assumed.
 ...
 -single    : Saves OUTPUT as single file. (default: false)

By default Spark writes partitioned output files in parallel; -single will concatenate these partitions into a single file, for VCF-formatted output.
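
Concretely, the two output modes look roughly like this on disk (illustrative listings; the exact part-file names vary by Spark/Hadoop version):

# without -single: OUTPUT is a directory of partition files
$ ls out.vcf/
_SUCCESS  part-r-00000  part-r-00001  ...

# with -single: the partitions are concatenated into one ordinary VCF file
$ ls
out.vcf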

Using multiple BAM files for input should work fine, if they have the metadata that freebayes requires (e.g. read groups, if I remember correctly).
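
If you want to verify that up front, something like this lists the @RG header lines per file (a sketch assuming samtools is on the PATH; freebayes takes sample names from the SM field):

# print the read group (@RG) header lines for each input BAM
for bam in /path/*.cram.bam; do
  echo "== $bam"
  samtools view -H "$bam" | grep '^@RG'
done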

colindaven commented 2 years ago

Excellent, thanks for the feedback; I will proceed with testing then. Yes, it's true that read groups are required for multisample freebayes.

colindaven commented 2 years ago

To revisit this, we're not quite sure about the command-line syntax for supplying multiple BAMs (here, two). The second (well, the last arg before the >) seems to be treated as a temp dir?

cannoli-submit --driver-memory 8g --executor-memory 8g -- freebayes -reference /path/ref.fasta -single /path/tmp1.cram.bam /path/tmp2.cram.bam > /home/reinhard/cannoli/cannoli.vcf

Should the command actually be the following (for multisample SNP calling with 2+ BAMs, all with read groups set)?

cannoli-submit --driver-memory 8g --executor-memory 8g -- freebayes -reference /path/ref.fasta -single /path/tmp1.cram.bam /path/tmp2.cram.bam /path/some/tmp > /home/reinhard/cannoli/cannoli.vcf

We have also tried the following (no > redirect), but are still unsure how to specify tmp2.bam (and eventually tmp3 through tmp10+ BAMs, etc.).

cannoli-submit --driver-memory 8g --executor-memory 8g -- freebayes -reference /path/ref.fasta -single /path/tmp1.cram.bam /path/tmp2.cram.bam /home/reinhard/cannoli/cannoli.vcf

BTW, even with just one BAM we are getting an error about the VCF header of the input not being correct.

colindaven commented 2 years ago

/**
 * Freebayes command line arguments.
 */
class FreebayesArgs extends FreebayesFnArgs with ADAMSaveAnyArgs with CramArgs with ParquetArgs {
  @Argument(required = true, metaVar = "INPUT", usage = "Location to pipe alignment records from (e.g. .bam, .cram, .sam). If extension is not detected, Parquet is assumed.", index = 0)
  var inputPath: String = null

From the code, it seems there is a mechanism for providing a path containing multiple BAMs. We tested this without success:

/path/to/bams/
/path/to/bams/*.bam

Another idea would be combining the BAMs into one large metabam. My colleague thinks this is possible; I'm doubtful and have never done it, though (how would freebayes deal with the read groups when writing samples to the VCF?).

So it seems that at the moment it's not possible to submit multiple BAMs to cannoli freebayes?

heuermh commented 2 years ago

This should work for a single input BAM file

cannoli-submit \
  --driver-memory 8g --executor-memory 8g \
  -- \
  freebayes \
  -reference /path/ref.fasta \
  -single \
  /path/tmp1.cram.bam \
  /home/reinhard/cannoli/cannoli.vcf

For multiple input BAM files you need to use a glob in double quotes, so the shell passes the pattern through unexpanded for Spark/Hadoop to resolve:

cannoli-submit \
  --driver-memory 8g --executor-memory 8g \
  -- \
  freebayes \
  -reference /path/ref.fasta \
  -single \
  "/path/*.cram.bam" \
  /home/reinhard/cannoli/cannoli.vcf

For example, I tried a couple of BAM files from https://github.com/nf-core/test-datasets

cannoli-submit \
  --driver-memory 8g --executor-memory 8g \
  -- \
  freebayes \
  -reference /nf-core-test-datasets/data/genomics/homo_sapiens/genome/genome.fasta \
  -single \
  "/nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/*.paired_end.markduplicates.sorted.bam" \
  test.vcf
...
22/08/04 12:08:28 INFO ADAMContext: Loading nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/*.paired_end.markduplicates.sorted.bam as BAM/CRAM/SAM and converting to Alignments.
22/08/04 12:08:28 INFO ADAMContext: Loaded header from file:/nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/test.paired_end.markduplicates.sorted.bam
22/08/04 12:08:29 INFO ADAMContext: Loaded header from file:/nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/test2.paired_end.markduplicates.sorted.bam
Command body threw exception:
java.lang.IllegalArgumentException: requirement failed: Read group dictionary contains multiple samples with identical read group ids.

They load correctly but run into an issue with duplicate read group IDs.
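
One way around that would be to rewrite the read groups so each file gets a unique ID; a sketch using samtools addreplacerg (the ID/SM values here are examples, adjust per sample):

# give each BAM a unique read group ID and sample name
samtools addreplacerg -r ID:test1 -r SM:test1 -m overwrite_all \
  -o test1.rg.bam test.paired_end.markduplicates.sorted.bam
samtools addreplacerg -r ID:test2 -r SM:test2 -m overwrite_all \
  -o test2.rg.bam test2.paired_end.markduplicates.sorted.bam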

magelan commented 2 years ago

Hi Michael,

Thanks for the hint with the glob. Reading multiple BAM files now works.

But writing the output VCF fails with:

Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file

The vcf_header file that gets written does contain the line in question. The input BAMs are correct, and freebayes standalone does produce a VCF.

magelan commented 2 years ago

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 1.0 failed 1 times, most recent failure: Lost task 15.0 in stage 1.0 (TID 945) (deei-bio184.kws-group.ads.dir executor driver): htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
    at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
    at org.bdgenomics.adam.ds.variant.VCFOutFormatter.read(VCFOutFormatter.scala:104)
    at org.bdgenomics.adam.ds.OutFormatterRunner.<init>(OutFormatter.scala:32)
    at org.bdgenomics.adam.ds.GenomicDataset.$anonfun$pipe$8(GenomicDataset.scala:872)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

heuermh commented 2 years ago

Sorry this hasn't been going smoothly!

Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file ... The vcf_header file that gets written does contain the line in question.

The key to this error message is "input file": it is not the file being written that has the issue. There are a lot of transformations in a typical Cannoli-wrapped external process:

1. In RAM on Spark: BAM on disk --> Spark Alignment RDD or Dataset
2. Streamed from Spark to stdout: Spark Alignment RDD or Dataset --> BAM-formatted stream
3. stdin to external process: BAM-formatted stream --> freebayes
4. stdout from external process: freebayes --> VCF-formatted stream
5. Streamed from stdout back into Spark: VCF-formatted stream --> Spark VariantContext RDD or Dataset
6. In RAM on Spark: Spark VariantContext RDD or Dataset --> VCF on disk
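
Roughly speaking, each Spark task ends up doing the moral equivalent of this classic shell pipeline, once per partition (a sketch only; the actual flags Cannoli passes to freebayes differ):

# per partition: stream BAM into freebayes on stdin, read VCF back from its stdout
samtools view -b input.bam | freebayes --stdin -f ref.fasta > partition.vcf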

The error is referring to the VCF-formatted stream output by freebayes as it is being read back into Spark. Typically when I see this error, freebayes has written an empty stream or has thrown an error without returning a non-zero exit code. Spark log files are pretty awful to go through, but you might be able to find what freebayes is complaining about in there.
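
On a standalone cluster the executor logs usually live under $SPARK_HOME/work, so something along these lines can surface the child process's stderr (log locations vary by deployment, so treat the path as an assumption):

# search executor logs recursively for output from the freebayes child process
grep -ri freebayes "$SPARK_HOME/work/"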

If all this transforming gives you pause, consider using Parquet instead of the BAM/VCF flat file formats, or run many analysis steps within the same Spark application (e.g. via cannoli-shell or a custom Java/Scala library) instead of running them individually via cannoli-submit.
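
For the Parquet route, the conversion can be done once up front with ADAM, something like this (assuming adam-submit is on the PATH; transformAlignments is ADAM's format-conversion action):

# convert BAM to ADAM Parquet alignments once; later steps read the Parquet directly
adam-submit transformAlignments /path/tmp1.cram.bam /path/tmp1.alignments.adam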

Hope this helps!

magelan commented 2 years ago

Thanks for the detailed explanation. You are right, there is a freebayes error in the Spark log:

freebayes: unrecognized option '--strict-vcf' did you mean --stdin ?

I did not set this option in the cannoli-submit command, nor could I find it in the cannoli repository.

magelan commented 2 years ago

I have an old freebayes version installed. Will try with a newer one.
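
In case it helps anyone else, checking which binary the workers pick up is as simple as (assuming freebayes is on the PATH):

which freebayes
freebayes --version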

magelan commented 2 years ago

With freebayes 1.3.6 it worked :-) Thanks a lot for your help and your work on cannoli!

colindaven commented 2 years ago

Yes, this is really cool. We tested standalone freebayes v1.3.6 against freebayes via cannoli and got very similar results.

We used rtg vcfeval for this.

Basically the main difference is a reduction in the number of decimal places, applied by a Java filter in Cannoli. Also, very complex variants are represented by multiple lines, often with the same start position, i.e. variant positions can be repeated in a VCF (I had not seen this before).

We're pretty happy with that. Comparing just SNPs, as opposed to these completely unfiltered files, led to far fewer false positives and false negatives (~300 each).

rtg vcfeval -b freebayes.vcf.gz -c cannoli_single.vcf.gz -o out3 -t x.sdf/ --sample xxxxxxxxx

Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
     None            5505941        5568733       4457       3940     0.9992       0.9993     0.9992

heuermh commented 2 years ago

Great to hear!

Also, very complex variants are represented by multiple lines

Right, ADAM/Cannoli, like some other software, always splits multi-allelic variants into biallelic records.
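
Schematically, with made-up records and only the first five VCF columns shown:

# one multi-allelic record, as freebayes may emit it
chr1  100  .  A  T,G
# the same site split into biallelic records, as ADAM stores it
chr1  100  .  A  T
chr1  100  .  A  G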

heuermh commented 2 years ago

Closing as resolved.