colindaven closed this issue 2 years ago
Hello @colindaven!
The -single parameter is for output only:
$ cannoli-submit freebayes --help
INPUT : Location to pipe alignment records from (e.g. .bam, .cram, .sam). If
extension is not detected, Parquet is assumed.
OUTPUT : Location to pipe variant contexts to (e.g. .vcf, .vcf.gz, .vcf.bgz). If
extension is not detected, Parquet is assumed.
...
-single : Saves OUTPUT as single file. (default: false)
By default Spark writes partitioned output files in parallel; -single
concatenates these partitions into a single file, for VCF-formatted output.
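What -single does can be sketched in plain shell, with no Spark involved (the /tmp paths and file names below are invented for the demo): Spark's partitioned output is a directory of part files, and single-file output is essentially their ordered concatenation (Cannoli/ADAM additionally handle the VCF header).

```shell
# Toy sketch of partitioned vs. single-file output (paths are made up).
mkdir -p /tmp/out.vcf.parts
printf 'record-from-partition-0\n' > /tmp/out.vcf.parts/part-00000
printf 'record-from-partition-1\n' > /tmp/out.vcf.parts/part-00001
# What -single effectively does: concatenate the parts, in order,
# into one flat file.
cat /tmp/out.vcf.parts/part-* > /tmp/out.vcf
cat /tmp/out.vcf
```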
Using multiple BAM files for input should work fine, if they have the metadata that freebayes
requires (e.g. read groups, if I remember correctly).
Excellent, thanks for the feedback; then I will proceed with testing. Yes, read groups are required for multisample freebayes, that's true.
To revisit this: we're not quite sure about the command-line syntax for passing multiple BAMs (here, two). The second path (well, the last argument before the >) seems to be treated as a temp dir?
cannoli-submit --driver-memory 8g --executor-memory 8g -- freebayes -reference /path/ref.fasta -single /path/tmp1.cram.bam /path/tmp2.cram.bam > /home/reinhard/cannoli/cannoli.vcf
Should the command actually be (for multisample SNP calling with 2+ bams, all with read groups set)
cannoli-submit --driver-memory 8g --executor-memory 8g -- freebayes -reference /path/ref.fasta -single /path/tmp1.cram.bam /path/tmp2.cram.bam /path/some/tmp > /home/reinhard/cannoli/cannoli.vcf
We have also tried the following (no > redirect), but are still unsure how to specify tmp2.bam (and eventually tmp3 through tmp10+ BAMs, etc.).
cannoli-submit --driver-memory 8g --executor-memory 8g -- freebayes -reference /path/ref.fasta -single /path/tmp1.cram.bam /path/tmp2.cram.bam /home/reinhard/cannoli/cannoli.vcf
BTW, even with just one BAM we are getting an error about the VCF header of the input not being correct.
/**
 * Freebayes command line arguments.
 */
class FreebayesArgs extends FreebayesFnArgs with ADAMSaveAnyArgs with CramArgs with ParquetArgs {
  @Argument(required = true, metaVar = "INPUT", usage = "Location to pipe alignment records from (e.g. .bam, .cram, .sam). If extension is not detected, Parquet is assumed.", index = 0)
  var inputPath: String = null
  // ...
}
From the code, it seems there is a mechanism for providing a path of BAMs. We tested this without success.
/path/to/bams/
/path/to/bams/*.bam
Another idea would be combining the BAMs into one large merged BAM - my colleague thinks this is possible; I'm doubtful and have never done it (how would freebayes handle the read groups when writing samples to the VCF?).
So it seems that at the moment it's not possible to submit multiple BAMs to cannoli freebayes?
This should work for a single input BAM file:
cannoli-submit \
--driver-memory 8g --executor-memory 8g \
-- \
freebayes \
-reference /path/ref.fasta \
-single \
/path/tmp1.cram.bam \
/home/reinhard/cannoli/cannoli.vcf
For multiple input BAM files you need to use a glob in double quotes, so the shell passes the pattern through unexpanded:
cannoli-submit \
--driver-memory 8g --executor-memory 8g \
-- \
freebayes \
-reference /path/ref.fasta \
-single \
"/path/*.cram.bam" \
/home/reinhard/cannoli/cannoli.vcf
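The double quotes matter because they keep your shell from expanding the glob before cannoli-submit ever sees it; quoted, the pattern reaches the program as one literal argument that Spark/Hadoop can expand itself. A quick plain-shell demonstration of the difference (the /tmp paths and file names are invented):

```shell
# Create two dummy "BAM" files (names are made up for the demo).
mkdir -p /tmp/glob_demo
touch /tmp/glob_demo/a.bam /tmp/glob_demo/b.bam
# Unquoted: the shell expands the glob into two separate arguments.
printf 'arg: %s\n' /tmp/glob_demo/*.bam
# Quoted: a single literal argument containing the pattern is passed
# through, leaving expansion to the receiving program.
printf 'arg: %s\n' "/tmp/glob_demo/*.bam"
```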
For example, I tried a couple of BAM files from https://github.com/nf-core/test-datasets
cannoli-submit \
--driver-memory 8g --executor-memory 8g \
-- \
freebayes \
-reference /nf-core-test-datasets/data/genomics/homo_sapiens/genome/genome.fasta \
-single \
"/nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/*.paired_end.markduplicates.sorted.bam" \
test.vcf
...
22/08/04 12:08:28 INFO ADAMContext: Loading nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/*.paired_end.markduplicates.sorted.bam as BAM/CRAM/SAM and converting to Alignments.
22/08/04 12:08:28 INFO ADAMContext: Loaded header from file:/nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/test.paired_end.markduplicates.sorted.bam
22/08/04 12:08:29 INFO ADAMContext: Loaded header from file:/nf-core-test-datasets/data/genomics/homo_sapiens/illumina/bam/test2.paired_end.markduplicates.sorted.bam
Command body threw exception:
java.lang.IllegalArgumentException: requirement failed: Read group dictionary contains multiple samples with identical read group ids.
They load correctly but run into an issue with read group ids.
Hi Michael,
thanks for the hint with the glob; reading multiple BAM files works now. But writing the output VCF fails with:
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
The vcf_header file that gets written does contain the line in question. The input BAMs are correct, and standalone freebayes does produce a VCF.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 1.0 failed 1 times, most recent failure: Lost task 15.0 in stage 1.0 (TID 945) (deei-bio184.kws-group.ads.dir executor driver): htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:119)
at org.bdgenomics.adam.ds.variant.VCFOutFormatter.read(VCFOutFormatter.scala:104)
at org.bdgenomics.adam.ds.OutFormatterRunner.
Sorry this hasn't been going smoothly!
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file ... The vcf_header file that gets written does contain the line in question.
Key to this error message is "input file" -- it is not the file being written that has an issue. There are a lot of transformations in a typical Cannoli-wrapped external process.
In RAM on Spark:
    BAM on disk --> Spark Alignment RDD or Dataset
Streamed from Spark to stdout:
    Spark Alignment RDD or Dataset --> BAM-formatted stream
stdin to external process:
    BAM-formatted stream --> freebayes
stdout from external process:
    freebayes --> VCF-formatted stream
Streamed into Spark:
    VCF-formatted stream --> Spark VariantContext RDD or Dataset
In RAM on Spark:
    Spark VariantContext RDD or Dataset --> VCF on disk
The error is referring to the VCF-formatted stream output by freebayes as it is being read back into Spark. Typically when I see this error freebayes has written an empty file or thrown an error without returning a non-zero shell error code. Spark log files are pretty awful to go through but you might be able to find what freebayes is complaining about in there.
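That "error without a non-zero exit code" failure mode can be illustrated with a toy shell sketch (the tool name and message below are invented): the wrapped process complains on stderr and produces no usable output, yet still reports success, so the caller proceeds and the only evidence ends up in the logs.

```shell
# Invented stand-in for an external tool that fails "successfully":
# it writes a complaint to stderr, emits nothing useful, and exits 0.
sh -c 'echo "some-tool: unrecognized option" >&2; exit 0' 2> /tmp/tool.stderr
echo "exit code: $?"
# The caller saw success; the real story is only in the captured stderr.
cat /tmp/tool.stderr
```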
If all this transforming gives you pause, consider using Parquet instead of BAM/VCF flat file formats, or run many analysis steps within the same Spark application (e.g. via cannoli-shell or a custom Java/Scala library) instead of running them individually via cannoli-submit.
Hope this helps!
Thanks for the detailed explanation. You are right, there is a freebayes error in the Spark log:
freebayes: unrecognized option '--strict-vcf' did you mean --stdin ?
I did not set this option in the cannoli-submit command, nor could I find it in the cannoli repository.
I have an old freebayes version installed. Will try with a newer one.
With freebayes 1.3.6 it worked :-) Thanks a lot for your help and your work on cannoli!
Yes, this is really cool. We tested standalone freebayes v1.3.6 vs freebayes via Cannoli and got very similar results. We used rtg vcfeval for this.
Basically the main differences are a reduction in the number of decimal places by a Java filter in Cannoli. Also, very complex variants are represented by multiple lines, often with the same start position, i.e. variant positions can be repeated in a VCF (I had not seen this before).
We're pretty happy with that. Just comparing SNPs, as opposed to these completely unfiltered files, led to far fewer false positives and false negatives (~300 each).
rtg vcfeval -b freebayes.vcf.gz -c cannoli_single.vcf.gz -o out3 -t x.sdf/ --sample xxxxxxxxx
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
None 5505941 5568733 4457 3940 0.9992 0.9993 0.9992
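As a sanity check, the last three columns of that row can be recomputed from the counts with a little awk, using the standard definitions (precision = TP-call / (TP-call + FP), sensitivity = TP-baseline / (TP-baseline + FN), F-measure = harmonic mean of the two):

```shell
awk 'BEGIN {
  tp_base = 5505941; tp_call = 5568733; fp = 4457; fn = 3940
  p = tp_call / (tp_call + fp)   # precision
  s = tp_base / (tp_base + fn)   # sensitivity (recall)
  f = 2 * p * s / (p + s)        # F-measure (harmonic mean)
  printf "Precision=%.4f Sensitivity=%.4f F-measure=%.4f\n", p, s, f
}'
# prints: Precision=0.9992 Sensitivity=0.9993 F-measure=0.9992
```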
Great to hear!
Also, very complex variations are represented by multiple lines
Right, ADAM/Cannoli, as with some other software, always splits multi-allelic variants.
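To illustrate what that splitting looks like (the coordinates and alleles below are invented, and the fragment omits the full VCF columns): a multi-allelic record is decomposed into one biallelic record per alternate allele, so the same POS appears on multiple lines.

```shell
# Invented VCF fragment: one multi-allelic site, then its biallelic split.
cat <<'EOF'
multi-allelic, as a caller might emit it:
  chr1  100  .  A  G,T  .  .  .
decomposed into biallelic records (POS 100 repeated):
  chr1  100  .  A  G    .  .  .
  chr1  100  .  A  T    .  .  .
EOF
```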
Closing as resolved.
Hi,
I'm interested in looking a multisample freebayes (10+ BAMs) using cannoli.
Is there any reason why this shouldn't work? For example, I see an ominous parameter -single in this nice example; does this mean only one BAM can be supplied at a time?
https://gist.github.com/heuermh/ba33a061957e99b6719ed9712a7ace67
Thanks, Colin