HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
233 stars 27 forks source link

malformed metadata and header in temporary vcfs when running clair3 with whatshap phase with large bam file #336

Open blackbeerd opened 2 weeks ago

blackbeerd commented 2 weeks ago

I'm running clair3 with the --use_whatshap_for_final_output_haplotagging option (see clair3 command at bottom) on a 1.1TB bam generated on the ONT platform (basecalling and alignment with dorado/minimap2) and the final output bam is missing haplotype info. On step 7/7 "Phasing VCF output in parallel using WhatsHap" whatshap throws parse and invalid contig errors that indicate it is trying to read a ##cmdline comment in the vcf as a contig:

[INFO] 7/7 Phasing VCF output in parallel using WhatsHap
This is WhatsHap 1.7 running under Python 3.9.0
[W::vcf_parse] Contig '##cmdline=/path/to/mambaforge/envs/clair3/bin/run_clair3.sh --bam_fn=/path/to/input.bam --ref_fn=/path/to/ref_files/hg38/no_chr/Homo_sapiens.GRCh38.dna.primary_assembly.fa --model_path=/path/to/mambaforge/envs/clair3/bin/models/r941_prom_hac_g360+g422 --output=/path/to/clair3_output --threads=8 --platform=ont --snp_min_af=0.3 --min_coverage=10 --use_whatshap_for_final_output_haplotagging --call_snp_only' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::bcf_hrec_check] Invalid contig name: "##cmdline=/path/to/mambaforge/envs/clair3/bin/run_clair3.sh --bam_fn ....

After checking input vcf used for this step (called "merge_1.vcf"), I found what appears to be a malformed vcf metadata where there is a "##cmdline" entry AFTER the field header line, #CHROM POS ... etc., like so:

##fileformat=VCFv4.2
##source=Clair3
##clair3_version=1.0.10
##cmdline=/mambaforge/envs/clair3/bin/run_clair3.sh --bam_fn=input.bam --ref_fn=/ref_files/hg38/no_chr/Homo_sapiens.GRCh38.dna.primary_assembly.fa --model_path=mambaforge/envs/clair3/bin/models/r941_prom_hac_g360+g422 --output=/clair3_output --threads=8 --platform=ont --snp_min_af=0.3 --min_coverage=10 --use_whatshap_for_final_output_haplotagging --call_snp_only
...
##contig=<ID=KI270419.1,length=1029>
##contig=<ID=KI270336.1,length=1026>
##contig=<ID=KI270312.1,length=998>
##contig=<ID=KI270539.1,length=993>
##contig=<ID=KI270385.1,length=990>
##contig=<ID=KI270423.1,length=981>
##contig=<ID=KI270392.1,length=971>
##contig=<ID=KI270394.1,length=970>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
##cmdline=/mambaforge/envs/clair3/bin/run_clair3.sh --bam_fn=input.bam --ref_fn=/ref_files/hg38/no_chr/Homo_sapiens.GRCh38.dna.primary_assembly.fa --model_path=mambaforge/envs/clair3/bin/models/r941_prom_hac_g360+g422 --output=/clair3_output --threads=8 --platform=ont --snp_min_af=0.3 --min_coverage=10 --use_whatshap_for_final_output_haplotagging --call_snp_only
1       10108   .       C       CT      10.79   PASS    F       GT:GQ:DP:AD:AF  1/1:10:245:89,86:0.3510
1       10321   .       C       T       5.62    PASS    F       GT:GQ:DP:AD:AF  0/1:5:310:122,131:0.4226
1       10622   .       T       G       22.59   PASS    P       GT:GQ:DP:AD:AF  1/1:22:378:10,289:0.7646

I'm not sure what is causing this but my hacky fix was to remove that line from the temporary vcfs (merge_{n}.vcf) and rerun clair3_c_impl.sh from the whatshap phase step... I removed the incorrectly placed vcf lines with this grep command (ran in the tmp/mergedoutput/ directory), although now the metadata will be missing the command line record (oh well): `for i in merge*.vcf; do mv $i $i.tmp; cat $i.tmp | grep -v "##cmdline" > $i; done`

Thanks, Chris

Original Clair3 command line:

run_clair3.sh --bam_fn=input.bam --ref_fn=path_to/ref_files/hg38/no_chr/Homo_sapiens.GRCh38.dna.primary_a
ssembly.fa --threads=8 --platform=ont --snp_min_af=0.3 --indel_min_af=0.3 --model_path=path_to/mambaforge/envs/clair
3/bin/models/r941_prom_hac_g360+g422 --output=path_to/clair3_output --use_what
shap_for_final_output_haplotagging
zhengzhenxian commented 2 weeks ago

Hi @blackbeerd,

Many thanks for reporting this. Sad that we cannot repeat the issue locally.

Could you share the logs ${OUPUT_DIR}/run_clair3.log and ${OUPUT_DIR}/log or send them to my email zxzheng@cs.hku.hk? That would help us to understand the malformed VCF header.

Zhenxian

blackbeerd commented 2 weeks ago

Sure - I will send them via email if the files are not too large.


From: zhenxian @.> Sent: Monday, September 2, 2024 4:21:55 AM To: HKU-BAL/Clair3 @.> Cc: Christopher Boniface @.>; Mention @.> Subject: [EXTERNAL] Re: [HKU-BAL/Clair3] malformed metadata and header in temporary vcfs when running clair3 with whatshap phase with large bam file (Issue #336)

Hi @blackbeerdhttps://urldefense.com/v3/__https://github.com/blackbeerd__;!!Mi0JBg!LGHXRf_72reW8i1t44vQdfgUN9rssSfT4X6x8xkPatzkbFCoUYZOJn9cSh9oRYILsHAyT8cH7wekVVKJVWI7BP0H$,

Many thanks for reporting this. Sad that we cannot repeat the issue locally.

Could you share the logs ${OUPUT_DIR}/run_clair3.log and ${OUPUT_DIR}/log or send them to my email @.**@.>? That would help us to understand the malformed VCF header.

Zhenxian

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/HKU-BAL/Clair3/issues/336*issuecomment-2324490734__;Iw!!Mi0JBg!LGHXRf_72reW8i1t44vQdfgUN9rssSfT4X6x8xkPatzkbFCoUYZOJn9cSh9oRYILsHAyT8cH7wekVVKJVUrxJX2d$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AGWRDKO6XY5RXZ3PXJFGGMLZURC5HAVCNFSM6AAAAABNNJPJKWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRUGQ4TANZTGQ__;!!Mi0JBg!LGHXRf_72reW8i1t44vQdfgUN9rssSfT4X6x8xkPatzkbFCoUYZOJn9cSh9oRYILsHAyT8cH7wekVVKJVeu-a2_c$. You are receiving this because you were mentioned.Message ID: @.***>