HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Sometimes a newline is missing between the gvcf lines #200

Closed bartcharbon closed 1 year ago

bartcharbon commented 1 year ago

Sometimes we are getting gvcf output files from clair like:

    chr1 1234 . C <NON_REF> 0 . END=1235;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0chr1 1245 . G <NON_REF> 0 . END=1246;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0

instead of:

    chr1 1234 . C <NON_REF> 0 . END=1235;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
    chr1 1245 . G <NON_REF> 0 . END=1246;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0

The newline is missing between the records.

When we try again with exactly the same input, the problem does not occur and the run produces a valid gvcf file. Unfortunately, this makes it hard to give you specific steps to reproduce it.

We see this behaviour on several different clusters and with multiple different input files.

aquaskyline commented 1 year ago

I tried but could not repeat the issue. I will leave the issue open in case anyone can provide more clues to the problem.

Goatofmountain commented 1 year ago

Hello, this looks like an input/output problem in Clair3. I get the same problem with my data when I run Clair3 as below:

    run_clair3.sh --bam_fn=$1 --ref_fn=myreference.fa --threads=100 --model_path=/home/KT/Clair3-main/models/r941_prom_hac_g360+g422_1235 --platform=ont --output=$2

After I reduced the number of threads, the output for the same data no longer showed these errors:

    run_clair3.sh --bam_fn=$1 --ref_fn=myreference.fa --threads=8 --model_path=/home/KT/Clair3-main/models/r941_prom_hac_g360+g422_1235 --platform=ont --output=$2

I speculate that the cause of this error may be related to multi-threaded output.

One can use "pysam.VariantFile" to scan a VCF and find whether and where it has such format errors:

    import pysam

    # YourVCF is the path to the VCF/gVCF file to check
    vcf_file1 = pysam.VariantFile(YourVCF, 'r')
    # Read through every record in the file
    Test = [record for record in vcf_file1]

I hope this helps, KT
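As a dependency-free complement to the pysam scan above, the fused records can also be located by counting tab-separated columns. This is only a sketch, not part of Clair3: the function name `find_fused_records` is hypothetical, and it assumes the file contains a standard `#CHROM` header line so that the expected column count is known. Two records joined without a newline then show up as a line with too many columns (19 instead of 10 for a single-sample gVCF).

```python
import io


def find_fused_records(handle):
    """Yield (line_number, n_fields, expected) for data lines whose
    column count differs from the #CHROM header line."""
    expected = None
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            expected = len(line.split("\t"))
            continue
        n_fields = len(line.split("\t"))
        if expected is not None and n_fields != expected:
            yield line_number, n_fields, expected


# Small in-memory demonstration using the records from the report above.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE"
rec1 = "chr1\t1234\t.\tC\t<NON_REF>\t0\t.\tEND=1235;AC=0;AN=2\tGT:GQ:MIN_DP:PL\t0/0:1:0:0,0,0"
rec2 = "chr1\t1245\t.\tG\t<NON_REF>\t0\t.\tEND=1246;AC=0;AN=2\tGT:GQ:MIN_DP:PL\t0/0:1:0:0,0,0"
text = "\n".join([header, rec1, rec1 + rec2, rec2]) + "\n"

hits = list(find_fused_records(io.StringIO(text)))
for line_number, n_fields, expected in hits:
    print(f"line {line_number}: {n_fields} columns, expected {expected}")
```

For a real file, the same function can be given an open text handle (for example from `open(path)` or `gzip.open(path, "rt")`) in place of the `io.StringIO` object.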

aquaskyline commented 1 year ago

Could you please try running taskset -c 1 nproc in your environment (must be the same as where you run Clair3) and let me know the result?

Goatofmountain commented 1 year ago

Here is the result:

    taskset -c 1 nproc
    1

aquaskyline commented 1 year ago

As of Oct 25th, we are still unable to reproduce the problem. As a side note, reducing '--threads' is a good rule of thumb for resolving messy newlines in GVCF output.
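Where rerunning with fewer threads is not practical, an affected file could in principle be repaired after the fact. The following is a heuristic sketch only, not something proposed in this thread or provided by Clair3: the helper `split_fused_line` is hypothetical, and it assumes that the list of contig names is known (for example from the reference) and that the only damage is a missing newline. It splits the field where the last sample value runs into the next contig name, and leaves the line unchanged whenever the split point is ambiguous.

```python
def split_fused_line(line, n_columns, contigs):
    """Split a line made of several records fused without newlines.

    Returns a list of record lines. If the line is not fused, or the
    boundary cannot be identified unambiguously, returns [line] unchanged.
    """
    fields = line.split("\t")
    step = n_columns - 1  # each fused boundary swallows one tab-separated field
    if len(fields) <= n_columns or (len(fields) - 1) % step != 0:
        return [line]

    records, current = [], []
    for index, field in enumerate(fields):
        is_boundary = index % step == 0 and 0 < index < len(fields) - 1
        if not is_boundary:
            current.append(field)
            continue
        # The boundary field is "<last sample value><next contig name>".
        matches = [c for c in contigs if field.endswith(c) and len(field) > len(c)]
        if len(matches) != 1:
            return [line]  # ambiguous or unknown contig: do not guess
        contig = matches[0]
        current.append(field[: -len(contig)])
        records.append("\t".join(current))
        current = [contig]
    records.append("\t".join(current))
    return records


# Demonstration with the two records from the original report.
rec1 = "chr1\t1234\t.\tC\t<NON_REF>\t0\t.\tEND=1235;AC=0;AN=2\tGT:GQ:MIN_DP:PL\t0/0:1:0:0,0,0"
rec2 = "chr1\t1245\t.\tG\t<NON_REF>\t0\t.\tEND=1246;AC=0;AN=2\tGT:GQ:MIN_DP:PL\t0/0:1:0:0,0,0"
repaired = split_fused_line(rec1 + rec2, 10, ["chr1", "chr2"])
for record in repaired:
    print(record)
```

Because the split relies on matching contig names as suffixes, it can be wrong for unusual contig naming, so regenerating the output with a lower '--threads' value remains the safer option.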