bartcharbon closed this issue 1 year ago
I tried but could not repeat the issue. I will leave the issue open in case anyone can provide more clues to the problem.
Hello,
It seems like an input/output problem in Clair3.
I get the same problem with my data when I run Clair3 as below:
run_clair3.sh --bam_fn=$1 --ref_fn=myreference.fa --threads=100 --model_path=/home/KT/Clair3-main/models/r941_prom_hac_g360+g422_1235 --platform=ont --output=$2
After I reduced the number of threads, the output for the same data no longer showed these errors:
run_clair3.sh --bam_fn=$1 --ref_fn=myreference.fa --threads=8 --model_path=/home/KT/Clair3-main/models/r941_prom_hac_g360+g422_1235 --platform=ont --output=$2
I speculate that the cause of this error may be related to multi-threaded output.
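One can use "pysam.VariantFile" to scan a VCF and find whether and where it contains such format errors:
import pysam
# YourVCF is a placeholder for the path to the VCF/GVCF file you want to check
vcf_file1 = pysam.VariantFile(YourVCF, 'r')
# Reading every record forces the whole file to be parsed, so a malformed record should surface here
Test = [record for record in vcf_file1]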
I hope this helps, KT
Could you please try running taskset -c 1 nproc
in your environment (must be the same as where you run Clair3) and let me know the result?
Here is the result: taskset -c 1 nproc 1
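For context, nproc reports the number of processing units available to the current process, so under taskset -c 1 it is expected to print 1. The same distinction can be seen from Python. This is only a sketch: cpu_counts is an illustrative name, and nothing here is taken from how Clair3 itself detects CPUs.

```python
import os


def cpu_counts():
    """Return (total_cpus, cpus_usable_by_this_process)."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours CPU affinity masks such as those set by taskset
        usable = len(os.sched_getaffinity(0))
    else:
        # Other platforms: no affinity API here, fall back to the total
        usable = total
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_counts()
    print(f"CPUs in the machine: {total}, usable by this process: {usable}")
```

Running the script under taskset -c 1 on Linux should report a single usable CPU while the machine total stays unchanged.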
As of Oct 25th, we are still unable to reproduce the problem. As a side note, reducing '--threads' is a good rule of thumb for resolving messy newlines in GVCF output.
Sometimes we get GVCF output files from Clair3 like:
chr1 1234 . C <NON_REF> 0 . END=1235;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0chr1 1245 . G <NON_REF> 0 . END=1246;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
instead of:
chr1 1234 . C <NON_REF> 0 . END=1235;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1 1245 . G <NON_REF> 0 . END=1246;AC=0;AN=2 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
the newline is missing between the records.
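A rough way to locate such fused records without any VCF library is to compare each data line's column count with the #CHROM header line. The sketch below assumes a tab-delimited GVCF, either plain or gzip/bgzip-compressed. The function names find_fused_records and scan_file are made up for illustration and are not part of Clair3 or pysam.

```python
import gzip
from typing import Iterable, List, Tuple


def find_fused_records(lines: Iterable[str]) -> List[Tuple[int, int, int]]:
    """Return (line_number, found_columns, expected_columns) for suspicious lines.

    A record whose trailing newline was lost is fused with the next record,
    so it carries more tab-separated columns than the #CHROM header line.
    """
    expected = None
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            expected = len(line.split("\t"))
            continue
        if expected is None:
            continue  # no column header seen yet, nothing to compare against
        found = len(line.split("\t"))
        if found != expected:
            problems.append((line_number, found, expected))
    return problems


def scan_file(path: str) -> List[Tuple[int, int, int]]:
    """Open a plain or gzip/bgzip-compressed VCF and scan it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return find_fused_records(handle)
```

Calling scan_file on the GVCF path returns an empty list when every record has the expected number of columns. Otherwise each tuple gives the 1-based line number to inspect.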
When we try again with exactly the same input, the problem does not occur and we get a valid GVCF file. Unfortunately, this makes it hard to give you specific steps to reproduce it.
We see this behaviour on several different clusters and with multiple different input files.