HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
244 stars 26 forks source link

Seek for help running representation unifcation #121

Closed shiying-sxu closed 2 years ago

shiying-sxu commented 2 years ago

Hi, I met a problem in the step of representation unification. After I execute the command according to representation_unification.md, my vcf output folder is empty. I carefully check the steps and find that the CreateTensorFullAlignment command is missing in step 4 in the representation_unification.md file provided now, but after I add this command Found that the files in my CANDIDATE_DETAILS_PATH are all empty. Can you help me? Sincerely

aquaskyline commented 2 years ago

RU doesn't require CreateTensorFullAlignment and we are unable to pinpoint the problem per your description. We suggest you check the output of each step to ensure they are intact.

shiying-sxu commented 2 years ago

The result of my running is as follows, can you help me to see what is wrong?

This is WhatsHap 1.2.1 running under Python 3.6.10 Working on 1 samples from 1 family Leaving chromosome 'chr1' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr2' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr3' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr4' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr5' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr6' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr7' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr8' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr9' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr10' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr11' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr12' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr13' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr14' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr15' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr16' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr17' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr18' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr19' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr20' unchanged (present in VCF but not requested by option --chromosome) Leaving chromosome 'chr21' unchanged (present in VCF but not requested by option --chromosome) ======== Working on chromosome 'chr22' ---- Processing individual HG001 Using maximum coverage per sample of 15X Number of variants skipped due to missing genotypes: 0 Number of remaining heterozygous variants: 24447 Reading alignments and detecting alleles ... Found 84382 reads covering 24442 variants Kept 66806 reads that cover at least two variants each Reducing coverage to at most 15X by selecting most informative reads ... Selected 18985 reads covering 24437 variants Variants covered by at least one phase-informative read in at least one individual after read selection: 24437 Phasing 1 sample by solving the MEC problem ... MEC cost: 358080 No. of phased blocks: 310 Largest block contains 9714 variants (39.8% of accessible variants) between position 30263536 and 42499006 ======== Writing VCF Done writing VCF Changed 120 genotypes while writing VCF Leaving chromosome 'chrX' unchanged (present in VCF but not requested by option --chromosome)

== SUMMARY == Maximum memory usage: 0.417 GB Time spent reading BAM/CRAM: 82.1 s Time spent parsing VCF: 57.8 s Time spent selecting reads: 8.1 s Time spent phasing: 125.4 s Time spent writing VCF: 48.8 s Time spent finding components: 4.3 s Time spent on rest: 32.2 s Total elapsed time: 358.6 s Found 1 sample(s) in input VCF Found 1 sample(s) in BAM file Reading alignments and detecting alleles ... Found 84015 reads covering 24143 variants

== SUMMARY == Total alignments processed: 431114 Alignments that could be tagged: 83055 Alignments spanning multiple phase sets: 0 Finished in 211.6 s [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [faidx] Truncated sequence: chr22:46903593-50973515 [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files [mpileup] 1 samples in 1 input files cat: '/work/Clair3-main/data/outputref-HG001_GRCh38/vcfoutput/vcf*': 没有那个文件或目录 [WARNING] No vcf file found, please check the setting [WARNING] No variant found, please check the setting

zhengzhenxian commented 2 years ago

Hi, seem no record found in your VCF output, are you following the steps here for RU? There is no need to add CreateTensorFullAlignment step in RU, as we have included this step in UnifyRepresentation submodule, pls check the code here for more details.

shiying-sxu commented 2 years ago

The command I execute is as follows

Setup variables

CLAIR3="/work/Clair3-main/clair3.py" # clair3.py PYPY="/home/user/anaconda3/envs/clair3/bin/pypy3.6" # e.g. pypy3 WHATSHAP="/home/user/anaconda3/envs/clair3/bin/whatshap" # e.g. whatshap PARALLEL="/home/user/anaconda3/envs/clair3/bin/parallel" # e.g. parallel TABIX="/home/user/anaconda3/envs/clair3/bin/tabix" # e.g. tabix SAMTOOLS="/home/user/anaconda3/envs/clair3/bin/samtools" # e.g. samtools

Input parameters

PLATFORM="ont" # e.g. {ont, hifi, ilmn} VCF_FILE_PATH="/work/Clair3-main/data/datatest/HG001/GRCh38/GRCh38.vcf.gz" # [YOUR_VCF_FILE_PATH]e.g. hg003.vcf.gz BAM_FILE_PATH="/work/Clair3-main/data/datatest/HG001/GRCh38/GRCh38.bam" # e.g. hg003.bam alignment.bam REFERENCE_FILE_PATH="/work/Clair3-main/data/datatest/HG001/GRCh38/GRCh38.fa" # e.g. hg003.fasta BED_FILE_PATH="/work/Clair3-main/data/datatest/HG001/GRCh38/GRCh38.bed" # e.g. hg003.bed OUTPUT_DIR="/work/Clair3-main/data/outputref-HG001_GRCh38" # e.g. output

Chromosome prefix ("chr" if chromosome names have the "chr" prefix)染色体前缀

CHR_PREFIX="chr"

array of chromosomes (do not include "chr"-prefix)

CHR=(22)

Number of threads to be used

THREADS=24

The number of chucks to be divided into for parallel processing

并行加工需分割的卡盘数量

chunk_num=15 CHUNK_LIST=seq 1 ${chunk_num}

Minimum AF required for a candidate variant

MIN_AF=0.08

Temporary working directory

SPLIT_BED_PATH="${OUTPUT_DIR}/split_beds" VCF_OUTPUT_PATH="${OUTPUT_DIR}/vcf_output" VAR_OUTPUT_PATH="${OUTPUT_DIR}/var" PHASE_VCF_PATH="${OUTPUT_DIR}/phased_vcf" PHASE_BAM_PATH="${OUTPUT_DIR}/phased_bam"

mkdir -p ${SPLIT_BED_PATH} mkdir -p ${VCF_OUTPUT_PATH} mkdir -p ${VAR_OUTPUT_PATH} mkdir -p ${PHASE_VCF_PATH} mkdir -p ${PHASE_BAM_PATH}

2. Phase VCF file using WhatsHap

To apply representation unification, using a phased read alignment is highly recommended in order to get more precious unified result.

WhatsHap phasing vcf file if vcf file includes '|' in INFO tag

cd ${OUTPUT_DIR}

WhatsHap phasing vcf file if vcf file includes '|' in INFO tag

${WHATSHAP} unphase ${VCF_FILE_PATH} > ${OUTPUT_DIR}/INPUT.vcf.gz

WhatsHap phase vcf file

${PARALLEL} --joblog ${PHASE_VCF_PATH}/phase.log -j${THREADS} \ "${WHATSHAP} phase \ --output ${PHASE_VCFPATH}/phased{1}.vcf.gz \ --reference ${REFERENCE_FILE_PATH} \ --chromosome ${CHR_PREFIX}{1} \ --ignore-read-groups \ --distrust-genotypes \ ${OUTPUT_DIR}/INPUT.vcf.gz \ ${BAM_FILE_PATH}" ::: ${CHR[@]}

Index phased vcf file

${PARALLEL} -j ${THREADS} tabix -p vcf ${PHASE_VCFPATH}/phased{1}.vcf.gz ::: ${CHR[@]}

```

3. Haplotag read alignment using WhatsHap

WhatsHap haplotags bam file

${PARALLEL} --joblog ${PHASE_BAM_PATH}/haplotag.log -j${THREADS} \ "${WHATSHAP} haplotag \ --output ${PHASE_BAM_PATH}/{1}.bam \ --reference ${REFERENCE_FILE_PATH} \ --regions ${CHR_PREFIX}{1} \ --ignore-read-groups \ ${PHASE_VCFPATH}/phased{1}.vcf.gz \ ${BAM_FILE_PATH}" ::: ${CHR[@]}

Index the phased bam file using samtools

${PARALLEL} --joblog ${PHASE_BAM_PATH}/index.log -j ${THREADS} ${SAMTOOLS} index -@12 ${PHASE_BAM_PATH}/{1}.bam ::: ${CHR[@]}

4.Prepare true variant set

Split bed file regions according to the contig name and extend bed region

${PARALLEL} --joblog ${SPLIT_BED_PATH}/split_extend_bed.log -j${THREADS} \ "${PYPY} ${CLAIR3} SplitExtendBed \ --bed_fn ${BED_FILE_PATH} \ --output_fn ${SPLIT_BED_PATH}/{1} \ --ctgName ${CHR_PREFIX}{1}" ::: ${CHR[@]}

Get true variant label information from VCF file

${PARALLEL} --joblog ${VAR_OUTPUT_PATH}/get_truth.log -j${THREADS} \ "${PYPY} ${CLAIR3} GetTruth \ --vcf_fn ${PHASE_VCFPATH}/phased{1}.vcf.gz \ --ctgName ${CHR_PREFIX}{1} \ --var_fn ${VAR_OUTPUTPATH}/var{1}" ::: ${CHR[@]}

5. Unify Representation for true variant set and candidate sites

${PARALLEL} --joblog ${OUTPUT_DIR}/unify_repre.log -j${THREADS} \ "${PYPY} ${CLAIR3} UnifyRepresentation \ --bam_fn ${PHASE_BAM_PATH}/{1}.bam \ --var_fn ${VAR_OUTPUTPATH}/var{1} \ --ref_fn ${REFERENCE_FILE_PATH} \ --bed_fn ${BED_FILE_PATH} \ --extend_bed ${SPLIT_BED_PATH}/{1} \ --output_vcf_fn ${VCF_OUTPUTPATH}/vcf{1}_{2} \ --min_af ${MIN_AF} \ --chunk_id {2} \ --chunk_num ${chunk_num} \ --platform ${PLATFORM} \ --ctgName ${CHR_PREFIX}{1}" ::: ${CHR[@]} ::: ${CHUNK_LIST[@]} > ${OUTPUT_DIR}/RU.log

6. Merge and sort unified VCF output

cat ${VCF_OUTPUTPATH}/vcf* | ${PYPY} ${CLAIR3} SortVcf --output_fn ${OUTPUT_DIR}/unified.vcf bgzip -f ${OUTPUT_DIR}/unified.vcf tabix -f -p vcf ${OUTPUT_DIR}/unified.vcf.gz

zhengzhenxian commented 2 years ago

Still no hint for the empty output. Could you check each file's validity in step 5? We speculated that some files are empty or missing in your step 5.

shiying-sxu commented 2 years ago

Still no hint for the empty output. Could you check each file's validity in step 5? We speculated that some files are empty or missing in your step 5.

After I reinstalled the environment, it can run correctly, I think it should be an environment problem. In addition, I have a problem that when I try to run quickdemo, I get an error no module named libclair3, and I can't install it correctly with the command conda install libclair3

zhengzhenxian commented 2 years ago

After I reinstalled the environment, it can run correctly, I think it should be an environment problem. In addition, I have a problem that when I try to run quickdemo, I get an error no module named libclair3, and I can't install it correctly with the command conda install libclair3

libclair3 is built via this command(source activate clair3 && make PREFIX=${CONDA_PREFIX}) in installation option 4.