KolmogorovLab / hapdup

Pipeline to convert a haploid assembly into diploid

Question about no result from Margin #27

Open yzhang-github-pub opened 1 year ago

yzhang-github-pub commented 1 year ago

Dear author,

Thanks for developing hapdup, which works well for me most of the time. Occasionally, though, Margin fails even when PEPPER produces the expected variant calls. Which parameter(s) can I tune? Please advise.

Here is an example Margin log from a run where 3 variant calls from PEPPER were expected but Margin kept none. Is there a way to loosen the criteria?

Parsed 3 total VCF entries from /sample1/hapdup/pepper/PEPPER_VARIANTFULL.vcf; kept 0 HETs, skipped 0 for region, 1 for not being PASS, 2 for being homozygous, 0 for being INDEL
No valid VCF entries found!

mikolmogorov commented 1 year ago

Looks like PEPPER did not find any heterozygous variants in your assembly. Could it be homozygous? If you tell us more about the genome and your dataset and provide the PEPPER and Margin logs, I should be able to help more.

yzhang-github-pub commented 1 year ago

PEPPER log:

[04-04-2023 13:59:45] INFO: ONT VARIANT CALLING MODE SELECTED.
[04-04-2023 13:59:45] INFO: MODE: PEPPER SNP
[04-04-2023 13:59:45] INFO: THRESHOLDS ARE SET TO:
[04-04-2023 13:59:45] INFO: MIN MAPQ: 5
[04-04-2023 13:59:45] INFO: MIN SNP BASEQ: 1
[04-04-2023 13:59:45] INFO: MIN INDEL BASEQ: 1
[04-04-2023 13:59:45] INFO: MIN SNP FREQUENCY: 0.1
[04-04-2023 13:59:45] INFO: MIN INSERT FREQUENCY: 0.15
[04-04-2023 13:59:45] INFO: MIN DELETE FREQUENCY: 0.15
[04-04-2023 13:59:45] INFO: MIN COVERAGE THRESHOLD: 3
[04-04-2023 13:59:45] INFO: MIN CANDIDATE SUPPORT: 2
[04-04-2023 13:59:45] INFO: MIN SNP CANDIDATE FREQUENCY: 0.1
[04-04-2023 13:59:45] INFO: MIN INDEL CANDIDATE FREQUENCY: 0.1
[04-04-2023 13:59:45] INFO: SKIP INDEL CANDIDATES: False
[04-04-2023 13:59:45] INFO: MAX ALLOWED CANDIDATE IN ONE SITE: 4
[04-04-2023 13:59:45] INFO: MIN SNP PREDICTIVE VALUE: 0.1
[04-04-2023 13:59:45] INFO: MIN INSERT PREDICTIVE VALUE: 0.25
[04-04-2023 13:59:45] INFO: MIN DELETE PREDICTIVE VALUE: 0.25
[04-04-2023 13:59:45] INFO: SNP QV CUTOFF FOR RE-GENOTYPING: 15
[04-04-2023 13:59:45] INFO: INDEL QV CUTOFF FOR RE-GENOTYPING: 10
[04-04-2023 13:59:45] INFO: REPORT ALL SNPs ABOVE THRESHOLD: 0
[04-04-2023 13:59:45] INFO: REPORT ALL INDELs ABOVE THRESHOLD: 0
[04-04-2023 13:59:45] INFO: CALL VARIANT MODULE SELECTED
[04-04-2023 13:59:45] INFO: RUN-ID: 04042023_135945
[04-04-2023 13:59:45] INFO: IMAGE OUTPUT: /temp/sample_test1/hapdup/pepper/images_04042023_135945/
[04-04-2023 13:59:45] INFO: STEP 1/3 GENERATING IMAGES:
[04-04-2023 13:59:45] INFO: COMMON CONTIGS FOUND: ['26530']
[04-04-2023 13:59:45] INFO: TOTAL CONTIGS: 1 TOTAL INTERVALS: 1 TOTAL BASES: 11579
[04-04-2023 13:59:46] INFO: STARTING PROCESS: 0 FOR 1 INTERVALS
[04-04-2023 13:59:46] INFO: THREAD 0 FINISHED SUCCESSFULLY.
[04-04-2023 13:59:46] INFO: FINISHED IMAGE GENERATION
[04-04-2023 13:59:46] INFO: TOTAL ELAPSED TIME FOR GENERATING IMAGES: 0 Min 0 Sec
[04-04-2023 13:59:46] INFO: STEP 2/3 RUNNING INFERENCE
[04-04-2023 13:59:46] INFO: OUTPUT: /temp/sample_test1/hapdup/pepper/predictions_04042023_135945/
[04-04-2023 13:59:46] INFO: DISTRIBUTED CPU SETUP.
[04-04-2023 13:59:46] INFO: TOTAL CALLERS: 16
[04-04-2023 13:59:46] INFO: THREADS PER CALLER: 1
[04-04-2023 13:59:46] INFO: MODEL LOADING TO ONNX
[04-04-2023 13:59:46] INFO: SAVING MODEL TO ONNX
/usr/local/lib/python3.8/dist-packages/torch/onnx/symbolic_opset9.py:2095: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
warnings.warn("Exporting a model to ONNX with a batch_size other than 1, " +
[04-04-2023 13:59:47] INFO: SETTING THREADS TO: 1.
[04-04-2023 13:59:47] INFO: STARTING INFERENCE.
[04-04-2023 13:59:47] INFO: TOTAL SUMMARIES: 0.
[04-04-2023 13:59:47] INFO: THREAD 0 FINISHED SUCCESSFULLY.
[04-04-2023 13:59:47] INFO: FINISHED PREDICTION
[04-04-2023 13:59:47] INFO: ELAPSED TIME: 0 Min 0 Sec
[04-04-2023 13:59:47] INFO: PREDICTION FINISHED SUCCESSFULLY.
[04-04-2023 13:59:47] INFO: TOTAL ELAPSED TIME FOR INFERENCE: 0 Min 1 Sec
[04-04-2023 13:59:47] INFO: STEP 3/3 FINDING CANDIDATES
[04-04-2023 13:59:47] INFO: OUTPUT: /temp/sampletest1/hapdup/pepper/
[04-04-2023 13:59:47] INFO: STARTING CANDIDATE FINDING.
[04-04-2023 13:59:47] INFO: FINISHED PROCESSING, TOTAL CANDIDATES FOUND: 3
[04-04-2023 13:59:47] INFO: FINISHED PROCESSING, TOTAL VARIANTS IN PEPPER: 0
[04-04-2023 13:59:47] INFO: FINISHED PROCESSING, TOTAL VARIANTS SELECTED FOR RE-GENOTYPING: 3
[04-04-2023 13:59:47] INFO: TOTAL TIME SPENT ON CANDIDATE FINDING: 0 Min 0 Sec
[04-04-2023 13:59:47] INFO: TOTAL ELAPSED TIME FOR FINDING CANDIDATES: 0 Min 1 Sec

Margin log:

> Parsing model parameters from file: /opt/margin_params/phase/allParams.haplotag.ont-r94g507.hapDup.json
Parsed 3 total VCF entries from /temp/sample_test1/hapdup/pepper/PEPPER_VARIANTFULL.vcf; kept 0 HETs, skipped 0 for region, 1 for not being PASS, 2 for being homozygous, 0 for being INDEL
No valid VCF entries found!

mikolmogorov commented 1 year ago

It seems that PEPPER found no heterozygous SNPs, so Margin failed. If your genome is haploid, hapdup is not applicable, as it is designed to phase diploid contigs.
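One way to check this before Margin runs is to tally the PEPPER VCF by the same categories that appear in the Margin log line (not PASS, homozygous, INDEL, kept HET). The sketch below is only an approximation written for illustration: the function names are hypothetical, and both the order of the checks and the rule that only a literal `PASS` filter counts are assumptions, not Margin's actual implementation.

```python
from collections import Counter


def classify_vcf_record(line):
    """Assign one tab-separated VCF record to a Margin-log-style category.

    Assumption: checks run in the order PASS filter, zygosity, INDEL.
    This mirrors the log wording only, not Margin's real code.
    """
    fields = line.rstrip("\n").split("\t")
    ref, alt, filt = fields[3], fields[4], fields[6]
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    genotype = sample_values[format_keys.index("GT")]
    alleles = genotype.replace("|", "/").split("/")

    if filt != "PASS":
        return "not_pass"
    if len(set(alleles)) == 1:
        return "homozygous"
    if len(ref) != 1 or any(len(a) != 1 for a in alt.split(",")):
        return "indel"
    return "het_snp"


def summarize_vcf(lines):
    """Count records per category, skipping header and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[classify_vcf_record(line)] += 1
    return counts
```

Passing an open file handle for the PEPPER VCF to `summarize_vcf` gives the counts; a `het_snp` count of zero would be consistent with the "No valid VCF entries found!" message above.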

yzhang-github-pub commented 1 year ago

The sample is diploid. For the same sample we detected SNPs from Illumina data, and from the same nanopore data Clair3 called the expected variants. Is there a way for users to loosen the stringency in PEPPER/Margin?

mikolmogorov commented 1 year ago

How many Illumina SNPs did you detect (with allele frequency above ~25%)? It may just be too few; 3 calls is not enough to phase. The expectation is a human-like heterozygosity rate (e.g. 0.1%), with variants distributed relatively uniformly.
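For a sense of scale, a rough back-of-the-envelope calculation can be written out, using the contig length reported in the PEPPER log (TOTAL BASES: 11579) and the 3 VCF entries Margin parsed. The helper names are made up for this illustration, and treating all 3 entries as heterozygous is an optimistic assumption.

```python
def expected_het_sites(contig_length_bp, het_rate):
    """Expected number of heterozygous sites at a uniform rate."""
    return contig_length_bp * het_rate


def observed_rate_percent(num_sites, contig_length_bp):
    """Observed variant density as a percentage of bases."""
    return 100.0 * num_sites / contig_length_bp


contig_length = 11579   # from the PEPPER log above
human_like_rate = 0.001  # 0.1%, the example rate quoted above

print(expected_het_sites(contig_length, human_like_rate))
print(observed_rate_percent(3, contig_length))
```

At 0.1% the contig would carry roughly 11 to 12 heterozygous sites, while 3 sites correspond to a density of about 0.026%.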

yzhang-github-pub commented 1 year ago

According to Clair3 on the nanopore input, the frequency is over 25%, as shown below:

CHROM  POS   ID  REF  ALT  QUAL   FILTER  INFO  FORMAT       SAMPLE
2      6760  .   G    T    19.84  PASS    P     GT:GQ:DP:AF  0/1:19:114:0.4298
2      7705  .   G    T    20.82  PASS    P     GT:GQ:DP:AF  0/1:20:110:0.4000
2      9604  .   C    G    15.62  PASS    P     GT:GQ:DP:AF  0/1:15:64:0.5312
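The AF values can also be pulled out of these records programmatically. The following is a minimal sketch that pairs each FORMAT key with its SAMPLE value and applies the ~25% cutoff mentioned earlier in the thread; `sample_field` is a hypothetical helper, and the records are split on whitespace as pasted here rather than on tabs as in a real VCF file.

```python
def sample_field(format_str, sample_str, key):
    """Look up one per-sample value by its FORMAT key."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))[key]


# The three Clair3 records quoted above, whitespace-separated as pasted.
records = [
    "2 6760 . G T 19.84 PASS P GT:GQ:DP:AF 0/1:19:114:0.4298",
    "2 7705 . G T 20.82 PASS P GT:GQ:DP:AF 0/1:20:110:0.4000",
    "2 9604 . C G 15.62 PASS P GT:GQ:DP:AF 0/1:15:64:0.5312",
]

afs = []
for record in records:
    cols = record.split()
    af = float(sample_field(cols[8], cols[9], "AF"))
    afs.append(af)
    # Report position, allele frequency, and whether it clears the cutoff.
    print(cols[1], af, af > 0.25)
```

All three sites have a 0/1 genotype and an allele frequency above 0.25.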

mikolmogorov commented 1 year ago

Thanks for the info. I think 3 SNPs may be just too few for Margin; it was really designed for phasing long-ish genomic segments with a relatively uniform variant distribution.