eblerjana / pangenie

Pangenome-based genome inference

Haploid Haplotype Reconstruction #88

Open gsc74 opened 1 week ago

gsc74 commented 1 week ago

What is your question? @eblerjana I am working on reconstructing a haploid haplotype using the imputed genotypes from PanGenie. Currently, I am using the following commands:

PanGenie -i Reads.fq -r MHC-CHM13.ref.fa -v MHC_49-MC.vcf -o temp/APD_PG -t32 && bgzip temp/APD_PG_genotyping.vcf
tabix -p vcf temp/APD_PG_genotyping.vcf.gz && rm -rf APD_rec_PG.fasta
bcftools view -e 'GT="het"' temp/APD_PG_genotyping.vcf.gz | bgzip > temp/APD_PG_genotyping_no_homo.vcf.gz && tabix -p vcf temp/APD_PG_genotyping_no_homo.vcf.gz
bcftools consensus -f MHC-CHM13.ref.fa -o Rec_PG.fasta temp/APD_PG_genotyping_no_homo.vcf.gz

In the above commands, I genotype the haploid reads with PanGenie, filter out the heterozygous genotypes, and then apply the remaining imputed genotypes to the reference with bcftools consensus to reconstruct the haploid haplotype.

My question is: Is this the correct way to use PanGenie to reconstruct haplotypes? The input VCF is a phased diploid VCF generated by the minigraph-cactus pipeline and preprocessed with the "prepare-mc-vcf" pipeline.

If applicable: which version of PanGenie are you using? v3.1.0

If applicable: how did you run PanGenie? Please provide the command lines used. Did you run it using Singularity? I installed PanGenie with conda; the command lines are given above.

If applicable: what data are you running PanGenie on? Which species are you analyzing? Which input reads are used? What does the input VCF look like (number of input samples, how it was produced, etc.)? An MHC VCF file generated with the Minigraph-Cactus pipeline and preprocessed with the "prepare-mc-vcf" pipeline.

eblerjana commented 4 days ago

Hi @gsc74, in principle, I think that this approach makes sense. The only thing I'm not sure about is how well the genotyping would work for a haploid sample: the model underlying PanGenie assumes a diploid genome, and we have never specifically evaluated how well it performs on haploid data. But I think it is worth a try. The rate of heterozygous genotypes PanGenie predicts would probably be an indication of how well the genotyping works, since a truly haploid sample should not produce heterozygous calls.
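
One rough way to check the heterozygous rate suggested above is to tally the genotype classes in the PanGenie output before filtering. The following is only a minimal sketch: `gt_summary` is a hypothetical helper (not part of PanGenie or bcftools), it assumes a single-sample VCF with a GT field in FORMAT, and it looks only at the first sample column.

```shell
# Hypothetical helper: summarize genotypes of the first sample in a VCF read
# from stdin (uncompressed text). Not part of PanGenie or bcftools.
# Example (file name taken from the commands in this thread):
#   gzip -dc temp/APD_PG_genotyping.vcf.gz | gt_summary
gt_summary() {
  awk -F'\t' '
    /^#/ { next }                          # skip header lines
    {
      nf = split($9, fmt, ":")             # find the position of GT in FORMAT
      gi = 0
      for (i = 1; i <= nf; i++) if (fmt[i] == "GT") gi = i
      if (gi == 0) next                    # no GT field: ignore the record
      split($10, smp, ":")
      gt = smp[gi]
      if (gt ~ /\./) { miss++; next }      # missing or partially missing call
      n = split(gt, a, "[/|]")             # accept phased and unphased GTs
      called++
      if (n == 2 && a[1] != a[2]) het++    # two different alleles: heterozygous
    }
    END {
      printf "called=%d het=%d missing=%d het_rate=%.4f\n",
             called, het, miss, (called ? het / called : 0)
    }'
}
```

Here `het_rate` is the fraction of called (non-missing) genotypes that are heterozygous. For haploid input, a value close to zero would be consistent with the genotyping working well, while a large value would suggest that the diploid model is struggling with the data.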