eblerjana / pangenie

Pangenome-based genome inference
MIT License

The best method to use the genotyping-pipelines and the vcf-merging to quality control #68

Closed by ld9866 6 months ago

ld9866 commented 6 months ago

Dear developer: We are doing pan-genome analysis on a livestock species. We found that the genotyping-pipelines workflow (https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC), although developed for human genome analysis, also works well for our species. We also noticed the vcf-merging workflow (https://bitbucket.org/jana_ebler/vcf-merging/src/master/pangenome-graph-from-callset/). Both appear to produce a quality-controlled VCF that can be used for the subsequent PanGenie analysis. How should we combine these two methods? Should we run genotyping-pipelines first and then vcf-merging, or should we use vcf-merging only? Best wishes!

ld9866 commented 6 months ago

Our genome is about 2.5 Gb. Before running the prepare-vcf-MC pipeline, the VCF contained 43108274 variants (SNPs, indels and SVs); after filtering, 42010356 variants remained. With the vcf-merging pipeline, 30934452 variants remained after filtering. Since we are not familiar with the subsequent analysis steps, we would like to ask how to deal with this difference. Our genome is 27.

eblerjana commented 6 months ago

As described in the README, the pipeline https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC is specifically for VCFs produced by the minigraph-cactus pipeline. It only works for human data.

The pipeline https://bitbucket.org/jana_ebler/vcf-merging/src/master/pangenome-graph-from-callset/ is generally designed for input VCFs that were produced from alignments of the assemblies to a linear reference genome instead of a pangenome graph.

Can you please provide more details on your input data and on how the VCF that you want to use with PanGenie was generated? Did you generate the VCF from the graph or from alignments of the assemblies to a reference genome? Does the VCF contain overlapping variant records? Which species are you analyzing? Is it a diploid species?
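One way to answer the question about overlapping variant records is a quick scan of the VCF. The following is only a minimal sketch, not part of PanGenie or either pipeline: the function name `find_overlaps` is hypothetical, and it assumes an uncompressed, position-sorted VCF in which each record's span is given by the length of its REF allele (symbolic alleles with an `END` tag are not handled).

```python
def find_overlaps(lines):
    """Return (chrom, pos) for records that start inside an earlier record's REF span.

    Assumes the records are sorted by chromosome and position.
    """
    overlaps = []
    last_chrom = None
    last_end = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, _id, ref = line.split("\t")[:4]
        start = int(pos)
        end = start + len(ref) - 1  # last reference base covered by this record
        if chrom != last_chrom:
            # New chromosome: reset the furthest end seen so far.
            last_chrom, last_end = chrom, 0
        if start <= last_end:
            overlaps.append((chrom, start))
        last_end = max(last_end, end)
    return overlaps
```

For a file on disk, `find_overlaps(open("input.vcf"))` would return the list of overlapping positions, where `input.vcf` is a placeholder path; an empty list means no overlaps were found under the assumptions above.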

eblerjana commented 6 months ago

As I mentioned above, the pipeline https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC in its present state works for human data only, because it assumes the underlying reference genome to be either GRCh38 or CHM13. If your VCF was produced by the Minigraph-Cactus pipeline, it can be used as input to PanGenie right away; no preprocessing is necessary. The "prepare-vcf-MC" pipeline is useful for analyzing variants nested in the bubbles after genotyping, but again, it is not required for PanGenie, so you don't need to run it.

It might be a good idea though to remove variants from your VCF that contain many dots in the genotypes (".|.", ".|x", "x|."). Typically, we remove positions for which more than 20% of the haplotypes have dots in their genotypes.
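The filter described above could be sketched roughly as follows. This is an illustrative example rather than the script used by the PanGenie authors: the function names and the `max_missing` parameter are hypothetical, and the code assumes an uncompressed, tab-separated VCF in which `GT` is the first FORMAT subfield and every allele of every sample counts as one haplotype.

```python
def missing_fraction(sample_fields):
    """Fraction of haplotype alleles that are '.' across all sample columns."""
    total = 0
    missing = 0
    for field in sample_fields:
        gt = field.split(":")[0]  # GT is assumed to be the first FORMAT subfield
        alleles = gt.replace("/", "|").split("|")
        total += len(alleles)
        missing += sum(1 for allele in alleles if allele == ".")
    return missing / total if total else 0.0


def filter_vcf_lines(lines, max_missing=0.2):
    """Yield header lines and the records whose missing fraction is at most max_missing."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # Sample columns start at index 9 (after CHROM..FORMAT).
        if missing_fraction(fields[9:]) <= max_missing:
            yield line
```

With the default threshold, a record is dropped only when more than 20% of its haplotypes are missing, so a position with exactly 20% missing alleles is kept. The kept lines can be written to a new file with `out.writelines(filter_vcf_lines(open("input.vcf")))`, where `input.vcf` and `out` are placeholders.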

ld9866 commented 6 months ago

Ok, thank you for your patient reply. I will conduct follow-up tests according to what you said.