Shuhua-Group / CPC-graph-based-NGS-pipeline

A graph-based pipeline used to call/genotype SNVs/indels/SVs from NGS data

Using 30x NGS real data test #1

Closed JiadongZHONG closed 1 year ago

JiadongZHONG commented 1 year ago

Hi, Dr Xu,

I downloaded the CPC & HPRC minaf0.1 version of the graph to test this pipeline on my 30x NGS real data, and I encountered some issues:

1. Why did I get two reference paths? In the pipeline, I have finished the two steps 01.graph_mapping and 05.vg_call. But my vgcall.vcf file contains variant records on both the GRCh38 and CHM13v2 contigs. I roughly checked the number of variants in the VCF (with the quick tally below): approximately 4 million on the CHM13v2 chromosomes but only a few thousand on GRCh38. Is this caused by mapping the sequencing data onto two reference paths? Additionally, when performing step 02.surject_to_bam, I also found GRCh38 used as the reference coordinate in my sort.bam file, which made it incompatible with my linear reference CHM13v2.fa in the next step, 03.bam_processing.
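(For reference, the rough per-path counts above came from a quick tally along these lines; the commands are just my own check, and the contig-name prefixes are assumptions to verify against the VCF header:)

```sh
# Per-contig variant counts in the vg call output
grep -v '^#' vgcall.vcf | cut -f1 | sort | uniq -c | sort -rn

# Summed per reference, assuming contigs are named like
# "CHM13v2.chr1" and "GRCh38.chr1"
grep -v '^#' vgcall.vcf | cut -f1 | cut -d. -f1 | sort | uniq -c
```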

2. About the reference VCF file needed by PanGenie: I did not see this input VCF provided on your official website, so I used vg deconstruct to generate it myself. I am not sure whether this is correct; the command is as follows:

```sh
vg deconstruct CPC.HPRC.Phase1.CHM13v2-minaf.0.1.xg \
    -P "CHM13v2" \
    -g CPC.HPRC.Phase1.CHM13v2-minaf.0.1.gbwt \
    -r CPC.HPRC.Phase1.CHM13v2-minaf.0.1.snarls \
    -a > CPC.HPRC.Phase1.CHM13v2-minaf.0.1.pangenie.vcf
```

3. About memory usage: I used 360 GB of memory and 36 cores to test 10 samples at once, but still got an 'out of memory' error; the run was interrupted while mapping the 4th sample. Is this because the CPC & HPRC merged graph requires more memory than CPC alone?

TanXinjiang commented 1 year ago

Hi, Jiadong:

Thank you for your interest in our pangenome graph and alpha-test pipeline; there are indeed many bugs still to be fixed.

1. About two reference paths: The CPC-HPRC merged graph we released had both coordinate systems because of the "--xgReference GRCh38" parameter added at the final join step, which in hindsight proved to be completely useless. We therefore recommend using a separate CPC or HPRC CHM13 reference graph, or extracting the CHM13 chromosomal graphs with

```sh
vg chunk -t 128 -x CPC.HPRC.Phase1.CHM13v2-minaf.0.1.xg \
    -C $(for i in {1..22} X Y M; do echo -p CHM13v2.chr$i; done) -O gfa
```

and then generating a new genome-wide reference graph after merging them; please refer to "cactus-graphmap-join" in the MC pipeline for the specific commands.

2. About the reference VCF file needed by PanGenie: Before it can be used as PanGenie's input, the deconstructed VCF still requires several processing steps, including removal of the GRCh38 sample, vcfbub to remove large bubbles, and vcfwave + vcfcreatemulti to decompose complex variants; you can look at the Methods section of the HPRC article for details (a rough sketch follows at the end of this comment).

3. About memory usage: We highly recommend using snakemake for task scheduling. In the script we have configured the parameters based on the estimated resource consumption of each step; for example, 01.vg_giraffe typically requires about 50 GB of memory, so snakemake will limit the number of parallel tasks according to the server's available memory and avoid memory overflow (see the example invocation below).

Any other suggestions are welcome!

Best wishes!
Xinjiang Tan
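For point 2, here is a minimal sketch of that post-processing chain, assuming the deconstructed VCF from above and that the graph's GRCh38 haplotypes appear under a sample named GRCh38 (check your VCF header); the thresholds (-l 0 -r 100000 for vcfbub, -I 1000 for vcfwave) follow the HPRC methods and may need adjusting:

```sh
# 1. Drop the GRCh38 sample (sample name is an assumption; check the header)
bcftools view -s ^GRCh38 -O z -o noGRCh38.vcf.gz \
    CPC.HPRC.Phase1.CHM13v2-minaf.0.1.pangenie.vcf

# 2. Keep only top-level bubbles and drop sites whose REF allele is
#    longer than 100 kb
vcfbub -l 0 -r 100000 --input noGRCh38.vcf.gz > bub.vcf

# 3. Realign ALT alleles to decompose nested/complex variants, then
#    merge overlapping records back into multi-allelic sites
vcfwave -I 1000 bub.vcf > wave.vcf
vcfcreatemulti wave.vcf > pangenie.input.vcf
```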
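For point 3, the idea is to let snakemake enforce a memory budget instead of launching all 10 mappings at once. A minimal sketch, assuming the pipeline's Snakefile declares per-rule memory (e.g. resources: mem_mb=50000 for 01.vg_giraffe):

```sh
# Cap the scheduler at the machine's 360 GB; at ~50 GB per giraffe job,
# snakemake then runs at most ~7 mapping jobs in parallel rather than
# all 10 at once.
snakemake --cores 36 --resources mem_mb=360000
```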

JiadongZHONG commented 1 year ago

Many thanks for your help, Xinjiang!