HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Representation Unification Problem Execution #84

Closed · ghost closed this issue 2 years ago

ghost commented 2 years ago

Hello!

I am trying to use the Representation Unification module to get a unified VCF from an ONT BAM of about 100 GB. I have been running it for several days on machines with an RTX 2070 and a GTX 1080, but I get nothing: the processes consume all the RAM and never finish. Any advice?

zhengzhenxian commented 2 years ago

Hi, the Representation Unification (RU) module only consumes CPU resources. In our experience, RU takes around 20 GB of memory per thread, and the consumption depends heavily on the base error level of the ONT data. Which Guppy version are you using for your ONT data?
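As a rough rule of thumb (a sketch only, not part of Clair3; the 20 GB-per-thread figure is just our experience above, and the variable names are placeholders), you can budget the number of RU threads from the RAM available on the machine:

```bash
# Budget RU threads from available RAM, assuming ~20 GB per thread
# (rule of thumb only; actual usage depends on the base error level of the data).
TOTAL_RAM_GB=$(free -g | awk '/^Mem:/ {print $2}')
THREADS=$(( TOTAL_RAM_GB / 20 ))
# Keep at least one thread
THREADS=$(( THREADS > 0 ? THREADS : 1 ))
echo "Run Representation Unification with ${THREADS} thread(s)"
```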

You might try adding the options --partition_size=5 and --min_af=0.2 to step 5 of RU and check whether it runs to completion. These settings are for debugging only; to train a model for production, we would not suggest a high --min_af or a low --partition_size.
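For example (only the two options are the point here; the submodule name and the other arguments below are placeholders for whatever your step-5 command already contains, so adapt them to your own script):

```bash
# Sketch only: append the two debugging options to your existing RU step-5 command.
# ${CLAIR3} and ${EXISTING_STEP5_ARGS} are placeholders for your own setup.
pypy3 ${CLAIR3} UnifyRepresentation \
    ${EXISTING_STEP5_ARGS} \
    --partition_size=5 \
    --min_af=0.2
```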

ghost commented 2 years ago

I have been testing with other versions of Guppy and it works perfectly now, thank you very much! One doubt remains, though: in the Training section, the ONT HG002 data is split into parts 1, 2, 3... Does this mean that you download the different parts and merge them, or do you treat them independently?

zhengzhenxian commented 2 years ago

That's great to hear it works!

Pileup training takes the input BAM and creates pileup tensors, handling the BAM in parallel across multiple coverages, chromosomes, and chunks. Full-alignment training takes the phased alignment as input; for efficiency, we split the input BAM into multiple phased BAMs (in the whatshap haplotag step), further split each chromosome into multiple chunks, and then create tensors and merge them into binaries for training.

So we consider only one input BAM (including all contigs) per sample, and we split it and handle the pieces independently purely for efficiency.
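For reference, the haplotagging and splitting step looks roughly like this (a minimal sketch with made-up file names and a simple per-chromosome split; the actual training scripts also handle multiple coverages and finer chunking):

```bash
# Inputs (illustrative names): reference, input BAM, and a phased VCF.
# For whatshap, the VCF must be bgzipped and tabix-indexed, and the BAM indexed.
REF=GRCh38.fa
BAM=HG002.bam
PHASED_VCF=HG002.phased.vcf.gz

# Tag each read with its haplotype using the phased variants
whatshap haplotag -o HG002.haplotagged.bam --reference ${REF} ${PHASED_VCF} ${BAM}
samtools index HG002.haplotagged.bam

# Split the haplotagged BAM per chromosome so the pieces can be processed independently
for CHR in chr{1..22}; do
    samtools view -b HG002.haplotagged.bam ${CHR} > HG002.${CHR}.bam
    samtools index HG002.${CHR}.bam
done
```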

ghost commented 2 years ago

I am referring to the data before alignment. For example, for HG002 in the Training section, if I follow the download link, a very long list of FASTQs appears, and I do not understand whether these were aligned independently to obtain a huge dataset, or whether the ones I underlined in yellow in the screenshot were concatenated and then aligned...

https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=NHGRI_UCSC_panel/HG002/nanopore/Guppy_4.2.2/

(screenshot of the FASTQ file listing)

zhengzhenxian commented 2 years ago

Oh, these raw FASTQs are different GridION sequencing runs of HG002. You can either (1) merge the FASTQs into one large FASTQ file and then align it to get a high-coverage BAM, or (2) align each FASTQ separately to get multiple low-coverage BAMs and then apply samtools merge; both routes give the same high-coverage alignment.

In this case, GM24385_x_Guppy_4.2.2_prom.fastq.gz (x from 1 to 3) will give an HG002 WGS BAM at ~85x; if you align all the FASTQs in the link, the coverage will reach ~432x.
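Either route can be scripted along these lines (a sketch only; the reference path and thread counts are placeholders, and the file names should be adjusted to the exact ones in the listing):

```bash
REF=GRCh38.fa   # placeholder reference path

# Option 1: concatenate the three runs, then align once
cat GM24385_1_Guppy_4.2.2_prom.fastq.gz \
    GM24385_2_Guppy_4.2.2_prom.fastq.gz \
    GM24385_3_Guppy_4.2.2_prom.fastq.gz > HG002_merged.fastq.gz
minimap2 -ax map-ont -t 16 ${REF} HG002_merged.fastq.gz | samtools sort -@ 8 -o HG002.bam
samtools index HG002.bam

# Option 2: align each run separately, then merge the sorted BAMs
for i in 1 2 3; do
    minimap2 -ax map-ont -t 16 ${REF} GM24385_${i}_Guppy_4.2.2_prom.fastq.gz \
        | samtools sort -@ 8 -o HG002_run${i}.bam
done
samtools merge -@ 8 HG002.bam HG002_run1.bam HG002_run2.bam HG002_run3.bam
samtools index HG002.bam

# Quick sanity check of the resulting coverage
samtools coverage HG002.bam
```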

Also, these data were basecalled with Guppy 4.2.2. If you are not tied to this specific Guppy version, we suggest using the more accurate Guppy5 data here: we have basecalled HG002-HG005 with the Guppy5 hac and sup models. Hope it helps!