Hi, the Representation Unification module only consumes CPU resources. In our experience, RU takes around 20 GB per thread, and the memory consumption depends heavily on the base error level of the ONT data. Which Guppy version are you using for your ONT data?
You might try adding the options --partition_size=5 and --min_af=0.2 to step 5 of RU and check whether it runs to completion. These settings are for debugging only; to train a model for production, we don't suggest a high --min_af or a low --partition_size.
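For orientation, here is a minimal sketch of how the two debugging options could be passed to the step-5 command. The script name and the other arguments are placeholders I made up for illustration; only --partition_size and --min_af come from the advice above, and the 20 GB-per-thread figure from earlier in the thread is used to budget threads.

```bash
# Hypothetical sketch only: the script name and the other arguments are
# placeholders -- only --partition_size and --min_af come from the advice above.
# RU is CPU-only and needs roughly 20 GB of RAM per thread, so budget threads
# accordingly (e.g. 4 threads on a 128 GB machine leaves some headroom).
THREADS=4

python unify_representation.py \
    --bam_fn input.bam \
    --ref_fn reference.fasta \
    --threads "${THREADS}" \
    --partition_size=5 \
    --min_af=0.2
```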
I have been testing with other versions of Guppy and it works perfectly for me, thank you very much! But I have one doubt: in the Training section, the ONT HG002 data is split into parts 1, 2, 3... Does this mean that you download the different parts and merge them, or do you treat them independently?
That's great to hear it works!
Pileup training takes an input BAM and creates pileup tensors, processing the BAM in parallel across multiple coverages, chromosomes, and chunks. Full-alignment training takes the phased alignment as input; for efficiency, we split the input BAM into multiple phased BAMs (in the whatshap haplotag step), and for each chromosome we also process the data in multiple chunks, then create tensors and merge them into binaries for training.
So we use only one input BAM (including all contigs) per sample, and we split and process that BAM independently for efficiency.
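For intuition, a hedged sketch of the splitting idea described above, using whatshap haplotag to tag reads with their haplotype and samtools to pull out per-haplotype, per-chromosome BAMs. The file names and contig list are placeholders, and this is not the exact Clair3 training script, just the general shape of the step.

```bash
# Hedged sketch of the splitting described above, not the exact Clair3 training
# scripts. File names (phased.vcf.gz, aln.bam, ref.fa) are placeholders.

# 1. Tag each read with its haplotype using the phased VCF.
whatshap haplotag -o tagged.bam --reference ref.fa phased.vcf.gz aln.bam
samtools index tagged.bam

# 2. Split the tagged BAM into per-haplotype, per-chromosome BAMs so that
#    tensor creation can run on each piece independently
#    (the -d TAG:VALUE filter requires samtools >= 1.12).
for hp in 1 2; do
    for ctg in chr1 chr20; do    # placeholder contig list
        samtools view -b -d HP:${hp} tagged.bam "${ctg}" > "hp${hp}_${ctg}.bam"
    done
done

# Each small BAM is then turned into full-alignment tensors chunk by chunk,
# and the chunk binaries are merged for training, as described above.
```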
I am referring to the data before alignment. For example, for HG002 in the Training section, if I follow the link to download the data, a very long list of FASTQ files appears, and I do not understand whether these have been aligned independently to obtain a huge dataset, or whether the ones I underlined in yellow in the screenshot have been concatenated and then aligned...
Oh, these raw FASTQs are different GridION sequencing runs of HG002. You can either 1. merge these FASTQs into one large FASTQ file and then align it to get a high-coverage BAM file, or 2. align each FASTQ to get multiple low-coverage BAMs and then apply samtools merge to get the same high-coverage alignment as in option 1.
In this case, GM24385_x_Guppy_4.2.2_prom.fastq.gz (x from 1 to 3) will generate an HG002 WGS BAM file with ~85x coverage; if you align all the FASTQs at the link, the coverage will reach ~432x.
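As a concrete illustration of the two equivalent routes, here is a sketch for the three GM24385 runs mentioned above. The choice of minimap2/samtools and the reference file name are my own placeholders, not something prescribed in this thread.

```bash
# Hedged sketch of the two routes described above; minimap2/samtools are my
# tool choices, and ref.fa is a placeholder reference.

# Option 1: concatenate the runs first, then align once to get the ~85x BAM.
cat GM24385_1_Guppy_4.2.2_prom.fastq.gz \
    GM24385_2_Guppy_4.2.2_prom.fastq.gz \
    GM24385_3_Guppy_4.2.2_prom.fastq.gz > GM24385_merged.fastq.gz
minimap2 -ax map-ont ref.fa GM24385_merged.fastq.gz \
    | samtools sort -o HG002_merged.bam
samtools index HG002_merged.bam

# Option 2: align each run separately, then merge the sorted BAMs.
for i in 1 2 3; do
    minimap2 -ax map-ont ref.fa "GM24385_${i}_Guppy_4.2.2_prom.fastq.gz" \
        | samtools sort -o "run${i}.bam"
done
samtools merge -f HG002_merged.bam run1.bam run2.bam run3.bam
samtools index HG002_merged.bam
```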
Also, these data were basecalled with Guppy 4.2.2. If you are not tied to this specific Guppy version, we suggest using the more accurate Guppy5 data here; we have basecalled the HG002-HG005 data with the Guppy5 hac and sup models. Hope it helps!
Hello!
I am trying to use the RepresentationUnification module to get a unified VCF from an ONT BAM of about 100 GB. I've been trying to run the RepresentationUnification module on an RTX 2070 and a 1080 for several days but get nothing; the processes consume all the RAM and never finish. Any advice?