DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

Ratatosk correct with data from several individuals #53

Closed mbcarbonetto closed 1 week ago

mbcarbonetto commented 1 week ago

Hello! First of all thanks for the tool :)

I have been trying to run:

Ratatosk correct -v -c 60 -s fq_short.txt -l long_list.txt -o ratatosk1

on a machine with up to 3 TB of available RAM. It ran for about 21 days and then suddenly stopped.

The list of short Illumina read files comprises six forward (R1) files and six reverse (R2) files, summing up to 300 GB. These are gzip-compressed (.gz) files.

The PacBio long read files sum up to 150 GB (from 3 SMRT cells) for individual A and up to 112 GB (from 3 SMRT cells) for individual B. The short reads are from a third individual, C.
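For reference, -s and -l take either a single FASTA/FASTQ file or a plain-text list with one input file path per line, so the two lists can be generated with something like the following (the directory and file names are hypothetical placeholders):

ls /data/illumina/indC_*_R1.fastq.gz /data/illumina/indC_*_R2.fastq.gz > fq_short.txt
ls /data/pacbio/indA_cell*.fastq.gz /data/pacbio/indB_cell*.fastq.gz > long_list.txt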

Do you think this long runtime is due to the fact that the data comes from three different individuals? Do you recommend separating individuals A and B for correction? Moreover, is it correct to use the short reads of individual C for this kind of correction?

I copy the log here:

Ratatosk::Ratatosk(): Building graph from short reads (1/2).
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Finished
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 62157203100
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 14406171545
CompactedDBG::build(): Estimated number of k-mers occurring twice or more: 12894861533
CompactedDBG::build(): Estimated number of minimizers occurring twice or more: 3123228012
CompactedDBG::filter(): Processed 391256119509 k-mers in 5485644215 reads
CompactedDBG::filter(): Found 62096381885 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 88148469
CompactedDBG::construct(): Extract approximate unitigs (1/2)
CompactedDBG::construct(): Extract approximate unitigs (2/2)
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)
CompactedDBG::construct(): Before split: 741627113 unitigs
CompactedDBG::construct(): After split (1/2): 720078039 unitigs
CompactedDBG::construct(): After split (2/2): 737095402 unitigs
CompactedDBG::construct(): Unitigs split: 8492141
CompactedDBG::construct(): Unitigs deleted: 21593167

CompactedDBG::construct(): Joining unitigs
CompactedDBG::construct(): After join: 720645987 unitigs
CompactedDBG::construct(): Joined 16449415 unitigs

CompactedDBG::write(): Writing graph to disk

CompactedDBG::write(): Writing index file to disk
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Start computing k-mer cardinality estimations (1/2)
KmerStream::KmerStream(): Finished
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 9087447091
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 1808154960
CompactedDBG::filter(): Processed 36019170561 k-mers in 720645987 reads
CompactedDBG::filter(): Found 9067253792 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 62121222
CompactedDBG::construct(): Extract approximate unitigs (1/2)
CompactedDBG::construct(): Extract approximate unitigs (2/2)
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)
CompactedDBG::construct(): Before split: 776974343 unitigs
CompactedDBG::construct(): After split (1/1): 776976769 unitigs
CompactedDBG::construct(): Unitigs split: 2672424
CompactedDBG::construct(): Unitigs deleted: 0

CompactedDBG::construct(): Joining unitigs
CompactedDBG::construct(): After join: 759887597 unitigs
CompactedDBG::construct(): Joined 25381575 unitigs
Ratatosk::Ratatosk(): Adding colors and coverage to graph (1/2).
Ratatosk::addCoverage(): Anchoring reads on graph.

Any advice will be very much appreciated.

thanks!

GuillaumeHolley commented 1 week ago

Hi @mbcarbonetto,

Are you using a cluster/cloud architecture to run Ratatosk? If so, it is very likely that the job scheduler you use (Slurm or whatnot) has a default maximum run time for jobs and that this limit is 21 days. That is probably why the job suddenly stopped.
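If a scheduler is in the picture, the limit is easy to check (the commands below assume Slurm; the job ID is a hypothetical placeholder):

# elapsed time vs. the time limit of a finished job
sacct -j 123456 --format=JobID,State,Elapsed,Timelimit
# default/maximum wall time of the partitions
scontrol show partition | grep -E 'PartitionName|DefaultTime|MaxTime'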

When you say "individuals", are we talking about human individuals? From the graph construction steps in the log you posted, I see that the graph is estimated to contain about 13 billion 63-mers, whereas I would expect 2.5 to 3 billion 63-mers for a human individual. Even for 300 GB of gzipped files, 13 billion 63-mers seems too much. If this is supposed to be human data, I would recommend checking your Illumina input to make sure you don't have long reads mixed in with your short reads. If everything seems to be in order, try using only 2-3 pairs of R1-R2 files instead of the 6. Overall, I don't think Ratatosk takes advantage of coverage beyond 50-60x, while it will spend more time processing that extra coverage. If your input is not human, is the organism supposed to be about 12-13 Gbp in size?
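One quick way to check for stray long reads in the short-read input is to look at the read-length distribution of each FASTQ (the file name below is a hypothetical placeholder; Illumina reads should top out at a few hundred bp):

zcat indC_lane1_R1.fastq.gz | head -n 4000000 | \
  awk 'NR % 4 == 2 { n++; sum += length($0); if (length($0) > max) max = length($0) } END { printf "reads: %d  mean length: %.1f  max length: %d\n", n, sum/n, max }'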

Finally, you cannot use Ratatosk with the short reads of individual C to correct the long reads of individuals A and B. I mean, you can, but I expect Ratatosk to only output garbage then. As far as I know, no tool can do that. Basically, Ratatosk considers the k-mers of the short reads to be the ground truth for the correction. All k-mers that result from variants exclusive to A and/or B will not be found in C and will be considered errors. At best, Ratatosk will not correct these false errors and will leave them as is. At worst, Ratatosk will "correct" them with the next closest match in C. Either way, nothing good will come out of it.

Guillaume

mbcarbonetto commented 1 week ago

@GuillaumeHolley Thanks a lot for your super quick reply.

We do not have a cluster architecture, but maybe the sysadmin killed it anyway. I will ask.

About the individuals, sorry, my bad, this is not human data. It is a diploid arthropod, and in order to get enough DNA we used three individuals. We are expecting a genome size of about 6 Gbp (haploid).

If I understand correctly what you are saying, I will lose all the haplotype information if I use different individuals for the short and long reads.

Thanks for this feedback, it was very useful.

cheers,

Belen

GuillaumeHolley commented 1 week ago

About the individuals, sorry, my bad, this is not human data. It is a diploid arthropod, and in order to get enough DNA we used three individuals. We are expecting a genome size of about 6 Gbp (haploid).

I don't know much about arthropod genomes, but I assume they have a (much?) higher level of heterozygosity than humans. Combined with what seems to be high-coverage sequencing, that could explain the 13 billion k-mers in the graph.
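As a very rough back-of-the-envelope (all assumptions, not measurements): with a 6 Gbp haploid genome and k = 63, the two haplotypes of the short-read individual contribute roughly 6 billion distinct k-mers, each heterozygous site can add up to ~63 extra k-mers from the alternate allele, and at high coverage some sequencing errors occur twice or more and survive the solid-k-mer filter. Heterozygosity on the order of 1-2%, plus those error k-mers, could plausibly push the count from ~6 billion toward the 12-13 billion seen in the log.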

If I understand correctly what you are saying, I will lose all the haplotype information if I use different individuals for the short and long reads.

Exactly.

I think the matter is resolved so I'll be closing this issue. Don't hesitate to reach out if you wanna correct reads from the same individual some day :)

Guillaume