DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs

Corrected PromethION reads are extremely short #18

Closed: andreaswallberg closed this issue 3 years ago

andreaswallberg commented 3 years ago

Dear developers,

We have ~30x coverage of Nanopore PromethION data from a heterozygous invertebrate with a large genome (18 Gbp). We also have approximately 30-40x coverage of 10x Chromium Illumina linked short reads. We ran a round of error correction on a large compute node with >4 TB RAM and 168 cores. Our original Nanopore reads (84 million; 600 Gbp) have an N50 of about 12 kbp; 72% of them have Q>8 and 56% have Q>10.

It finished within a week with no reported errors, but the output is not what we expected.

Among the 50M output reads, totaling 4.7 Gbp, the N50 is only 85 bp and the longest read is just 2.2 kbp.

Something has obviously gone wrong here. We used neither the -t nor the -q flag in the command. An example from a trial run with 80 cores:

/usr/bin/time -v Ratatosk -v -c 80 $SHORT_READS $LONG_READS -o out_long_reads 2>&1 | tee run_ratatosk.sh.log

Here $SHORT_READS expands to multiple appended " -s " arguments and $LONG_READS to multiple " -l " arguments.
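For reference, the argument lists are built up roughly like this (a sketch only; the paths and file names are placeholders, not our actual data):

# printf repeats the format string once per file, producing " -s f1 -s f2 ..."
SHORT_READS=$(printf ' -s %s' data/chromium/*.fastq.gz)
LONG_READS=$(printf ' -l %s' data/nanopore/*.fastq.gz)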

The short reads are sorted such that read 2 always follows read 1. Because these are Chromium libraries, the 16 bp barcode has been trimmed from read 1 and moved into a BX tag in the read header.
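A pair of headers therefore looks something like this (hypothetical instrument name and barcode, shown only to illustrate the layout):

@A00621:111:HXXXXXX:1:1101:1000:1000 BX:Z:ACGTACGTACGTACGT-1
@A00621:111:HXXXXXX:1:1101:1000:1000 BX:Z:ACGTACGTACGTACGT-1

i.e. both mates carry the same name, with the barcode in the comment field.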

The long-read data is generally sorted such that shorter, lower-quality long reads are passed to the program first and longer, higher-quality reads later.

Any suggestions as to what might be going on with this data in Ratatosk?

Grateful for feedback!

GuillaumeHolley commented 3 years ago

Dear @andreaswallberg,

Thank you for opening this issue and providing all the details of your experiment.

At the moment I have no idea what happened; this has never been an issue before, but I am sure it is nothing we cannot fix. What is very troubling in this case is not only the extremely short N50 but also the fact that the output does not contain the same number of reads as the input. Since your input is 84 million reads sorted by increasing length/quality and you get only 50 million reads in the output, is it possible that the job was interrupted for some reason? I have never worked with Chromium libraries before, so it is unclear to me whether they might be an issue here. What is really important for Ratatosk is that short reads 1 and 2 from the same pair have the same name in the fasta/fastq file; otherwise (with different names), they will be considered two different single-end reads. I see that you have redirected all of Ratatosk's output to a log file; it would help a lot if I could have a look at it. Would that be possible?
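For example, a quick way to spot-check that mates share names (file names here are placeholders; if your headers carry /1 and /2 suffixes, strip those first):

# Print any pairs whose names differ; ideally this prints nothing.
paste <(zcat reads_1.fastq.gz | awk 'NR%4==1 {print $1}') \
      <(zcat reads_2.fastq.gz | awk 'NR%4==1 {print $1}') | awk '$1 != $2' | head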

Guillaume

andreaswallberg commented 3 years ago

Thanks for the rapid feedback!

We are collecting information and will make an update here as soon as possible.

May I ask you to update the code so that Ratatosk logs which input sequence files it opens and reads from? It appears that it has not actually read from all of the files.

/Andreas

percyfal commented 3 years ago

Hi @GuillaumeHolley ,

I'm working with @andreaswallberg and I ran some of the analyses he references. I have looked more closely at the inputs, and I conclude that the short-read counts are in fact correct, but the long-read count is 17 times what it should be. My command was the following:

Ratatosk -v -c 120 -s data/raw/chromium/SI-GA-E8_1.fastq.gz -s data/raw/chromium/SI-GA-E8_2.fastq.gz -s data/raw/chromium/SI-GA-E8_3.fastq.gz  [... many more -s inputs here ...] -l data/interim/sequences/longreads1.fastq.gz -l data/interim/sequences/longreads2.fastq.gz [... many more -l inputs here ...] -o out_long_reads 2>&1 > logs/ratatosk.log

and this is the log output:

Ratatosk::Ratatosk(): Building graph from short reads (1/2).
KmerStream::KmerStream(): Start computing k-mer cardinality estimations
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 110238555430
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 26094062674
CompactedDBG::build(): Estimated number of k-mers occurring twice or more: 33513578757
CompactedDBG::build(): Estimated number of minimizers occurring twice or more: 7917829568
CompactedDBG::filter(): Closed all fasta/fastq files
CompactedDBG::filter(): Processed 460536046198 k-mers in 5946197800 reads
CompactedDBG::filter(): Found 111533669536 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 229096730
CompactedDBG::construct(): Extract approximate unitigs
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)
CompactedDBG::construct(): Before split: 1922462370 unitigs
CompactedDBG::construct(): After split (1/2): 1648294502 unitigs
CompactedDBG::construct(): After split (2/2): 1699862435 unitigs
CompactedDBG::construct(): Unitigs split: 19738152
CompactedDBG::construct(): Unitigs deleted: 274410614

CompactedDBG::construct(): Joining unitigs
CompactedDBG::construct(): After join: 1454260572 unitigs
CompactedDBG::construct(): Joined 245601863 unitigs

CompactedDBG::write(): Writing graph to disk
KmerStream::KmerStream(): Start computing k-mer cardinality estimations
CompactedDBG::build(): Estimated number of k-mers occurring at least once: 18226978586
CompactedDBG::build(): Estimated number of minimizer occurring at least once: 3771964709
CompactedDBG::filter(): Closed all fasta/fastq files
CompactedDBG::filter(): Processed 80269177395 k-mers in 1454260572 reads
CompactedDBG::filter(): Found 18423112307 unique k-mers
CompactedDBG::filter(): Number of blocks in Bloom filter is 124598487
CompactedDBG::construct(): Extract approximate unitigs
CompactedDBG::construct(): Closed all input files

CompactedDBG::construct(): Splitting unitigs (1/2)

CompactedDBG::construct(): Splitting unitigs (2/2)

Are line breaks not allowed in the input sequence files (i.e. should each sequence be on one line only)?

Cheers,

Per

percyfal commented 3 years ago

Update: I checked the inputs and the sequences are on single lines, so line breaks do not seem to be the issue.
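A check along these lines is enough to verify this (a heuristic sketch; the path is one of the inputs from the command above):

# In a single-line FASTQ every record is exactly 4 lines, so every 4th line
# starting from line 1 must be a header. Quality strings can also begin with
# '@', but a wrapped file will trip this check almost immediately.
zcat data/interim/sequences/longreads1.fastq.gz | \
    awk 'NR % 4 == 1 && substr($0,1,1) != "@" {print "suspect record near line " NR; exit}'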

GuillaumeHolley commented 3 years ago

Hi @percyfal,

Thank you for your input. Your command line looks fine to me, and line breaks shouldn't be a problem either, as Ratatosk supports them. So you are saying the long-read count is 17x what it should be. Since @andreaswallberg said your input is about 84 million long reads, that means you get something like 1.4 billion long reads in output, right? Can you check whether the output contains duplicated reads (the same read occurring multiple times) or duplicated read names (the same name occurring multiple times, but not necessarily with the same sequence each time)?

Thanks. Guillaume

percyfal commented 3 years ago

The 17x I got from the log file, which states "Processed 80269177395 k-mers in 1454260572 reads" (1,454,260,572 / ~84 million ≈ 17). Does this log statement refer to input reads or output reads? I would have assumed the former.

I don't think the output contains duplicated reads:

head out_long_reads_sr.fasta | grep ">"
>0
>1
>2
>3
>4

vs

tail out_long_reads_sr.fasta | grep ">"
>1454260567
>1454260568
>1454260569
>1454260570
>1454260571

which would indicate unique, consecutive ids in the output. In any case, the relation to the input ids is lost, but maybe that is to be expected?
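For a more exhaustive check than head/tail, one could list any repeated ids directly (slow on a file this size, but definitive):

grep '^>' out_long_reads_sr.fasta | sort | uniq -d | head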

GuillaumeHolley commented 3 years ago

The statement Processed X k-mers in Y reads does not refer to the input long-read data, but I understand where the confusion comes from. The statement occurs twice in your log. The first time, it refers to the input short-read data used for the graph construction (in this count, one PE read = 2 "reads"). The second occurrence refers neither to the input short reads nor to the input long reads; rather, it is an internal graph cleaning step in which a de Bruijn graph with k=31 is built from the unitigs of the de Bruijn graph built with k=63.
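You can cross-check this in your own log: the read count in the second occurrence, 1454260572, is exactly the After join: 1454260572 unitigs count from the first pass, because the unitigs of the k=63 graph are the "reads" of the k=31 construction. For instance:

grep -E 'Processed|After join' logs/ratatosk.log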

GuillaumeHolley commented 3 years ago

Hi @percyfal and @andreaswallberg, any news on this issue?

andreaswallberg commented 3 years ago

Hi @GuillaumeHolley !

We are working on it, trying to figure out what is going on. Since smaller chunks worked well for me using the genome-guided approach on a different machine, we are cross-referencing commands and data sources to figure out whether there may actually be some hardware or architecture issue.

Please keep the issue open for the time being.

GuillaumeHolley commented 3 years ago

Thanks for the update.

GuillaumeHolley commented 3 years ago

Hey guys, I'm closing the issue for the moment. Let me know when you have an update on the matter and I will reopen it.