DecodeGenetics / Ratatosk

Hybrid error correction of long reads using colored de Bruijn graphs
BSD 2-Clause "Simplified" License

a question on command line #47

Closed shinlin77 closed 1 year ago

shinlin77 commented 1 year ago
  1. None of your examples used fastq read files 1 and 2. How should these be input? Like...

Ratatosk correct -v -c 4 -s read_file1_1.fq.gz -s read_file1_2.fq.gz -s read_file2_1.fq.gz -s read_file2_2.fq.gz -l LR.fq -o LR_corrected.fq

  2. Also, I've been running the program for two days now, and for the past day it has been hanging there using very little CPU (see the top output below). Is that typical behavior?
  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
32599 saling   20   0  102.5g 102.2g  1260 S   4.6 10.1   3635:54 Ratatosk
31845 saling   20   0  110.0g  97.7g  1876 S   3.3  9.7   5367:44 Ratatosk
GuillaumeHolley commented 1 year ago
  1. Just use -s for FASTQ read files 1 and 2, in any order (as in your example). Ratatosk automatically associates reads with the same FASTQ name as being from the same pair. This is why it is of the utmost importance that reads from the same pair have the same FASTA/FASTQ name (no /1 or /2 at the end of the FASTQ name; a quick check is sketched after this list).
  2. A few steps in Ratatosk use quite a bit of disk, and in these steps CPU usage might be low, as in your case, because of the IO wait. That doesn't mean the program is stuck, though. Overall running time depends on a few factors: Illumina coverage in input, number of threads given to the program, size of the genome, etc. If you could share the program log and the files created so far by the program, I could tell you more about how far along you are in the correction process.
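
For example, a quick sanity check of the pairing (a sketch using standard shell tools; the file names come from the example command above) is to compare the first header line of each mate file:

zcat read_file1_1.fq.gz | head -n 1
zcat read_file1_2.fq.gz | head -n 1

If the two printed names are not identical, Ratatosk will treat the reads as unpaired.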
shinlin77 commented 1 year ago

Thanks for responding!

  1. Names: will this work? There are 1 and 2 but not /1 or /2.

from file 1 @A00253:698:HFCC5DSX3:4:1101:10041:1000 1:N:0:TTGGTGCA+CCGTTATG

from file 2 @A00253:698:HFCC5DSX3:4:1101:10041:1000 2:N:0:TTGGTGCA+CCGTTATG

  2. I don't see a log file, nor do I see any files it is writing to disk. I did run it with -v. Where does the output go?
GuillaumeHolley commented 1 year ago
  1. Unfortunately no, this won't work because of the 1:N:0:TTGGTGCA+CCGTTATG vs 2:N:0:TTGGTGCA+CCGTTATG difference. Read names would have to be exactly the same to be considered as being in a pair. Right now, if this is what you started Ratatosk with, Ratatosk will consider those reads to be single-end (unpaired) reads, and the accuracy of the correction won't be nearly as good as if you had paired-end reads (see the renaming sketch after this list).
  2. My bad, there is no log file written by Ratatosk, only console/terminal output when using -v. If you still have that console/terminal output, it would be very helpful. The output files should be written at the path given to -o; if you haven't specified a path, they are written in the current working directory (the directory you were in at the moment you started Ratatosk).
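
One way to make mate names identical is to drop everything after the first whitespace in each header line (a sketch using standard tools; file_1/file_2 are placeholder names, and standard 4-line FASTQ records are assumed):

# strip the comment field from header lines, pass all other lines through
for f in file_1 file_2; do
  zcat "$f".fq.gz \
    | awk 'NR % 4 == 1 { print $1; next } { print }' \
    | gzip > "$f".renamed.fq.gz
done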
GuillaumeHolley commented 1 year ago

It would also be helpful to know the Illumina coverage of your reads and how many threads you are using.

shinlin77 commented 1 year ago

OK, lemme do some renaming. FYI, Illumina and ONT are at 30x each. Before stopping the current runs (because the reads were not optimally named), the output was the following. That was after about a day on 4 threads; there hasn't been any output in the day since.

Ratatosk::addCoverage(): Processed 617000000 reads.
Ratatosk::addCoverage(): Processed 618000000 reads.
Ratatosk::addCoverage(): Processed 619000000 reads.
Ratatosk::addCoverage(): Processed 620000000 reads.
Ratatosk::addCoverage(): Collecting branching information.

GuillaumeHolley commented 1 year ago

Given that you have 620M reads and you said 30x coverage, the genome you are trying to correct should be around 3.1 Gbp, so I assume a human genome, right? Even with 30x coverage you are looking at 1000 to 2000 CPU hours of running time. Say 2000 CPU hours using 4 threads: in the unrealistic best case where Ratatosk's code is 100% multi-threaded, you would be looking at 21 days of computation. 4 threads is fine if you're running microbial genomes, but for human genomes I would say the bare minimum is around 16 threads if you're using a single machine. Otherwise, you're looking at a month of computation.
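
(Spelling out the arithmetic, assuming ~150 bp Illumina reads: 620M reads x ~150 bp is ~93 Gbp of data, and 93 Gbp / 30x coverage gives a ~3.1 Gbp genome. Likewise, 2000 CPU hours / 4 threads is ~500 wall-clock hours, about 21 days, whereas 2000 / 16 is ~125 hours, about 5 days.)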

shinlin77 commented 1 year ago

Yes, human genome. Thanks for the tip on usage. Will using more threads decrease the amount of RAM needed per thread? Right now, it appears each thread requires ~100G of RAM.

GuillaumeHolley commented 1 year ago

The memory usage should be more or less stable regardless of the number of threads.

GuillaumeHolley commented 1 year ago

I'll close the issue, but don't hesitate to reopen if you encounter any issues with your new run.

shinlin77 commented 1 year ago

Let me ask, for a human genome run, do you have any idea how much RAM is required? I know it is above 115G.

GuillaumeHolley commented 1 year ago

Peak RAM is probably around 250 GB. If you do the correction in multiple steps, the peak occurs only in the first step (index construction 1); all the other steps are much lighter on RAM usage. I'll get you some real numbers I obtained recently as soon as I get to the office.

shinlin77 commented 1 year ago

Thanks. If it is not too much to ask, can you please break down peak RAM usage by 2-step versus 4-step, too?

GuillaumeHolley commented 1 year ago

Unfortunately, I realize now that I did not record how much RAM was used on my last Ratatosk run, only the CPU time, so I can only give you that along with the memory I requested for each job submitted to the cluster. The dataset was 42x stereo duplex ONT and 60x Illumina:

1 - Index construction 1: 1000 CPU hours, required 300 GB of RAM from SLURM
2 - Read correction 1: input reads to correct were split in batches of 250K reads; each batch was corrected in parallel on a different machine; mean 5 CPU hours and 100 GB required per batch
3 - Index construction 2: 300 CPU hours, required 200 GB of RAM from SLURM
4 - Read correction 2: input reads to correct were split in batches of 250K reads; each batch was corrected in parallel on a different machine; mean 6 CPU hours and 100 GB required per batch

Total: about 1850 CPU hours, max required RAM at 300 GB
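
For reference, a minimal sketch of a SLURM submission that also records peak RAM (assuming GNU time is installed at /usr/bin/time; the memory and thread numbers mirror the figures above, and file names are placeholders):

#!/bin/bash
#SBATCH --mem=300G
#SBATCH --cpus-per-task=16
# GNU time reports "Maximum resident set size", i.e. the run's peak RAM
/usr/bin/time -v Ratatosk correct -v -c 16 -s SR_1.fq.gz -s SR_2.fq.gz -l LR.fq -o out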

shinlin77 commented 1 year ago

this is very helpful. thanks!

shinlin77 commented 1 year ago

I have SR (Illumina) and LR (ONT) on the same WGS samples. I called the variants for them separately and saw some amount of concordance. I then changed my analysis pipeline to include Ratatosk, expecting much more concordance, but that is NOT what I found. I suspect that I may be running the program incorrectly.

I have renamed the reads in my second fastq file as you suggested, so that the paired sequences have the same name. The file name is now

SR_2.fq.gz -> SR_2.fq.gz~

Thereafter, I run

Ratatosk correct -v -c 16 -s SR_1.fq.gz -s SR_2.fq.gz~ -l LR.fq -o /mydirectory/output

Do you see anything wrong with this? Is the new file suffix gz~ throwing it off?

Thanks.

GuillaumeHolley commented 1 year ago

Hi @shinlin77,

I am not sure what you mean by you have "renamed" your second FASTQ file.

  1. Names: will this work? There are 1 and 2 but not /1 or /2.

from file 1 @A00253:698:HFCC5DSX3:4:1101:10041:1000 1:N:0:TTGGTGCA+CCGTTATG

from file 2 @A00253:698:HFCC5DSX3:4:1101:10041:1000 2:N:0:TTGGTGCA+CCGTTATG

The problem was not with the file naming but with the fact that reads from the same pair had different names across the 2 files. Can you tell me what names reads from the same pair have in SR_1.fq.gz and SR_2.fq.gz~? In general, I am fairly certain that the gz~ suffix is not an issue for Ratatosk, as it automatically recognizes compressed vs plain text files without parsing the file extension.
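
A quick way to verify (a sketch using standard shell tools; it compares the first 1000 header lines of both files, and no output means they match exactly):

diff <(zcat SR_1.fq.gz  | awk 'NR % 4 == 1' | head -n 1000) \
     <(zcat SR_2.fq.gz~ | awk 'NR % 4 == 1' | head -n 1000)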

Now, you were running human data if I remember correctly, which is exactly the type of genome Ratatosk was designed for and tested on. I have always gotten better variant calling results after correction with Ratatosk, but you're saying that you are getting less concordance after correction. Could you tell me more about this (concordance before vs. concordance after) and about the type of variant calling pipeline used?

Guillaume

shinlin77 commented 1 year ago
  1. SR_2.fq.gz used to have a "2" in the sequence names. I have switched those to "1".

SR_2.fq.gz~
@A01426:334:HJF2FDSX3:3:1101:1506:1000 1:N:0:AGTGACCT+CGACCTAA
@A01426:334:HJF2FDSX3:3:1101:1579:1000 1:N:0:AGTGACCT+CGACCTAA
@A01426:334:HJF2FDSX3:3:1101:1705:1000 1:N:0:AGTGACCT+CGACCTAA

SR_1.fq.gz
@A01426:334:HJF2FDSX3:3:1101:1506:1000 1:N:0:AGTGACCT+CGACCTAA
@A01426:334:HJF2FDSX3:3:1101:1579:1000 1:N:0:AGTGACCT+CGACCTAA
@A01426:334:HJF2FDSX3:3:1101:1705:1000 1:N:0:AGTGACCT+CGACCTAA

  2. Yes, I'm running on human data. Originally, I did

a. SR: bowtie2 -> picard MarkDuplicates -> GATK
b. LR: minimap2 -> clair3 -> GATK (to put all the gvcfs into one file)

Concordance was measured by whether both pipelines called a variant at a site and, if so, what the calls were for each individual. The concordance was not bad. Thereafter I did

c. SR+LR: Ratatosk -> minimap2 -> clair3 -> GATK (to put all the gvcfs into one file)

Concordance was then assessed between a & c. It was much poorer.
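
For reference, one standard way to compute this kind of site-level concordance is bcftools isec (a generic sketch, not necessarily the method used here; a.vcf.gz and c.vcf.gz stand for the callsets from pipelines a and c, and both must be bgzipped and indexed):

bcftools isec -p isec_out a.vcf.gz c.vcf.gz
# isec_out/0000.vcf and 0001.vcf hold records private to a and c respectively;
# 0002.vcf and 0003.vcf hold the records shared by both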

GuillaumeHolley commented 1 year ago

Yeah, the variant calling pipeline seems very reasonable, and your method of checking concordance is a good first indicator. I am still a little bit puzzled by the GATK part of the pipeline: do you have multiple individuals (hence multiple GVCFs), or do you call each flowcell separately? In general, I am still surprised that concordance is much poorer after correction. I would not necessarily expect a big jump in SNP concordance, but ONT is notoriously bad at indels and Ratatosk has always greatly improved those.

If you would be willing to share a bit of your data, I could look at it and see what went wrong. If you can, for one individual, send over BAM files of the SR, LR, and corrected LR for a 1Mbp region of your choice where there is some discordance. VCF calls for that same region would be a plus.

shinlin77 commented 1 year ago

That would be super valuable for my research. This is the first time I have tried to transfer data after conversing on GitHub. What's the best way to get you this data?

Regarding GATK: for SR, GATK makes the calls for all individuals separately, and then I use it again to aggregate all the resulting individual gvcfs. For LR, GATK is used only to aggregate all the gvcfs produced by clair3.

GuillaumeHolley commented 1 year ago

No worries.

If the files are small enough to fit in a couple of emails at most, you can email them to me. As I said, pick one individual and one region of 1Mbp max where you find the discordant calls. For that region and that individual, send me 3 BAM files: SR, LR, and corrected LR. The VCF files for that region would be a plus. If the files are too big, I can arrange somewhere safe where you can upload them, but this will take more time.

GuillaumeHolley commented 1 year ago

Hi @shinlin77,

Sorry for the delay. I looked at all the files you sent me (a 1Mbp region of chr5). The good news is that the region is not particularly "tricky", as it is almost entirely contained in the HG002 high-confidence regions.

The first thing I did was look at all the reads manually in IGV, and everything looked okay to me: the corrected reads did not seem to contain the random errors of the raw ONT (most of the time) but still had the SNPs with the correct genotype. Indels are more often than not left uncorrected.

Looking into your GVCF files was tedious because they do not contain any header, so I couldn't use them with bcftools. I decided to redo the calling for these files with tools that use DeepVariant as a backbone and can produce GVCF files that can be merged with GLNexus.
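
For reference, the GLNexus merge step typically looks something like this (a sketch based on GLNexus' documented DeepVariant preset; sample names are placeholders):

glnexus_cli --config DeepVariant sample1.g.vcf.gz sample2.g.vcf.gz > merged.bcf
bcftools view merged.bcf -Oz -o merged.vcf.gz   # convert the BCF to compressed VCF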

I then used RTG tools to compare LR and cLR to the Illumina calls (which is obviously biased, since even the Illumina calls contain FPs and FNs, but good enough for our purpose). Here are the main takeaways:

Take those numbers with a grain of salt: this is based on using an Illumina callset as the truth set and on a fairly small number of variants. That being said, I have a new version of Ratatosk which should perform better than the one you used. Let me know if you want to give it a shot.
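
For reference, such a comparison with RTG vcfeval looks roughly like this (a sketch; file names are placeholders, ref.sdf is the reference pre-formatted with "rtg format", and the Illumina calls serve as the baseline):

rtg vcfeval -b illumina.vcf.gz -c longread.vcf.gz -t ref.sdf -o vcfeval_out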

GuillaumeHolley commented 1 year ago

I actually realized that there were many FP calls in the long reads, which was suspicious. It comes from the fact that I ran RTG tools without providing a region of interest: many FP calls were actually calls that started before or after the 1Mbp region of interest (because the long reads extend that region a lot more than the short reads do).

Restricting the calls to the 1Mbp region of interest, the takeaways above still hold (but with different numbers).
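
With vcfeval, that restriction can be applied with something like the following (a sketch, assuming a BED file covering the 1Mbp window):

rtg vcfeval -b illumina.vcf.gz -c longread.vcf.gz -t ref.sdf \
    -o vcfeval_out_roi --evaluation-regions roi.bed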

GuillaumeHolley commented 1 year ago

I went over the last remaining FP calls in the long reads, both corrected and uncorrected. Many of these sites are called in the SR set as reference calls with a missing ("./.") genotype, most likely because their VAF is off (~25-30%). So I think many of the remaining FP calls in the long reads are actually TPs that are not called properly in the Illumina set.

GuillaumeHolley commented 1 year ago

Finally, to give you a rough idea, I used the short reads you sent me to correct the long reads you sent me. I did this with both the version of Ratatosk on the repo (current) and the new version I have (upcoming). Then I redid the variant calling and the comparison to the Illumina set:

The upcoming F1 is 93.12% and the current one is 92.08%, so upcoming improves the F1 by about 1 point compared to current (far fewer FNs). I expect this improvement to be greater for a whole-genome correction than for a local correction like in this example.

shinlin77 commented 1 year ago

First of all, thank you so much for looking at this problem so thoroughly. I do have one consideration. You state, "Many of these sites are called in the SR set as reference calls with a missing ("./.") genotype, most likely because their VAF is off (~25-30%). So I think many of the remaining FP calls in the long reads are actually TPs that are not called properly in the Illumina set." When I compare IGV plots between SR and LR (this is before correction), I see...

[image: IGV screenshot]

These are three individuals: LR is the top 3 tracks, and SR is the lower 3. For the LR, these three individuals are being called hets. I feel those are spurious and are actually due to LR skipping during sequencing. From my visual review of the discrepancies between SR and LR, a lot of them look like this. I was hoping Ratatosk would be able to correct these.

GuillaumeHolley commented 1 year ago

On one hand, ONT R9.4 is not very good for indels, especially in homopolymers like here, so it could very well be an error. On the other hand, I imagine those 3 samples are a trio? Because if they are, given that all 3 of them show a clear heterozygous deletion of 2bp, there is some support for believing that the deletion is real. Given that this is a high-density SNP region, it is possible that something else is going on in that region which prevents the deletion from being represented in the short reads at this locus. In the long reads, can you see an insertion close to that locus? Is the coverage of the short reads "normal" there? Is the MAPQ of the short reads 60 in that region?
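
Those last checks can be scripted with samtools (a sketch; REGION and the BAM name are placeholders, and the BAM must be indexed):

# mean short-read coverage over the region, e.g. REGION = chr5:start-end
samtools depth -r REGION SR.bam | awk '{ s += $3 } END { print s / NR }'
# MAPQ distribution of the short reads in the region (SAM field 5 is MAPQ)
samtools view SR.bam REGION | awk '{ print $5 }' | sort -n | uniq -c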

shinlin77 commented 1 year ago

These are unrelated individuals. That being said, I agree with you that the region I showed is a high-density SNP region. Nevertheless, there are other regions which are fairly bland except for a homopolymer, where the LR sequences lead to a SNP call but the SR does not.

[image: IGV screenshot]

Indeed, I have 20-some unrelated individuals. Just in the region of chr5:119573311 +/- 1Mb (so double the size of the window I sent you), there are 65 positions where the SR and LR calls are discrepant in ~50% or more of individuals (i.e., more than 10 out of 20 individuals have discrepant calls between SR and LR). When I look in IGV, they are invariably at homopolymer sites, where I would tend to think the LR is making the mistake.