RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
59 stars 8 forks source link

Arioc underestimating CpG methylation #32

Closed jpadesousa closed 1 year ago

jpadesousa commented 1 year ago

Hello,

I have been testing Arioc for a few days and I can see some clear speed advantages compared to Bowtie2. Also, by tweaking the parameters, I can get a higher percentage of mapped reads with Arioc. However, my main motivation to use Arioc is to use it in bisulfite sequencing alignment since those can take days with Bismark depending on the number of samples and sequencing depth.

I compared Arioc and Bismark using the same mock BS-Seq data created using the package Sherman: Sherman --length 150 --number_of_seqs 100000 --genome_folder genome --quality 40 --minfrag 200 --maxfrag 500 --CG_conversion 80 --CH_conversion 2 --error_rate 1 --truth_set --colorspace --paired_end

I iterated through multiple Arioc parameters: batchSize maxJ (reads) maxJ (reference) seedDepth AtO seed (hsi25_0_30_CT vs hsi25_0_31_CT) maxBQS Vt

But I could never get the same results as Bismark after Bismark methylation extractor with minimum alignment score as default or --local: 79.7% CpG methylation 2.3% Non-CpG methylation

The best I got with Arioc was: 78.3 to 79% CpG methylation 2.2 to 2.4% Non-CpG methylation Mostly by increasing maxBQS, which reduces the number of analysed Cs ending up with less analysed Cs than Bismark, and setting At0 to 6.

For the remaining settings I kept: batchSize=64k maxJ=16 (reads) maxJ=16 (reference) seedDepth=6 Vt="L,-32,2" seed = hsi25_0_31_CT & ssi84_2_31_CT

Changing these setting had little effect on the CpG methylation percentage.

I don't know what can I do to improve these results. I haven't tried setting different error rates in Sherman which could change the results. Still, I was expecting that Arioc would perform better than Bismark in assigning methylated CpGs. I really would like to implement this package into our pipelines but it needs to be comparable to Bismark. Is there a way to improve the CpG methylation results in Arioc?

RWilton commented 1 year ago

I think that by systematically examining Arioc's behavior with simulated reads we can figure out what's going on here. In my experience, differences between Arioc and Bismark wrt methylation context calls have several possible explanations:

1) The same read sequence might map differently with Arioc versus Bismark. This is not easy to pin down, but I am going to guess that Arioc could search a lot more than 16 potential mapping locations (maxJ="16") without a significant decrease in speed. This won't matter with easy-to-map reads with high alignment scores, but for reads that contain indels/mismatches, Arioc is more likely to find a higher-scoring mapping if you let it explore as many potential mapping locations as Bismark. So I would suggest bumping that constraint up to 64 or more. (GPU memory can become a limiting factor in this case, so you might have to decrease the batch size accordingly. Arioc will let you know if it runs out of GPU memory.)

There may be other things you could do as well. I might see something if you're willing to post Arioc's output log.

2) Arioc calls methylation contexts with regard to the read sequence; Bismark always uses the read sequence. This matters in 3 ways. First, Arioc calls "unknown" methylation contexts for the last bases in a read where Bismark doesn't. Also, where there is true variation in a read sequence, Arioc's methylation context calls reflect that whereas Bismark's remain unchanged. Third, you see the same effect with basecall errors. (The purpose of the maxBQS filter is to deal with this problem.)

I doubt that this explains what's happening here -- it all depends on the basecall error rate, but I'd be surprised if this caused more than a 1-2% difference in CpG call rate. But you could measure it just by filtering on the last couple of positions in the XM field in the SAM output.

3) Please be certain that you are examining only primary mappings with MAPQ > 3 [Note: incorrectly stated as "MAPQ<=3" in the original version of this comment] with both aligners.

I'm very curious what you discover, and I will be happy to help you get to the bottom of this.

jpadesousa commented 1 year ago

Dear @RWilton,

Thank you so much for the response. I contacted Felix Krueger regarding Sherman to make sure I am simulating reads correctly. After that, I will do a systematic comparison between Arioc and Bismark and provide a more detailed information about the parameters, results, output log, and hardware/software specs I am using.

So far, I had no issues with GPU memory limits, even after increasing maxJ and batchSize. But this is likely because I am using small files. I will increase those parameters in my testing.

I will get back to you today or tomorrow with the results.

jpadesousa commented 1 year ago

Dear @RWilton,

Sorry for the late reply. I couldn't find time to finish the testing last week.

Bismark simulated reads For this test, I had to correct the command used to run Sherman. For some reason, I was thinking conversion as methylation. The corrected command is: Sherman --length 150 --number_of_seqs 1000000 --genome_folder genome --quality 40 --minfrag 200 --maxfrag 500 --CG_conversion 20 --CH_conversion 80 --error_rate 0.1 --truth_set --paired_end

I generated 3 sets of simulated data by changing the --error_rate value with 0.1%, 1%, and 5%. With this command, the expected CpG methylation is 80% and non-CpG methylation is 2%.

Machine specifications

Ubuntu 18.04.6 LTS (Bionic Beaver)

# CPU
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
Stepping:            4
CPU MHz:             3604.218
CPU max MHz:         4500,0000
CPU min MHz:         1200,0000
BogoMIPS:            6599.98
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K
NUMA node0 CPU(s):   0-19

# Memory
62.5Gb ram memory

# GPU
2 NVIDIA GeForce RTX 2080
NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4

When installing Arioc, I used make AriocP CUDA_CC=75 to match the compute capability of the GPUs.



I used the following scripts:

Reference genome encoding nongapped_CT

As reference genome, I only used chromosome 1 from Homo sapiens GRCh38 because of memory limitations. I downloaded the FASTA file from https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna.

<?xml version="1.0" encoding="utf-8"?>

Homo_sapiens.GRCh38.dna.chromosome.1.fa /home/local_admin/WGBS_test/benchmarking/arioc/reference_genome/enc_maxJ_64

Reference genome encoding gapped_CT <?xml version="1.0" encoding="utf-8"?>

Homo_sapiens.GRCh38.dna.chromosome.1.fa /home/local_admin/WGBS_test/benchmarking/arioc/reference_genome/enc_maxJ_64



Samples encoding <?xml version="1.0" encoding="utf-8"?>

simulated_1.fastq simulated_2.fastq ./enc/1M_01_err_rate



Alignment _(alignmentpaired.cfg) <?xml version="1.0" encoding="utf-8"?>

/home/local_admin/WGBS_test/benchmarking/arioc/reference_genome/enc_maxJ_64 simulated_1 simulated_2 /home/local_admin/WGBS_test/benchmarking/arioc/alignment/1M_01_err_rate/xmBQS/15 /home/local_admin/WGBS_test/benchmarking/arioc/alignment/1M_01_err_rate/xmBQS/15 /home/local_admin/WGBS_test/benchmarking/arioc/alignment/1M_01_err_rate/xmBQS/15 /home/local_admin/WGBS_test/benchmarking/arioc/alignment/1M_01_err_rate/xmBQS/15



In the Alignment script, I changed 2 parameters: xmBQS and Vt. I iterated xmBQS between 10, 15, 20, 25, 30, 35. And Vt I used "L,-32,2" for error rate 0.1% and 1% (I tested different values and this gave me the best results). For error rate 5%, I used Vt="G,20,8" (local alignment minimal score used by Bowtie2) because with "L,-32,2" the percentage of aligned reads would go down to ~57%.

Then, I also iterated through ignoring a number of nucleotides at the 3' end in bismark_methylation_extractor by using the flags: --ignore_3prime and --ignore_3prime_r2. The values for those flags where kept equal between them throughout the iteration.



In summary:

Error rate: 0.1%, 1%, 5% xmBQS: 10, 15, 20, 25, 30, 35 --ignore_3prime and --ignore_3prime_r2: 10, 20, 30, 40, 50

Then, I also used samtools view with the flags -bS -F 4 -F 8 -F 256

-F 4 removes unmapped reads
-F 8 removes unmapped mates
-F 256 selects only primary alignments

Arioc alignment command

AriocP alignment_paired.cfg

samtools view --threads 20 -bS -F 4 -F 8 -F 256 simulated.c.001.sam > simulated.c.001.bam

bismark_methylation_extractor --parallel 20 -p simulated.c.001.bam

Bismark alignment command Always with --local alignment

bismark --genome_folder ../../reference_genome -1 ../../samples/bs_simulated_reads/1M_01_err_rate/simulated_1.fastq -2 ../../samples/bs_simulated_reads/1M_01_err_rate/simulated_2.fastq -I 200 -X 500 --phred33-quals --unmapped --ambiguous --local --no_dovetail --sam

samtools view --threads 20 -bS -F 4 -F 8 -F 256 simulated.c.001.sam > simulated.c.001.bam

bismark_methylation_extractor --parallel 20 -p simulated.c.001.bam



Results: 0.1% error rate image

1% error rate image

5% error rate image

Conclusion:

For 0.1% error rate

For 1% error rate

For 5% error rate



I hope this helps to narrow down how to improve Arioc for bisulfite sequencing. I can also share the Arioc's output reports, if necessary. It's important to note that I didn't compare the methylated and unmethylated C positions and their sequence context. This would be the correct comparison.

RWilton commented 1 year ago

First, thank you for taking the time to systematically explore the performance difference between Arioc and Bismark in terms of identifying cytosine bases mapped in a CpG context.

I am afraid, however, that this analysis is not entirely on point.

1] Arioc and Bismark use the same logic to label methylation context, so any differences between them in this regard pertain to differences that typically account for fewer than 1% of the methylation context calls in WGBS reads.

The fact that Arioc uses read sequence to determine methylation context and Bismark uses reference sequence to do this can account for performance differences, but again, these differences are limited to a) the last 2 bases in a read sequence and b) the two bases following each cytosine call. In practice, the effect of case (a) is negligible, and easy to handle by ignoring the last 2 bases of each XM field after alignment (and obviously NOT by trimming before that) . We have only seen case (b) in "noisy" reads with a high error rate; such reads are more likely to have Cs followed by one or two bases that differ from the reference.

Which brings us to...

2] When you do have reads with a high error rate, erroneously-called bases ought to have low base quality scores, so excluding such bases from methylation context analysis ought to improve the overall accuracy of methylation context calls. In practice, you would use a tool like FastQC to identify this situation and give you an idea of where to set xmBQS. You might also notice this if you use M-bias plots to analyze your results.

This is the only use case for xmBQS in Arioc. To answer the question of how to tune Arioc so it delivers identical CpG values to Bismark, however, I don't see how experimenting with different values for xmBQS can reveal anything useful in terms of performance (especially with simulated data).

It is interesting that varying xmBQS seems to affect the total number of CpG calls only. This may tell us something about how Sherman works, but it's still not saying anything about what Arioc is doing.

3] Similarly, I do not see how excluding the rightmost 40-50 bases from the methylation context analysis can provide useful aligner-performance information. Again, this may tell us something about how Sherman distributes methylated cytosines in simulated read sequences, but not about the behavior of read aligners in identifying methylation contexts.

4] My instinct is that the ~0.5% difference you see between the aligners is attributable to their read-mapping behavior. I would focus on optimizing this before I worried further about downstream analysis:

I am on vacation until August 25, but I will be happy to dig more deeply into this with you -- including reproducing your work on my end if that will help -- once I get back into the "usual routine."

And thank you again for your efforts to dig into this. The goal at my end is to improve the Arioc software, so taking a close look at its behavior is always worthwhile.

jpadesousa commented 1 year ago

Hi @RWilton,

Sorry, I just noticed that the scripts were not fully copied in my reply. Still, I think you got all the information necessary.

Thank you so much for taking some time to explain how Arioc works and the differences compared to Bismark during your holidays. My goal is to implement the package into our analysis and to provide it as a tool for the other labs using our cluster. That's why I want to make sure that we trust the results.

Since I have the simulated reads and scripts ready, I will proceed with some more tests as you suggested:

I can also increase the Hash bits value to 31. From the Arioc's guide you wrote:

If enough memory is available, "wider" hash values and correspondingly larger hash tables may provide up to 5% additional speed and about 0.5% additional sensitivity

At 32, the machine I am using runs out of memory.

hsi25_0_30_CT to hsi25_0_31_CT ssi84_2_30_CT to ssi84_2_31_CT

At the end, I want to improve upon the current aligners' speed while maintaining or improving accuracy. For the moment, I will try to obtain the best alignment possible disregarding speed. Then we can focus on improving speed with minimal loss of accuracy.

RWilton commented 1 year ago

Just one further comment.

Simulated reads are useful primarily for verifying that a read aligner is doing the right thing in terms of read mapping, MAPQ computation, and so on. But there are limits to what you can accomplish by tuning aligner performance to simulated read data.

Real-world sequencer reads have characteristics (paired-end overlap, trimming artifacts, position-related basecall error rates, and probably a dozen other things I'm forgetting) that are tricky to model in a read simulator. Ultimately, to feel confident that you are using the best alignment tool for your data and that the tool is optimally configured, you need to do a bit of performance analysis with that data (at least, a small but representative subset of that data).

At some point, I would accept that the behavior of Arioc and Bismark with simulated reads is "close enough" in terms of performance, and see what you get when you use your actual sequencer reads.