Open Vunbelievable opened 2 months ago
I wonder if this simply has to do with the level of sequencing errors you are introducing? if you leave out the -e
option, do you still see positions that are not C or G?
I'm sorry for the late reply. I set -e to 0, but I can still see some positions are A or T. And I'm certain that I used the same reference genome. For example: chr17 88882535 chr17 88882537 chr17 88882540
I've got a feeling that this is gonna be a tricky one, as the conversion itself is definitely happening for bases involving C only:
if ($base eq 'C'){
++$total_C_count;
### converting each C with an individual conversion rate (set globally)
my $random = int(rand(10000)+1)/100;
if ($random <= $conversion_rate){
++$converted_C_count;
$base = 'T';
if ($truth_set){
...
Just checking: do you also specify the parameter --snps
by any chance? And would you be able to extract the one sequence entry from the FastQ file above? thanks
I used the default value for the --snps
.
Here is an R1 and its corresponding R2 from the FASTQ files.
@1_chr17:88882501-88882780_R1
CTCCATTTATCTTTAACTTTTAGCAGGATTGACTATCTTTGCTTTTTTACCAGATTCTGTTCCTATACCTGCTGTGTAAGGCCTTCTCGACTTTTTGTAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@1_chr17:88882501-88882780_R2
TAACAACTGTCTTTTGTTTTCCTTTAACTCAAAATAAAAAAAATCATCTATATTAACAACCACCTTAATTTAAAATCACACTCCATTTCATGTTAATGAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Thanks for the read, I will investigate. The good is that the read aligns to the correct position, and the CpG and non-CpG methylation calls are all fine - it really might only affect the truth-set coordinates. I'll keep you posted.
I've got some good and bad news on this issue.
The good news: I have found the culprit: for paired-end reads the truth-coordinates are generated and written out at the conversion stage - at which point the coordinates are correct. However, for standard libraries the sequence of read 2 gets reverse-complemented afterwards, and for non-directional libraries R1 and R2 may get swapped and one of the reads gets reverse-complemented. In short, the coordinates in the truth set are correct for one read, but not the other.
The bad news: I ran out of time before my annual leave to add a fix for the situation at hand, so I am afraid it'll take a while. In the meatime, I would suggest not using the truth-set coordinates as a true positive control for paired-end reads. Single-end reads should be unaffected.
Thanks for the reply. I will try to use the single-end mode!
Hi Felix, I used Sherman dev with the following command line:
And I got 'positional_changes.txt' containing positions that underwent bisulfite conversion. Like: chr2 31674575 chr2 31674612 chr2 31674631 chr9 89627170 chr9 89627176 But when I check the positions in IGV, I found that some of them are A or T in the reference sequence, not C or G.
It seems that the actual conversion sites are not very accurate. Do you have any suggestions? Thanks in advance