FelixKrueger / Sherman

A simple Bisulfite FastQ Read Simulator (BiQRS)
GNU General Public License v3.0
11 stars 6 forks source link

True set of changed positions #15

Open Vunbelievable opened 2 months ago

Vunbelievable commented 2 months ago

Hi Felix, I used Sherman dev with the following command line:

${Sherman}/Sherman --length 100 --number_of_seqs 10000000 --genome_folder ${ws}/SimulateData/ref --paired_end --conversion_rate ${cr} --error_rate 0.2 --non_directional --truth_set

And I got 'positional_changes.txt' containing positions that underwent bisulfite conversion. Like: chr2 31674575 chr2 31674612 chr2 31674631 chr9 89627170 chr9 89627176 But when I check the positions in IGV, I found that some of them are A or T in the reference sequence, not C or G.

image

It seems that the actual conversion sites are not very accurate. Do you have any suggestions? Thanks in advance

FelixKrueger commented 2 months ago

I wonder if this simply has to do with the level of sequencing errors you are introducing? if you leave out the -e option, do you still see positions that are not C or G?

Vunbelievable commented 2 months ago

I'm sorry for the late reply. I set -e to 0, but I can still see some positions are A or T. And I'm certain that I used the same reference genome. For example: chr17 88882535 chr17 88882537 chr17 88882540

image
FelixKrueger commented 2 months ago

I've got a feeling that this is gonna be a tricky one, as the conversion itself is definitely happening for bases involving C only:

    if ($base eq 'C'){
      ++$total_C_count;
      ### converting each C with an individual conversion rate (set globally)
      my $random = int(rand(10000)+1)/100;

      if ($random <= $conversion_rate){
          ++$converted_C_count;
          $base = 'T';

          if ($truth_set){
    ...

Just checking: do you also specify the parameter --snps by any chance? And would you be able to extract the one sequence entry from the FastQ file above? thanks

Vunbelievable commented 2 months ago

I used the default value for the --snps. Here is an R1 and its corresponding R2 from the FASTQ files.

@1_chr17:88882501-88882780_R1
CTCCATTTATCTTTAACTTTTAGCAGGATTGACTATCTTTGCTTTTTTACCAGATTCTGTTCCTATACCTGCTGTGTAAGGCCTTCTCGACTTTTTGTAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@1_chr17:88882501-88882780_R2
TAACAACTGTCTTTTGTTTTCCTTTAACTCAAAATAAAAAAAATCATCTATATTAACAACCACCTTAATTTAAAATCACACTCCATTTCATGTTAATGAC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
FelixKrueger commented 2 months ago

Thanks for the read, I will investigate. The good is that the read aligns to the correct position, and the CpG and non-CpG methylation calls are all fine - it really might only affect the truth-set coordinates. I'll keep you posted.

Screenshot 2024-08-08 at 11 47 29
FelixKrueger commented 2 months ago

I've got some good and bad news on this issue.

The good news: I have found the culprit: for paired-end reads the truth-coordinates are generated and written out at the conversion stage - at which point the coordinates are correct. However, for standard libraries the sequence of read 2 gets reverse-complemented afterwards, and for non-directional libraries R1 and R2 may get swapped and one of the reads gets reverse-complemented. In short, the coordinates in the truth set are correct for one read, but not the other.

The bad news: I ran out of time before my annual leave to add a fix for the situation at hand, so I am afraid it'll take a while. In the meatime, I would suggest not using the truth-set coordinates as a true positive control for paired-end reads. Single-end reads should be unaffected.

Vunbelievable commented 2 months ago

Thanks for the reply. I will try to use the single-end mode!