lss1227 commented 7 years ago

Hi, I've been using SNPsplit recently with our monkey data. The input is the output of bismark (v 0.18.0,default parameters). SNP information is generated by genome resequencing. And this is part of my allele-tagging report. SNPsplit seems working well. What I am confused about is the bold text. Why are there so many reads not containing one of the expected bases at known SNP positions (Well I think it's a little much )? Is it because of mismatches allowed by bismark? I also wonder if there are recommended parameters for bismark when the output is used for phasing ? Thank you very much for comments.

Allele-tagging report

Processed 366283612 read alignments in total Reads were unaligned and hence skipped: 0 (0.00%) 288619297 reads were unassignable (78.80%) 35559649 reads were specific for genome 1 (9.71%) 34895376 reads were specific for genome 2 (9.53%) 30805929 reads did not contain one of the expected bases at known SNP positions (8.41%) 7209290 contained conflicting allele-specific SNPs (1.97%)

lss1227 commented 7 years ago

And when I use bismark v0.18.0, there are too many "This read pair has a worse alignment score than the best alignment so far and will be ignored even though it is ambiguous in itself" printing on screen. I took a glance at the script, I think it should be annotation ?

FelixKrueger commented 7 years ago

To start with the second question first, I accidentally left a warning message in there which I have corrected in the v0.18.1 release (https://github.com/FelixKrueger/Bismark/releases). Just ignore them for the time being.

FelixKrueger commented 7 years ago

Maybe first of all thanks for your feedback, your project sounds quite exciting! I am assuming you performed genome re-sequencing, called SNPs on this data and then used the VCF file with the SNPsplit genome preparation? Did you get a chance to also sequence the maternal and paternal genomes as well in order to consistently phase the genomes? I would assume that unphased SNP data would also work in the SNPsplit genome preparation process and you should be able to get allele-specific methylation information for reads covering individual SNPs, but you would not be able to assign this information consistently to the maternal or paternal alleles.

I just had a look at the code which is responsible for assigning reads to the did not contain expected bases counter, and found that this only occurs if a read is overlapping known SNPs but returns as neither genome 1 nor genome 2 (both have a count of 0). As such these reads are a subset of the Unassignable fraction of reads. If SNP positions involve a C in either the reference or the alternative locus the position is skipped for the scoring entirely.

Therefore I would assume that the reads do indeed have one of the two remaining (neither reference nor SNP) bases, or potentially Ns, at the positions marked in bold. Theoretically it should work to run SNPsplit with the option --verbose to get a lot of output for most of the decisions it is making. If you could identify a few of these reads (just reprocess the unassigned bam file) you should be able to either see that there are indeed reads with different bases than the SNPs you were expecting from the VCF file, or if there was a systematic problem within SNPsplit - which I don't hope there is... - it would be good if you could paste a few examples here so I can go and track the issue down!

I just had a look at the example SNPsplit report for bisulfites data which we put up on the Babraham website (https://www.bioinformatics.babraham.ac.uk/projects/SNPsplit/BS-seq_report.pdf) it does actually look like we saw a similar value of unexpected bases. Maybe the bisulfite treatment has something to do with this I am wondering?

Regarding the stringency of Bismark when aligning to a genome that is known to involve SNPs, or when the genome is not quite as finished as the human or mouse genomes are, I think it would be fair to relax the --score_min function a little to allow a few more 'errors' during the alignment, e.g. I would probably start by using --score_min L,0,-0.4.

I hope we will get to the bottom of this, All the best, Felix

lss1227 commented 7 years ago

Sorry for the late reply. Well, I do call SNPs on genome re-sequencing data. Then I write my own script to generate Ngenome. We don't sequence the maternal and paternal genomes. I used --verbose to see how it make decisions with unassigned bam file and writed a script to summarize the make up of reads which did not contain one of the expected bases at known SNP positions . And here is the result:

All snps in a read involves CT/GA SNPs
All snps in a read can not be found in ref or alt (that is, reads did not contain one of the expected bases at known SNP positions)
The condition of some snps in a read belongs to condition 1( snps involves CT/GA ), others belong to condition 2 (snps cannot be found in ref or alt)
Reads covers "NNNNN...NNNNN" part of reference.
SNPs in a read involves C/G.

The majority of the bold part is condition 1, which accounts for almost 99% of bold part. Condition 2, 3,4 are normal. I think I do find something about condition5. For example,here is a read in sam format: E00509:97:HFFG7ALXX:1:1101:11749:6442_1:N:0:ATTCTAAT 83 chr2 156693514 24 150M = 156693450 -214 CTTCATCACTACCTTTTACCTCTATCAACTCAATAATTACCTCCTTATACAACTTACTTCCGTTATCTATTATTAAAATACATTCATACACTAACAACCATTCTTCATTCCCTTTTCCCTATCTTTCTTTATAATACTAATAATTTAACA JJJFJFJJFA-FJJJJA<FJFJJJJFAAFJJJJFFJAJJJJJJJJJJJJJJFJF7F-JFF-JFJFJJJJJJAJJJJJFJJJJJJJFJFJJJJFAF-J7JJJJJJJJJF-JJJJJJJJJJJJ-JJJJJJJJJJJF--JFJJJFJ-JF-FAA NM:i:15 MD:Z:10G12G4G6G2G12G4A7G6G3G17G3G22N9G8G10 XM:Z:..........x............x...........h..h............x.........Z..h......h...h.................h....................................h........h.......... XR:Z:CT XG:Z:GA

SNPsplit running info: chr2 156693514 150M CTTCATCACTACCTTTTACCTCTATCAACTCAATAATTACCTCCTTATACAACTTACTTCCGTTATCTATTATTAAAATACATTCATACACTAACAACCATTCTTCATTCCCTTTTCCCTATCTTTCTTTATAATACTAATAATTTAACA 10G12G4G6G2G1 2G4A7G6G3G17G3G22N9G8G10 Read is a single contiguous match: 150M MD-tag: 10G12G4G6G2G12G4A7G6G3G17G3G22N9G8G10 10G12G4G6G2G12G4A7G6G3G17G3G22 9G8G10 element contains a Non-DIGIT: 10G12G4G6G2G12G4A7G6G3G17G3G22 (adjusting SNP position) SNP was present: ref: C snp: G pos: 120 CTTCATCACTACCTTTTACCTCTATCAACTCAATAATTACCTCCTTATACAACTTACTTCCGTTATCTATTATTAAAATACATTCATACACTAACAACCATTCTTCATTCCCTTTTCCCTATCTTTCTTTATAATACTAATAATTTAACA base in read was: A The base was different than both genome 1 or genome 2: A now adjusting the position for the N itself pos right now: 121

We can see that the read is from OB strand. In fact this read can be assigned to G2. Condition 5 accounts for about 0.5 %. I notice that SNPsplit do take SNP with C or G into account. But the judgement statement of score_bisulfite_SNPs may result in this error ? I have modified score_bisulfite_SNPs function in SNPsplit: score_bisulfite_SNPs. I think there is a more simple judgement statement but I write all in order not to miss any condition.

Thank you very much. (I hope I express myself clearly, haha~)

FelixKrueger commented 7 years ago

Just to say that I haven't forgotten about this, but I will need some time to read this carefully and will hopefully get to it soon.

FelixKrueger commented 7 years ago

Hi iss1227,

Apologies for the delay, but I have now looked at this issue in more detail. I agree that there were some sub-cases which had been overlooked previously. In more detail: only for C>G/G>C SNPs, and there only for either the top or bottom strand (depending on the position), and there only the bisulfite converted form the reads had not been assigned to one of the alleles. I have now tried to add this extra case to the logic that was already in place, and here are the new results using a 500000 line test case (129/Cast mouse hybrid):

Allele-tagging report Old(v0.3.2)
=====================
326578 reads were unassignable (65.32%)
84999 reads were specific for genome 1 (17.00%)
87250 reads were specific for genome 2 (17.45%)
53690 reads did not contain one of the expected bases at known SNP positions (10.74%)
1149 contained conflicting allele-specific SNPs (0.23%)

Allele-tagging report New (0.3.2_dev)
=====================
324270 reads were unassignable (64.86%)
86096 reads were specific for genome 1 (17.22%)
88432 reads were specific for genome 2 (17.69%)
51382 reads did not contain one of the expected bases at known SNP positions (10.28%)
1178 contained conflicting allele-specific SNPs (0.24%)

It seems that this rescues a few extra reads even though the overall gain was only around 0.2% per allele.

I have also tried out the code in the subroutine you linked above and this produced slightly different results:

Allele-tagging report (your bisulfite phasing subroutine)
=====================
332617 reads were unassignable (66.53%)
86145 reads were specific for genome 1 (17.23%)
80085 reads were specific for genome 2 (16.02%)
59729 reads did not contain one of the expected bases at known SNP positions (11.95%)
1129 contained conflicting allele-specific SNPs (0.23%)

I did not go into further detail understanding why your results differ (but one is going up the other one down...).

Initially I was a little puzzled that the total number of reads did not contain one of the expected bases at known SNP positions was still around 10%, but when I had a second look I realised that C>T or G>A SNPs (which are being ignored for the allele-assignment) would also result in reads that can't be assigned to either allele. Thus, the name of the counter is probably not ideal as it is confounding C>T SNPs (which are being ignored intentionally) and SNPs that really didn't contain one of the expected bases. For a future release I should be looking into a way of counting these reads differently.

So for the time being, would you mind cloning the latest development version and testing whether it resolves the issues you have seen so far? Many thanks! Felix

FelixKrueger commented 7 years ago

I have now had another go at the Allele-tagging report. I have introduced a new counter for the --bisulfite mode that aims to disentangle reads which were inassignable as a consequence of containing C>T SNPs, and reads that contained none of the expected bases at the position. The report now looks like this:

Allele-tagging report
=====================
Processed 499976 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
324270 reads were unassignable (64.86%)
86096 reads were specific for genome 1 (17.22%)
88432 reads were specific for genome 2 (17.69%)
49417 reads that were unassignable contained C>T SNPs preventing the assignment
1965 reads did not contain one of the expected bases at known SNP positions (0.39%)
1178 contained conflicting allele-specific SNPs (0.24%)

These two changes have now made it into a new release v0.3.3 because I believe it is important not to introduce the bias in allele-assignment. Thanks for bringing this to my attention, and I still welcome any further comments on this issue. Best, Felix

FelixKrueger / SNPsplit

Why are there so many reads not containing one of the expected bases at known SNP positions ? #11

Allele-tagging report