high reads did not contain one of the expected bases at known SNP positions

ChenfuShi commented 2 years ago

Hi, I am analysing some capture Hi-C data using the latest version of SNPsplit (0.5.0). The data was aligned using an N-masked genome using bowtie2 2.3.0 and the default HICUP 0.7.4 settings.

The software seems to run fine and generates the following report:

Input file:                 'control1CD4_R1_2.hicup.bam'
Writing allele-flagged output file to:      'control1CD4_R1_2.hicup.allele_flagged.bam'

Allele-tagging report
=====================
Processed 187573338 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
176308388 reads were unassignable (93.99%)
5578391 reads were specific for genome 1 (2.97%)
5578608 reads were specific for genome 2 (2.97%)
32792949 reads did not contain one of the expected bases at known SNP positions (17.48%)
107951 contained conflicting allele-specific SNPs (0.06%)

SNP coverage report
===================
SNP annotation file:    control1_SNPsplit.txt.gz
SNPs stored in total:   2141764
N-containing reads: 43954338
non-N:          143515439
total:          187573338
Reads had a deletion of the N-masked position (and were thus dropped):  103561 (0.06%)
Of which had multiple deletions of N-masked positions within the same read: 208

Of valid N containing reads,
N was present in the list of known SNPs:    12657518 (22.66%)
N was not present in the list of SNPs:      43212223 (77.34%)

As you can see there are a lot of reads that did not contain one of the expected bases at known SNP positions. I have been checking the output bam files in IGV and it looks like that there are some reads that contain the correct SNPs but are just stored in the unassigned bam. Interestingly some reads from the same variant are split correctly, so I imagine the annotation file works correctly.

for example:

Is there anything I can do to make it better? Thanks!

ChenfuShi commented 2 years ago

Hi, I've re run a bit of the file using the --verbose mode. I don't know if you can have a look at this and see if it's fine. One thing, when I generated the SNP list I only included the sites that were heterozygous. Could that be the reason it gives such a high percentage of unassigned? run_split.sh.txt

FelixKrueger commented 2 years ago

Hi @ChenfuShi

Just very generally, the rate at which reads can be assigned to a specific allele is a function of:

the number of SNPs (the more the better (up to a point))
the read length (the longer the better)
the experiment type (not exactly sure about capture Hi-C, but since this is quite close to genes there might not be a lot of reads in these functional regions)
the library type (paired-end is better)

I'm not quite sure I understand the IGV image (there are a lot of colours...), but here are a few observations:

143 M our of 187 M reads did not contain an overlap with N, so that's 76% that will go into the Unassigned fractions straight away
a further 11M reads can be assigned to either allele 1 or 2, which I think isn't all that bad considering that you only have 2M SNPs to start from
it is noteworthy though that the majority of reads spanning an N in the genome are not part of the SNP annotation you can SNPsplit for the allele-sorting:
```
Of valid N containing reads,
N was present in the list of known SNPs:    12657518 (22.66%)
N was not present in the list of SNPs:      43212223 (77.34%)
```
Does that mean that you N-masked a lot more positions than you are actually sorting on?

It might also be a good idea to get @StevenWingett's thoughts on this, as I am not all that familiar with ChIC...

ChenfuShi commented 2 years ago

right, I masked all the known 1000 genomes SNPs, so maybe that's where the problem comes from?

FelixKrueger commented 2 years ago

Not exactly sure if 'problem' is the right word then (it seems to run fine from a technical point of view), but it would seem that you masked ~4 times more SNPs as Ns than you give SNPsplit to work with during the allele-splitting/sorting process. Still, 6% of allelel-specific reads doesn't sound terrible given you only used 2M SNP positions.....

ChenfuShi commented 2 years ago

Yeah, I guess that I can just try the next step of the analysis then. what do you mean I only used 2M SNP positions? should I have expected more? these are all the heterozygous positions that I could call from my samples.

Thanks!

FelixKrueger commented 2 years ago

Here is the number of positions you supplied (with the option --snp control1_SNPsplit,txt.gz):

SNP coverage report
===================
SNP annotation file:    control1_SNPsplit.txt.gz
SNPs stored in total:   2141764

so that's 2.1M. These are the positions you called as heterozygous in your samples. The positions you N-masked in the genome seem to have been all positions mentioned in the 1000 genomes project, and given that 77% of the N-masked positions were not present in your SNP annotation file I assumed that the number SNPs masked from the 1000 genomes project was probably at least 8M in total. If this were the case, it wouldn't really matter much in terms of the outcome, it's really just an attempt to explain the numbers.

ChenfuShi commented 2 years ago

right, that's good then! Thanks!

FelixKrueger / SNPsplit

high reads did not contain one of the expected bases at known SNP positions #67