Closed ChenfuShi closed 2 years ago
Hi, I've re run a bit of the file using the --verbose mode. I don't know if you can have a look at this and see if it's fine. One thing, when I generated the SNP list I only included the sites that were heterozygous. Could that be the reason it gives such a high percentage of unassigned? run_split.sh.txt
Hi @ChenfuShi
Just very generally, the rate at which reads can be assigned to a specific allele is a function of:
I'm not quite sure I understand the IGV image (there are a lot of colours...), but here are a few observations:
Of valid N containing reads,
N was present in the list of known SNPs: 12657518 (22.66%)
N was not present in the list of SNPs: 43212223 (77.34%)
Does that mean that you N-masked a lot more positions than you are actually sorting on?
It might also be a good idea to get @StevenWingett's thoughts on this, as I am not all that familiar with ChIC...
right, I masked all the known 1000 genomes SNPs, so maybe that's where the problem comes from?
Not exactly sure if 'problem' is the right word then (it seems to run fine from a technical point of view), but it would seem that you masked ~4 times more SNPs as Ns than you give SNPsplit to work with during the allele-splitting/sorting process. Still, 6% of allelel-specific reads doesn't sound terrible given you only used 2M SNP positions.....
Yeah, I guess that I can just try the next step of the analysis then. what do you mean I only used 2M SNP positions? should I have expected more? these are all the heterozygous positions that I could call from my samples.
Thanks!
Here is the number of positions you supplied (with the option --snp control1_SNPsplit,txt.gz
):
SNP coverage report
===================
SNP annotation file: control1_SNPsplit.txt.gz
SNPs stored in total: 2141764
so that's 2.1M. These are the positions you called as heterozygous in your samples. The positions you N-masked in the genome seem to have been all positions mentioned in the 1000 genomes project, and given that 77% of the N-masked positions were not present in your SNP annotation file I assumed that the number SNPs masked from the 1000 genomes project was probably at least 8M in total. If this were the case, it wouldn't really matter much in terms of the outcome, it's really just an attempt to explain the numbers.
right, that's good then! Thanks!
Hi, I am analysing some capture Hi-C data using the latest version of SNPsplit (0.5.0). The data was aligned using an N-masked genome using bowtie2 2.3.0 and the default HICUP 0.7.4 settings.
The software seems to run fine and generates the following report:
As you can see there are a lot of reads that did not contain one of the expected bases at known SNP positions. I have been checking the output bam files in IGV and it looks like that there are some reads that contain the correct SNPs but are just stored in the unassigned bam. Interestingly some reads from the same variant are split correctly, so I imagine the annotation file works correctly.
for example:
Is there anything I can do to make it better? Thanks!