FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
51 stars 19 forks source link

SNP level counting showing bi and mono allele count for same gene #68

Closed luckysardar closed 1 year ago

luckysardar commented 1 year ago

Hello Felix

SNPsplit is a great tool its helping me a lot in my analysis. To validate the output file I got snp level count from ASE readcounter for each gene for two separate genomes. individual genes are having handful number of SNPs but one observation I made is between SNPs in a gene showing both monoallelic and biallelic SNP count can you please suggest me what is the issue going on strain I am having is 129S1 vs CAST and am attaching one example file

FelixKrueger commented 1 year ago

Just to understand your setup, you constructed a dual hybrid 129S1/CAST strain with SNPsplit, then aligned RNA-seq data against it, and then did what exactly, run ASE readcounter, using which files? Where did you get the positions (and annotations) from, and how does this compare to the filtered list that is used by SNPsplit?

Conceptually, I think it is fairly easy to picture a scenario like this one:

ENSMUSG00000028989  4   148495681   30  48  Angptl7
ENSMUSG00000028989  4   148495686   41  51  Angptl7
ENSMUSG00000028989  4   148495785   47  28  Angptl7
ENSMUSG00000028989  4   148495809   45  32  Angptl7
ENSMUSG00000028989  4   148495841   37  33  Angptl7
ENSMUSG00000028989  4   148495844   37  32  Angptl7
ENSMUSG00000028989  4   148495966   28  16  Angptl7
ENSMUSG00000028989  4   148495980   31  16  Angptl7
ENSMUSG00000028989  4   148496076   24  36  Angptl7
ENSMUSG00000028989  4   148496128   21  39  Angptl7
ENSMUSG00000028989  4   148496299   26  26  Angptl7
ENSMUSG00000028989  4   148498078   57  49  Angptl7
ENSMUSG00000028989  4   148500125   0   72  Angptl7
ENSMUSG00000028989  4   148500308   25  25  Angptl7
ENSMUSG00000028989  4   148500456   13  12  Angptl7

Where you seem to have biallelic counts for most positions in the Angptl7 gene, only position 148500125 seems to be exclusively the CAST allele. In such a case, it would seem likely that the annotated position in 129S1 is incorrect, i.e. both alleles in your strain harbour the same (CAST) allele in this case.

Generally, I would be worried if you are getting very high counts for one allele, and exactly 0 for the other allele, as this might be a dropout. Situations like this look more credible to me:

ENSMUSG00000042271  X   142238172   0   27  Nxt2
ENSMUSG00000042271  X   142238313   2   14  Nxt2
ENSMUSG00000042271  X   142238460   6   42  Nxt2
ENSMUSG00000042271  X   142238554   2   43  Nxt2
ENSMUSG00000042271  X   142238748   0   23  Nxt2
ENSMUSG00000042271  X   142238809   2   27  Nxt2
ENSMUSG00000042271  X   142238836   1   38  Nxt2
ENSMUSG00000042271  X   142238967   0   16  Nxt2

Generally, I don't think I would have a biological explanation why you would see reads for only one allele for some part of a gene, and bi-allelic expression for another part. In a case like this:

ENSMUSG00000042396  9   48489361    0   269 Rbm7
ENSMUSG00000042396  9   48489437    0   320 Rbm7
ENSMUSG00000042396  9   48494073    145 302 Rbm7

You might have bi-allelic expression of Rbm7 and both positions 48489361 and 48489437 are incorrectly annotated for the 129S1 allele, or maybe something is 'wrong' with position 48494073 (which is 5kb further downstream)? Maybe that position is overlaps with another gene, or an antisense transcript of some sort? This is impossible to answer just by looking at some numbers, I would suggest you import the 129S1, CAST and Unassigned reads next to each other (e.g. in SeqMonk) and see if that yields a clue?

FelixKrueger commented 1 year ago

Haven't heard back in a while, New-Year-Closing.