Closed weishwu closed 5 years ago
This is already done internally by bam2hits, don't worry. Using --best --strata just helps get rid of some of the sub-optimal alignments beforehand.
@eturro
Thanks for your quick reply.
But I'm seeing some obviously wrong results which was caused by this issue. For example, I have a transcript that contains only a single SNP. When I quantified allele-specific expression at this SNP, it clearly showed strong paternal-biased expression (more than 95% reads support paternal allele), however the results from MMSeq showed strong maternal-biased expression. After some detective work, I figured it out that the major (not 100%) problem is that bowtie could not distinguish between the paternal version and the maternal version of this transcript because it is only a single nucleotide mismatch (because the default allows 2 mismatches) so it reported alignment to both in equal amount. I tried redoing the alignment by requiring zero mismatch then the results showed correct paternal-biased expression. For other transcripts, this does not necessarily cause flipped direction but can reduce the amplitude of allele-specific expression. Only after I required zero mismatch the results started to make sense. But rather than requiring zero mismatch, the ideal way is to let bowtie report only the best stratum.
@eturro
Let me show you an example:
Below is mmseq results for the transcripts of IGF2. IGF2 is a maternally-imprinted gene in my samples. Column 4 and 5 is log_mu of paternal and maternal versions respectively and column 5 is column4-column5 (log ratio): ENST00000381392.5 ENSG00000167244.21 IGF2 1.56433 -2.22791 3.79224 ENST00000381395.5 ENSG00000167244.21 IGF2 9.766810000000001 4.06606 5.700750000000001 ENST00000381406.8 ENSG00000167244.21 IGF2 8.01555 -6.4431 14.458649999999999 ENST00000416167.7 ENSG00000167244.21 IGF2 7.79913 -4.31979 12.11892 ENST00000434045.6 ENSG00000167244.21 IGF2 -7.343839999999999 4.61276 -11.956599999999998
As you can see, while all other transcripts are maternally imprinted, ENST00000434045.6 is paternally imprinted. And the first two transcripts show lower degree of imprinting than the next two. When I looked at SNP level ASE, all the SNPs in these transcripts have strong maternal imprinting.
I tried rerunning it by requiring zero mismatch in bowtie mapping, and got these results: ENST00000381392.5 ENSG00000167244.21 IGF2 7.73735 -4.58045 12.3178 ENST00000381395.5 ENSG00000167244.21 IGF2 9.3704 2.71391 6.65649 ENST00000381406.8 ENSG00000167244.21 IGF2 7.44733 -5.74605 13.193380000000001 ENST00000416167.7 ENSG00000167244.21 IGF2 7.612660000000001 -2.5921 10.20476 ENST00000434045.6 ENSG00000167244.21 IGF2 4.867319999999999 -9.99843 14.86575
Now all transcripts show strong maternal imprinting, consistent with the SNP level results. Well, I'd expect ENST00000381395.5 to be similar with others, but at least there is no flipping any more.
The diploid transcriptome was created using the phased SNPs, so the transcript level ASE should be consistent with the SNP level ASE when all SNPs in this gene are strongly maternally imprinted.
Could you use hitstools on the hits file to identify a read pair that is not being assigned correctly to the best-matching transcript(s) only?
The "NM" tags of two mapped reads in a pair are added up and used to retain only the read pairs in the best mismatch stratum, so this shouldn't happen...
On 11 Nov 2019, at 16:07, weishwu notifications@github.com wrote:
Let me show you an example:
Below is mmseq results for the transcripts of IGF2. IGF2 is a maternally-imprinted gene in my samples. Column 4 and 5 is log_mu of paternal and maternal versions respectively and column 5 is column4-column5 (log ratio): ENST00000381392.5 ENSG00000167244.21 IGF2 1.56433 -2.22791 3.79224 ENST00000381395.5 ENSG00000167244.21 IGF2 9.766810000000001 4.06606 5.700750000000001 ENST00000381406.8 ENSG00000167244.21 IGF2 8.01555 -6.4431 14.458649999999999 ENST00000416167.7 ENSG00000167244.21 IGF2 7.79913 -4.31979 12.11892 ENST00000434045.6 ENSG00000167244.21 IGF2 -7.343839999999999 4.61276 -11.956599999999998
As you can see, while all other transcripts are maternally imprinted, ENST00000434045.6 is paternally imprinted. And the first two transcripts show lower degree of imprinting than the next two. When I looked at SNP level ASE, all the SNPs in these transcripts have strong maternal imprinting.
I tried rerun it by requiring zero mismatch in bowtie mapping, and got these results: ENST00000381392.5 ENSG00000167244.21 IGF2 7.73735 -4.58045 12.3178 ENST00000381395.5 ENSG00000167244.21 IGF2 9.3704 2.71391 6.65649 ENST00000381406.8 ENSG00000167244.21 IGF2 7.44733 -5.74605 13.193380000000001 ENST00000416167.7 ENSG00000167244.21 IGF2 7.612660000000001 -2.5921 10.20476 ENST00000434045.6 ENSG00000167244.21 IGF2 4.867319999999999 -9.99843 14.86575
As you can see, now all transcripts show strong maternal imprinting, consistent with the SNP level results.
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub, or unsubscribe.
I'll take a look into the hits file. Sorry I opened a duplicated post because I thought this one was closed and lost. You can close and remove the other one. Thanks.
Here is one example.
Read "K00135:270:HTV7LBBXX:5:1101:10013:12902" was aligned to both ENST00000434045.6_A (with no mismatch) and ENST00000434045.6_B (with one mismatch) by Bowtie.
K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_B 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:1 MD:Z:82C2 NM:i:1 XM:i:15
K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_A 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:0 MD:Z:85 NM:i:0 XM:i:15
And below is from bam2hits output:
K00135:270:HTV7LBBXX:5:1101:10013:12902 ENST00000356578.8_A ENST00000356578.8_B ENST00000381392.5_A ENST00000381392.5_B ENST00000381395.5_A ENST00000381395.5_B ENST00000381406.8_A ENST00000381406.8_B ENST00000416167.7_A ENST00000416167.7_B ENST00000434045.6_A ENST00000434045.6_B ENST00000643349.1_A ENST00000643349.1_B
This is my command line to run bowtie, picard and bam2hits:
bowtie -a --best --strata -S -m 100 -X 500 --chunkmbs 256 -p 2 Sample_118389_diploid_transcripts_forMMSeq -1 Sample_105284_R1_trimmed.fastq.gz -2 Sample_105284_R2_trimmed.fastq.gz >Sample_105284_bowtieDiploid.sam
gatk SortSam -I Sample_105284_bowtieDiploid.sam -O Sample_105284_bowtieDiploid.bam -SO queryname
bam2hits-linux Sample_118389_diploid_transcripts_forMMSeq.fa Sample_105284_bowtieDiploid.bam > Sample_105284_bowtieDiploid.hits
bam2hits version: 1.0.9
Since "--best --strata" was turned on, I assume the problem was because it was not applied to PE. Is there an option in bam2hits to turn on paired-end mode or it's automatic? I didn't see an option.
You added those asterisks manually, I assume.
It works with paired-end mode automatically. You haven't shown all the alignments for this read against these two transcripts, have you? If it's paired-end then you should have two entries per transcript for this read but you're only showing one entry for each of these two transcripts.
On 12 Nov 2019, at 00:11, weishwu notifications@github.com wrote:
Here is one example.
Read "K00135:270:HTV7LBBXX:5:1101:10013:12902" was aligned to both ENST00000434045.6_A (with no mismatch) and ENST00000434045.6_B (with one mismatch) by Bowtie.
K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_B 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:1 MD:Z:82C2 NM:i:1 XM:i:15 K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_A 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:0 MD:Z:85 NM:i:0 XM:i:15 And below is from bam2hits output:
K00135:270:HTV7LBBXX:5:1101:10013:12902 ENST00000356578.8_A ENST00000356578.8_B ENST00000381392.5_A ENST00000381392.5_B ENST00000381395.5_A ENST00000381395.5_B ENST00000381406.8_A ENST00000381406.8_B ENST00000416167.7_A ENST00000416167.7_B ENST00000434045.6_A ENST00000434045.6_B ENST00000643349.1_A ENST00000643349.1_B
Is there an option in bam2hits to turn on paired-end mode?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/eturro/mmseq/issues/42?email_source=notifications&email_token=ABBTMEATOFWBSTTUSJZ4DE3QTI3J3A5CNFSM4JL2SFZ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDZBR6Y#issuecomment-552737019, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBTMEHY2FJHNNKJWX5Y3F3QTI3J3ANCNFSM4JL2SFZQ.
I added asterisks to bold the two transcripts, if that was what you meant.
Here are all alignments from the pair in this transcript. R2 mapped to both _A and _B without mismatch because it fell in the region with no SNP. R1 fell in the region with one SNP so it mapped to _B with one mismatch. I'd like to retain only the mapping to _A in this case.
K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_B 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:1 MD:Z:82C2 NM:i:1 XM:i:15
K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_A 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:0 MD:Z:85 NM:i:0 XM:i:15
K00135:270:HTV7LBBXX:5:1101:10013:12902 163 ENST00000434045.6_B 1130 255 134M = 1353 308 CCATCCTGCAGCCTCCTCCTGACCACGGACGTTTCCATCAGGTTCCATCCCGAAAATCTCTCGGTTCCACGTCCCCCTGGGGCTTCTCCTGACCCAGTCCCCGTGCCCCGCCTCCCCGAAACAGGCTACTCTCC AAFFFJJJJJJJJAJJJJJJJJJJJJJJJJJJJFJJJJJJJJFFJJJJJJJJJJJJJJJJJJAJJJJJJJFJJJJJAAJFAAJJJJJFFJJJJJJJJJJJJJJJJJJJJ<JJ<AJJJ<FJJJJA-AF-FFA<7F XA:i:0 MD:Z:134 NM:i:0 XM:i:15
K00135:270:HTV7LBBXX:5:1101:10013:12902 163 ENST00000434045.6_A 1130 255 134M = 1353 308 CCATCCTGCAGCCTCCTCCTGACCACGGACGTTTCCATCAGGTTCCATCCCGAAAATCTCTCGGTTCCACGTCCCCCTGGGGCTTCTCCTGACCCAGTCCCCGTGCCCCGCCTCCCCGAAACAGGCTACTCTCC AAFFFJJJJJJJJAJJJJJJJJJJJJJJJJJJJFJJJJJJJJFFJJJJJJJJJJJJJJJJJJAJJJJJJJFJJJJJAAJFAAJJJJJFFJJJJJJJJJJJJJJJJJJJJ<JJ<AJJJ<FJJJJA-AF-FFA<7F XA:i:0 MD:Z:134 NM:i:0 XM:i:15
Yes, I agree that only the alignments to ENST00000434045.6_A should have been kept. Alignments to ENST00000434045.6_B should have been removed by the stratum filter in bam2hits.
Could you provide a link to a minimal set of files to reproduce this issue please?
I just need the fasta file and a mini bam so I can run bam2hits.
thanks Ernest
On 12 Nov 2019, at 10:48, weishwu notifications@github.com wrote:
I added asterisks to bold the two transcripts, if that was what you meant.
Here all alignments from the pairs. R2 mapped to both _A and _B without mismatch because it fell in the region with no SNP. R1 fell in the region with one SNP so it mapped to _B with one mismatch.
K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_B 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:1 MD:Z:82C2 NM:i:1 XM:i:15
K00135:270:HTV7LBBXX:5:1101:10013:12902 83 ENST00000434045.6_A 1353 255 85M = 1130 -308 CCCCCAAATTATCCCCAATTATCCCCACACATAAAAAATCAAAACATTAAACTAACCCCCTTCCCCCCCCCCCACAACAACCGTC 7<-7777-AA-J-<A77<FJAFJJJJJFJJJJJJJJJJJJJJJJJFFFAJFJFJJJJJFAAFJJJJJJJJJJJFJFAJFJFFAAA XA:i:0 MD:Z:85 NM:i:0 XM:i:15
K00135:270:HTV7LBBXX:5:1101:10013:12902 163 ENST00000434045.6_B 1130 255 134M = 1353 308 CCATCCTGCAGCCTCCTCCTGACCACGGACGTTTCCATCAGGTTCCATCCCGAAAATCTCTCGGTTCCACGTCCCCCTGGGGCTTCTCCTGACCCAGTCCCCGTGCCCCGCCTCCCCGAAACAGGCTACTCTCC AAFFFJJJJJJJJAJJJJJJJJJJJJJJJJJJJFJJJJJJJJFFJJJJJJJJJJJJJJJJJJAJJJJJJJFJJJJJAAJFAAJJJJJFFJJJJJJJJJJJJJJJJJJJJ<JJ<AJJJ<FJJJJA-AF-FFA<7F XA:i:0 MD:Z:134 NM:i:0 XM:i:15
K00135:270:HTV7LBBXX:5:1101:10013:12902 163 ENST00000434045.6_A 1130 255 134M = 1353 308 CCATCCTGCAGCCTCCTCCTGACCACGGACGTTTCCATCAGGTTCCATCCCGAAAATCTCTCGGTTCCACGTCCCCCTGGGGCTTCTCCTGACCCAGTCCCCGTGCCCCGCCTCCCCGAAACAGGCTACTCTCC AAFFFJJJJJJJJAJJJJJJJJJJJJJJJJJJJFJJJJJJJJFFJJJJJJJJJJJJJJJJJJAJJJJJJJFJJJJJAAJFAAJJJJJFFJJJJJJJJJJJJJJJJJJJJ<JJ<AJJJ<FJJJJA-AF-FFA<7F XA:i:0 MD:Z:134 NM:i:0 XM:i:15
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
I've shared the data with you (et341@cam.ac.uk) by gdrive. Let me know if you didn't get it:
ENST00000434045_aln.bam: alignment in transcript ENST00000434045_A and ENST00000434045_B Sample_118389_diploid_transcripts_forMMSeq.fa.gz: the full fasta file used as the reference in alignment ENST00000434045.fa: the sequences of ENST00000434045_A and ENST00000434045_B only
Thanks!
Any news? Thanks.
Yes I found a bug but in very busy these days travelling. Will send fix soon. Thanks
Ernest
On 14 Nov 2019, at 13:14, weishwu notifications@github.com wrote:
Any news? Thanks.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
OK. Thanks!
Could you try the bam2hits.cpp in the new testing branch please?
On 14 Nov 2019, at 20:10, weishwu notifications@github.com wrote:
OK. Thanks!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Could you tell me how to use bam2hits.cpp? Thanks.
You need to compile it with make bam2hits
and then you can run it. Have you not been compiling? I.e. have you been using the statically pre-compiled binaries?
On 19 Nov 2019, at 03:53, weishwu notifications@github.com wrote:
Could you tell me how to use bam2hits.cpp? Thanks.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
I did try compiling it but it failed by saying:
In file included from hitsio.cpp:5:0: hitsio.hpp:5:49: fatal error: boost/serialization/array_wrapper.hpp: No such file or directory
include <boost/serialization/array_wrapper.hpp>
^
compilation terminated. hitsio.hpp:5:49: fatal error: boost/serialization/array_wrapper.hpp: No such file or directory
include <boost/serialization/array_wrapper.hpp>
^
compilation terminated. In file included from hitsio.cpp:5:0: hitsio.hpp:5:49: fatal error: boost/serialization/array_wrapper.hpp: No such file or directory
include <boost/serialization/array_wrapper.hpp>
^
compilation terminated. make: *** [hitsio.o] Error 1
I tried installing hitslib but it also complained not being able to find boost libs (regex.hpp etc). I tried editing the path to boost in bam2hits.cpp but it didn't work. I was not sure if I was on the right track.
You need GCC >5.3, hitslib, gsl, lapack, armadillo, boost >1.64, zlib
On 19 Nov 2019, at 15:20, weishwu notifications@github.com wrote:
I did try compiling it but it failed by saying:
In file included from hitsio.cpp:5:0: hitsio.hpp:5:49: fatal error: boost/serialization/array_wrapper.hpp: No such file or directory
include <boost/serialization/array_wrapper.hpp>
^ compilation terminated. hitsio.hpp:5:49: fatal error: boost/serialization/array_wrapper.hpp: No such file or directory
include <boost/serialization/array_wrapper.hpp>
^ compilation terminated. In file included from hitsio.cpp:5:0: hitsio.hpp:5:49: fatal error: boost/serialization/array_wrapper.hpp: No such file or directory
include <boost/serialization/array_wrapper.hpp>
^ compilation terminated. make: *** [hitsio.o] Error 1
I tried installing hitslib but it also complained not being able to find boost libs (regex.hpp etc). Sorry I'm not familiar with this.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Both my colleague and I tried but we were not able to compile it. I wondered if you could please provide the binary file. Thanks.
Pushed binary to testing in commit 08197b3..e92b69c
Let me know how you get on please
On 20 Nov 2019, at 18:48, weishwu notifications@github.com wrote:
Both my colleague and I tried but we were not be able to compile it. I wondered if you could please provide the binary file. Thanks.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
I checked the same read in the new bam2hits output and now only the hits on the paternal transcript are reported. I think the bug was fixed. Thanks a lot!
$ grep K00135:270:HTV7LBBXX:5:1101:10013:12902 Sample_105284_bowtieDiploid.hits.txt -A8
>K00135:270:HTV7LBBXX:5:1101:10013:12902
ENST00000356578.8_A
ENST00000381392.5_A
ENST00000381395.5_A
ENST00000381406.8_A
ENST00000416167.7_A
ENST00000434045.6_A
ENST00000643349.1_A
>K00135:270:HTV7LBBXX:5:1101:10013:16243
By the way, if I use bowtie2 to map the reads, can bam2hits still do this stratum selection? I see the MMSeq manual specifically says "not Bowtie2" and was wondering why. I understand bowtie1 has some unique and convenient parameters such as "-m and -strata". But is bowtie2 better for reads longer than 50bp? My read length is 150bp. I'm just curious if MMSeq can also take bowtie2 alignments.
I'm sorry but something does not make sense to me. Previously, though the quantification of ENST00000434045.6 expression was wrong, but three other IGF2 transcripts made sense:
transcript geneSymbol log_mu_paternal log_mu_maternal
ENST00000381392.5 IGF2 1.56433 -2.22791
ENST00000381395.5 IGF2 9.766810000000001 4.06606
ENST00000381406.8 IGF2 8.01555 -6.4431
ENST00000416167.7 IGF2 7.79913 -4.31979
ENST00000434045.6 IGF2 -7.343839999999999 4.61276
As you can see for ENST00000381406.8 and ENST00000416167.7 paternal transcripts were highly expressed while maternal transcripts were almost not expressed, which was consistent with both our expectation and our SNP level data.
However, the MMSeq results from the new bam2hits output show that both paternal and maternal IGF2 transcripts are expressed and the levels are close:
transcript geneSymbol log_mu_paternal log_mu_maternal
ENST00000381392.5 IGF2 6.25017 6.29855
ENST00000381395.5 IGF2 8.56896 7.478510000000001
ENST00000381406.8 IGF2 8.153410000000001 7.316730000000001
ENST00000416167.7 IGF2 8.05054 7.3203
ENST00000434045.6 IGF2 6.12244 5.3816
I thought by selecting the best stratum the new results should show even stronger paternal preference for IGF2 because the hits on the maternal transcripts were now eliminated, but the MMSeq results actually go the other way.
If I simply count the number of hits on ENST00000381406.8 the data was as below from the previous bam2hits output:
2824535 ENST00000381406.8_A
2481924 ENST00000381406.8_B
And this is from the new bam2hits output:
2054766 ENST00000381406.8_A
1570441 ENST00000381406.8_B
It seems there is a lot of hassle by using bowtie as the aligner. If I use kallisto could I use the bam2hits in your official release? Do I still have to worry about the stratum issue?
As the Bowtie manual says, "--best --strata" does not apply for paired-end reads. What's the best solution to this? I really need to get only the best stratum.