Open zhxiaokang opened 3 years ago
@zhxiaokang The read sequences are not output by LongGF, but LongGF provides the supporting read information: read id, and breakpoints' coordinates of long reads. With the information, it is easy to grep the supporting reads and their alignments. For example, samtools view your-bam | grep "read-id1\|read-id2\|..."
will output all supporting reads of interest and their aignment for further analysis. I also used this command together with the bam header to visualize how long reads support gene fusion in IGV.
@liuqianhn is there a command you can suggest to pull all of the read ids from the log file?
@majorkazer There is no simple command for this. I wrote a simple python script for this purpose. Please check https://github.com/WGLab/LongGF/tree/master/tools.
Hi @liuqianhn I'm a bit confused by the position in the output. Taking one line in the output of LongGF in my real case:
38584947(+chr21:38584843-38584947/5578729c-a0fc-469f-ab15-8f1c8c11341a:1956-2060)1 41508078(+chr21:41508078-41508162/2056-2143)1
I interpret that the read 5578729c-a0fc-469f-ab15-8f1c8c11341a
is a supporting read of the gene fusion (TMPRSS2-ERG fusion in this case). +chr21:38584843-38584947
and +chr21:41508078-41508162
are the positions on the genome and each of them is part of those two genes that the read was mapped to. 1956-2060
and 2056-2143
are the positions of the read and each of them is mapped to the two segments of the genome mentioned above. So I expect that they can match with each other: +chr21:38584843-38584947
with 1956-2060
and +chr21:41508078-41508162
with 2056-2143
. Is my understanding correct?
UCSC genome browser says that +chr21:38584843-38584947
belongs to an exon of gene ERG, +chr21:41508078-41508162
belongs to an exon of gene TMPRSS2. Let's take the TMPRSS2 part as the study case.
The sequence of +chr21:41508078-41508162
goes as: CACCTGCCGCGCTCCAGGCGGCGCTCCCCGCCCCTCGCCCTCCGCCTCCGCCTCCGCCTCCTGCTTAGCTCGCGCCTACTCGGCC
. Then I extracted the subsequence of read 5578729c-a0fc-469f-ab15-8f1c8c11341a
by position 2056-2143
from the SAM file, and the subsequence goes as: ACTAAAGTAAATAGAACAACACAGTTTTGACCTTAACATACCGTTTAGTAATGCCATTTTAAGGAAAAACTACCTTGTATTTAAAAAA
. Apparently, they can't be matched. So I'm confused. What's going wrong here?
@zhxiaokang Yes, your understanding is correct.
I can confirm that the reference you posted for chr21 is from hg38. But I have no idea how do you get subsequence from sam, since when there are multiple alignments of a long read, only one alignment record has the full sequence, while others do not have. It will be helpful if you can share all alignments of "5578729c-a0fc-469f-ab15-8f1c8c11341a" to know why.
Hi, thank you very much for the quick reply!
I simply used grep
to pick out all the lines containing string 5578729c-a0fc-469f-ab15-8f1c8c11341a
from the SAM file:
grep 5578729c-a0fc-469f-ab15-8f1c8c11341a aln_barcode05.sam > 5578729c-a0fc-469f-ab15-8f1c8c11341a.sam
5578729c-a0fc-469f-ab15-8f1c8c11341a.sam
is not a big file, so I will just paste the full content in that file here:
5578729c-a0fc-469f-ab15-8f1c8c11341a 16 ENST00000417133.6 35 50 84S28M1I19M1D17M1D25M1D4M2I17M1I4M1I12M1I37M1D114M1D43M1D19M1I15M1D45M1D11M1I11M4D29M1I33M2D6M1D3M2I8M1D10M1I15M1D47M1I32M1I7M1I11M2I26M1D45M1D42M1I4M2D9M1I39M2D12M1I16M2I11M2D6M3I8M1D39M1I7M1D21M1I17M1D38M1I23M1I9M1I16M1I10M2D3M1D26M1D16M1D19M1D2M1D5M1D17M1D6M1D29M1D20M1I22M2D30M1D5M2D28M1I6M2I19M2D11M2D2M1D17M1D10M1D33M2D10M1D11M1I41M1I7M1I13M1I17M2I31M1D3M2I5M2I5M1D10M1D15M4D8M1I6M1I8M1D4M2I7M1D18M1I23M3I34M1D24M2D24M1I48M1I17M1I27M1I12M1I6M1I10M2I17M1I4M3D11M1I37M2I4M1D14M1D10M1I5M1I5M1D14M1I6M1D18M1I14M1I16M1I9M1I12M3S * 0 0 TGGGGGGAGTAGGCGCGAGCTTAAGCAGGAGGCGGAGGCGGAGGCGGAGGCGAGGGATGGCGGGGAGCGTCGCCTGAGCAGCGGCAGGTTATTCCAGGATCTTTGGAGACCCGGAGGAAAGTCGTGTTGACCAAAGCAAGACAAATGACCTCAGAGAAAAAAGATGGCAGAAACTGGGTATAACCAAAGCCGTCAGGTTTCTGGAACAGCCGGTAGGATGGGTTGGCTTACTGAAGGACATGATTCAGACTGTCCTGACCCAGCAGCTCATATCAAGGAAGCCTTATCAGTTGTGAGTGAGGACCAGTCGTTGTTTGAGTGTGCCTACGGAACGCCACACCTGGCTAAGACAGAGATGACCGCGTCCCCTCCAGCGACTATGGACAGATTTCCAAGATGAGCCCACGCGTCCTCAGCAGGATTGGTTGTCCTCAACCCCCAGCCAGGTCACCATCAAAATGGAATGCAACCCCAGTCAGGTGAATGGCTCATGAACTCTCCTTGATGAATGCAGGCCAAAGGCGGGAAGATGGTGGGCAGCCTGTGACACCGTTGGGATGAACTACGGCAGCTACAGAGGAGAGCCAATATGCCACCCCAAACATGGACCACGAACGAGCGAGAGTTATCGTGCCAGCAGATCCCACGCTATGGAGTACAGACCATGTGGCGGCAGTGGCTGGAGTGGGCGGTGAAAGAATAATGGCCTCTCCAGACGTCAACACATCTTGTTATTCCAGAACATCGATGGAAGGAACTGTGCAAGATGACCAAGGACGACTTCCAGAGGCTCACCCCAGCCACAACGCCGACATCCTTCTCTCACATCTCCACAATTTTCAGAGACTCCTTCCTCCACATTTGACTTCAGATGATGTTGATAAAGCCTTAAAACTCTCCACTTGTTAATGCATGCTAGGAAAACACAGGGGGCAGCTTTCTTTATTTTCCAAATACTTGGGTATATCCTGAAGCTACGCAAAGAATTGCCAACTAGCTAGATTTACCATATGAGCCTCCTCAGGAGATCAGCCTGACCGGTCACGGCCAATTCACGCCCCAGTCGAAAGCTGCCTCAACCATCTCCTTCCACAGTGCCCCACAACTGGAAGACCAGCGTCCTCAAGCCAGATCCATCGACTTTTGGACCAACAAGTAGCCGCCTGCAAATCCAGGCAGTGCCAGATCCAGCTTTGGCATTCTCCTGAGCTCCTGTCGGACAGTTCAACCCAGCTGCATCACCTGGGAAGGCACCAACGGGAGTTCAAGATGACGGATCCCCGACGAGGTGGCCCGGCGCTGGAGAGCGGAAGAGCAAACCCAACATGAACACGATGCTCAGCCGCGCCCTCCGTTACTACTATGGACAAGCAAACATCCGCACCAAGGTCCGGGAAGCGCTACCACAAGTTCGACTTCCACGGATCGCCCAGCCCTCCAGCCCCACCCCCCGGACACATCTCTGCCAGTACCCCCAGACCTCCCGCAATATGGGCTCCCATCACGCCCACCCACAGAAGATGAACTTCTGTGGCGCCCCCACCCTCCAGCCCCTCCCCGTGACATCTTCTCCAGTTTTTTTGCTGCCCCAAACCTATACTGAATTTTCACCTCAACTGGGGTATATACCCAACACTAGGTTCCCAGCCATGCTGCCTCTTTCATCTGGCACCCTTACTATAAAGACCTGGCGGAGGCTTTTTCCCATCAGCGTGCATCTACCTGCAGCCCATCGCCACAAACTCTATCGGAGAACATGACCAAAAGTGCCTCAAGAGGAATGAAAAGCTTTACTGGGGCTGGGGAAGGGAAGCCGGGGAAGAGATCCAAAGACTCTTGGGAGGGAGTTACTGAAGTTCAGATCACAGAAATGAGGGAGGATGCCAAAAATGTCACGAATATGGGACATATCATCTTGTGAATTTGACCTTGTAAAAAAGACAGTGTATGTAGGAAGGAAGTCTTAAGTGACAAAGTGCCAAAGAAAGTGGTCTTAAGAAATGTATGGGAACTTAGAGTAGAGTTTGAATCCCACTCAATGCAAAACTGGATGAAACTAAAGTAAATAGAACAACACAGTTTTGACCTTAACATACCGTTTAGTAATGCCATTTTAAGGAAAAACTACCTTGTATTTAAAAAAAT ))/.+%('#$'##$�.%"(#%&''/('356+6=.($"6:77820;7,++,)%&&$#$%$054:2/+/$$%'*'.8.*%%$'32:@G85/0;778:L@:(*:969(3450*43'*'-3400-./.,*<<300.-(&**-4-87//>B/)/3(08:<988://6587-438B:5(((&&$#&9'%1.&/,+29<:66#&.$2)@=64(.'$&&%'%465:?A4202$<;?715C8AL<53+/9('8*?>4%*@6+$)47>:;:/27,+,+-<>?1146B?@@)416-.19,*+(.2,+357?/94*8BCDA<:;<9=730*'(&&&,%/19>:=,)$3<93*13)+*+(227>3BD-8C@;<0=)--92:@9<=<65/)&&,==>@=2031+)67<B>A95641?73.03:=2=36902?3FCBA:96+((*129..2*(*252=DG876'$78;=A=:97B=::D<;(69+$25'2,5C??=5533%%&)%*::3,:845%$;C>?>=<=3'%$&-73///.01:167=68816676:71/'((.@79714.+=;?947:-4,.,-/'&(&%'&')(%%($)$#$-$$6$141@9;53@B?2=8578.02$(3'-0)&&&%*)$/.666.$&/.2.3<?=660)5''''&$-:4-,*3469=4?7,',,%-2387545%4,A877>=8/9943A@;=;304+$..(%'$01.;A5:,25:;$)3F?>53?63*6AB@:AA@>G?C6BC@K?67<=>0+)%/-/)9>:??704/5?7:;;9*-,+57;?,$*,*.0(%99(%.&8136'$5<,708*063,9073-*(&&/-.),)(-/9=84&(FA12+/)*)<32=:B>ED@40'/((9A?=H=<95:427>35-(%$+;?4-.-92/2(&#+/.;20,)++(*),(?GPI;;//'.;<6;$+-')))'(7468:2)+&)-))@+(&$#%$*5696/)&-))))+0//7=<=E?==431;>5634/-%)3./,8&$.(+,.9:%$#4+#%'-3/2?<<66,*.1;42;144331,(%)'''2,30+1+*+435+=>/304.%&44?52.B>8*;-24%$+(1)*1-,-?7<29H4/5A4380/<?@8$&%%.'%.072+(6&,)3,,0317C3@9>;=<;..4E.+52(2*&6))'78<AB+,'%/)')3=<53=5&'*(%$$(""$('5)./*'',++/$$'<<.*&'$&/:07691;><3=<>?8/-,8A:/%$2:%$7<859D;C>:;9368,77/0AA>;0)%:C?<<;<+151+0@I><327;&66,'9;=7?ED>>GG<)'8;4C2<,%+).,;7%%,-*60-228.1:,848>996+.-%),2?&40(#$55))-#%##$0+(231?90+//056333-5+)&<50-8-++8)&%%&$-+0?8++4-0-$$(),%>=?@;>;661277)%99($&&(%&%)(%$#$"$3+-2('1=B9=&;+%3*.$$-&-1&&,()'7227>.(,)8--*,%&-641D@@A2/$'#)&,683,%0,20A:>=188;?-2.C5;/)%(((3/145-.&6>:A?B441?GC;)0+(0)';=</*&-368,3<9++-100.*$))'5'(03:49%<>8&'$/6735-2/2'#&0,*1/-.*,'%)##$'$-012@222.)5',&%%)$$'+1<2692/&()'.636;&'+0(/.-*@DE=++6(-380%%&//&(*362*.A=:<411,5:1/)--=<>8C57@733*,90,)),+0/1-(''36;03@;..9CCDA&#%-(;:10<BO>?>D<><,6682&0)'7;:F<51%9@A86<:&;@;1+:8:9<0.39@;:89.1143>*%$%)-##/(*.9;889.(4614@B:51/<>@?==69:1*+899D:<=D0*3*/,,,*('%&%-/)$$'(.+-+,439;;<?98857''+'-1767,90++4<<HC<642)('.<:10-++,&$//-5;7-,,2=911066=;8//4+-&&&&4582/-,.$,2=EBA.+<56$.4455(/.'(+*;?6,4+,:7+2=>-047=1*'.7557@DB0A>+<1145/5331*%(&<;9?77.'8896'&*(%%0(&294A?2-0$;FJF>32,+2(%404362A???>B>> NM:i:209 ms:i:1427 AS:i:1427 nn:i:0 tp:A:P cm:i:106 s1:i:945 s2:i:918 de:f:0.0835 rl:i:0
5578729c-a0fc-469f-ab15-8f1c8c11341a 272 ENST00000442448.5 121 0 84S28M1I19M1D17M1D25M1D4M2I17M1I4M1I12M1I37M1D114M1D43M1D19M1I15M1D45M1D11M1I11M4D29M1I33M2D6M1D3M2I8M1D10M1I15M1D47M1I32M1I7M1I11M2I26M1D45M1D42M1I4M2D9M1I39M2D12M1I16M2I6M72I18M1I17M1D38M1I23M1I9M1I16M1I10M2D3M1D26M1D16M1D19M1D2M1D5M1D17M1D6M1D29M1D20M1I22M2D30M1D5M2D28M1I6M2I19M2D11M2D2M1D17M1D10M1D33M2D10M1D11M1I41M1I7M1I13M1I17M2I31M1D3M2I5M2I5M1D10M1D15M4D8M1I6M1I8M1D4M2I7M1D18M1I23M3I34M1D24M2D24M1I48M1I17M1I27M1I12M1I6M1I10M2I17M1I4M3D11M1I37M2I4M1D14M1D10M1I5M1I5M1D14M1I6M1D18M1I14M1I16M1I9M1I12M3S * 0 0 * * NM:i:269 ms:i:1315 AS:i:1315 nn:i:0 tp:A:S cm:i:105 s1:i:918 de:f:0.0825 rl:i:0
As you said, there are two records of that read but only one record has the raw read, and that's the read I used to extract the subsequence by the position 2056-2143
.
@zhxiaokang I am a little confused by the output here: the two alignments have significant overlap. I remembered that I filter such alignments for supporting gene fusion. If there is no privacy issue, I am happy to have a bam file and test it to see what's wrong.
Hi @liuqianhn there is indeed some privacy issue, since it's humans' sequencing data. May I have your email and discuss with you there?
@liuqianhn - thanks for writing that script to extract the read ids! It works great!
In the output of LongGF, you have the number of supporting reads for each gene fusion detected. Is there an easy way to output those read sequences? As the raw sequences are very useful for downstream analysis and visualization