WGLab / LongGF

A computational algorithm and software tool for fast and accurate detection of gene fusion by long-read transcriptome sequencing
GNU General Public License v3.0
22 stars 1 forks source link

Is there any way to output the read sequence that support the gene fusion? #10

Open zhxiaokang opened 3 years ago

zhxiaokang commented 3 years ago

In the output of LongGF, you have the number of supporting reads for each gene fusion detected. Is there an easy way to output those read sequences? As the raw sequences are very useful for downstream analysis and visualization

liuqianhn commented 3 years ago

@zhxiaokang The read sequences are not output by LongGF, but LongGF provides the supporting read information: read id, and breakpoints' coordinates of long reads. With the information, it is easy to grep the supporting reads and their alignments. For example, samtools view your-bam | grep "read-id1\|read-id2\|..." will output all supporting reads of interest and their aignment for further analysis. I also used this command together with the bam header to visualize how long reads support gene fusion in IGV.

majorkazer commented 3 years ago

@liuqianhn is there a command you can suggest to pull all of the read ids from the log file?

liuqianhn commented 2 years ago

@majorkazer There is no simple command for this. I wrote a simple python script for this purpose. Please check https://github.com/WGLab/LongGF/tree/master/tools.

zhxiaokang commented 2 years ago

Hi @liuqianhn I'm a bit confused by the position in the output. Taking one line in the output of LongGF in my real case:

38584947(+chr21:38584843-38584947/5578729c-a0fc-469f-ab15-8f1c8c11341a:1956-2060)1 41508078(+chr21:41508078-41508162/2056-2143)1

I interpret that the read 5578729c-a0fc-469f-ab15-8f1c8c11341a is a supporting read of the gene fusion (TMPRSS2-ERG fusion in this case). +chr21:38584843-38584947 and +chr21:41508078-41508162 are the positions on the genome and each of them is part of those two genes that the read was mapped to. 1956-2060 and 2056-2143 are the positions of the read and each of them is mapped to the two segments of the genome mentioned above. So I expect that they can match with each other: +chr21:38584843-38584947 with 1956-2060 and +chr21:41508078-41508162 with 2056-2143. Is my understanding correct?

UCSC genome browser says that +chr21:38584843-38584947 belongs to an exon of gene ERG, +chr21:41508078-41508162 belongs to an exon of gene TMPRSS2. Let's take the TMPRSS2 part as the study case.

The sequence of +chr21:41508078-41508162 goes as: CACCTGCCGCGCTCCAGGCGGCGCTCCCCGCCCCTCGCCCTCCGCCTCCGCCTCCGCCTCCTGCTTAGCTCGCGCCTACTCGGCC. Then I extracted the subsequence of read 5578729c-a0fc-469f-ab15-8f1c8c11341a by position 2056-2143 from the SAM file, and the subsequence goes as: ACTAAAGTAAATAGAACAACACAGTTTTGACCTTAACATACCGTTTAGTAATGCCATTTTAAGGAAAAACTACCTTGTATTTAAAAAA. Apparently, they can't be matched. So I'm confused. What's going wrong here?

liuqianhn commented 2 years ago

@zhxiaokang Yes, your understanding is correct.

I can confirm that the reference you posted for chr21 is from hg38. But I have no idea how do you get subsequence from sam, since when there are multiple alignments of a long read, only one alignment record has the full sequence, while others do not have. It will be helpful if you can share all alignments of "5578729c-a0fc-469f-ab15-8f1c8c11341a" to know why.

zhxiaokang commented 2 years ago

Hi, thank you very much for the quick reply!

I simply used grep to pick out all the lines containing string 5578729c-a0fc-469f-ab15-8f1c8c11341a from the SAM file:

grep 5578729c-a0fc-469f-ab15-8f1c8c11341a aln_barcode05.sam > 5578729c-a0fc-469f-ab15-8f1c8c11341a.sam

5578729c-a0fc-469f-ab15-8f1c8c11341a.sam is not a big file, so I will just paste the full content in that file here:

5578729c-a0fc-469f-ab15-8f1c8c11341a    16  ENST00000417133.6   35  50  84S28M1I19M1D17M1D25M1D4M2I17M1I4M1I12M1I37M1D114M1D43M1D19M1I15M1D45M1D11M1I11M4D29M1I33M2D6M1D3M2I8M1D10M1I15M1D47M1I32M1I7M1I11M2I26M1D45M1D42M1I4M2D9M1I39M2D12M1I16M2I11M2D6M3I8M1D39M1I7M1D21M1I17M1D38M1I23M1I9M1I16M1I10M2D3M1D26M1D16M1D19M1D2M1D5M1D17M1D6M1D29M1D20M1I22M2D30M1D5M2D28M1I6M2I19M2D11M2D2M1D17M1D10M1D33M2D10M1D11M1I41M1I7M1I13M1I17M2I31M1D3M2I5M2I5M1D10M1D15M4D8M1I6M1I8M1D4M2I7M1D18M1I23M3I34M1D24M2D24M1I48M1I17M1I27M1I12M1I6M1I10M2I17M1I4M3D11M1I37M2I4M1D14M1D10M1I5M1I5M1D14M1I6M1D18M1I14M1I16M1I9M1I12M3S   *   0   0   TGGGGGGAGTAGGCGCGAGCTTAAGCAGGAGGCGGAGGCGGAGGCGGAGGCGAGGGATGGCGGGGAGCGTCGCCTGAGCAGCGGCAGGTTATTCCAGGATCTTTGGAGACCCGGAGGAAAGTCGTGTTGACCAAAGCAAGACAAATGACCTCAGAGAAAAAAGATGGCAGAAACTGGGTATAACCAAAGCCGTCAGGTTTCTGGAACAGCCGGTAGGATGGGTTGGCTTACTGAAGGACATGATTCAGACTGTCCTGACCCAGCAGCTCATATCAAGGAAGCCTTATCAGTTGTGAGTGAGGACCAGTCGTTGTTTGAGTGTGCCTACGGAACGCCACACCTGGCTAAGACAGAGATGACCGCGTCCCCTCCAGCGACTATGGACAGATTTCCAAGATGAGCCCACGCGTCCTCAGCAGGATTGGTTGTCCTCAACCCCCAGCCAGGTCACCATCAAAATGGAATGCAACCCCAGTCAGGTGAATGGCTCATGAACTCTCCTTGATGAATGCAGGCCAAAGGCGGGAAGATGGTGGGCAGCCTGTGACACCGTTGGGATGAACTACGGCAGCTACAGAGGAGAGCCAATATGCCACCCCAAACATGGACCACGAACGAGCGAGAGTTATCGTGCCAGCAGATCCCACGCTATGGAGTACAGACCATGTGGCGGCAGTGGCTGGAGTGGGCGGTGAAAGAATAATGGCCTCTCCAGACGTCAACACATCTTGTTATTCCAGAACATCGATGGAAGGAACTGTGCAAGATGACCAAGGACGACTTCCAGAGGCTCACCCCAGCCACAACGCCGACATCCTTCTCTCACATCTCCACAATTTTCAGAGACTCCTTCCTCCACATTTGACTTCAGATGATGTTGATAAAGCCTTAAAACTCTCCACTTGTTAATGCATGCTAGGAAAACACAGGGGGCAGCTTTCTTTATTTTCCAAATACTTGGGTATATCCTGAAGCTACGCAAAGAATTGCCAACTAGCTAGATTTACCATATGAGCCTCCTCAGGAGATCAGCCTGACCGGTCACGGCCAATTCACGCCCCAGTCGAAAGCTGCCTCAACCATCTCCTTCCACAGTGCCCCACAACTGGAAGACCAGCGTCCTCAAGCCAGATCCATCGACTTTTGGACCAACAAGTAGCCGCCTGCAAATCCAGGCAGTGCCAGATCCAGCTTTGGCATTCTCCTGAGCTCCTGTCGGACAGTTCAACCCAGCTGCATCACCTGGGAAGGCACCAACGGGAGTTCAAGATGACGGATCCCCGACGAGGTGGCCCGGCGCTGGAGAGCGGAAGAGCAAACCCAACATGAACACGATGCTCAGCCGCGCCCTCCGTTACTACTATGGACAAGCAAACATCCGCACCAAGGTCCGGGAAGCGCTACCACAAGTTCGACTTCCACGGATCGCCCAGCCCTCCAGCCCCACCCCCCGGACACATCTCTGCCAGTACCCCCAGACCTCCCGCAATATGGGCTCCCATCACGCCCACCCACAGAAGATGAACTTCTGTGGCGCCCCCACCCTCCAGCCCCTCCCCGTGACATCTTCTCCAGTTTTTTTGCTGCCCCAAACCTATACTGAATTTTCACCTCAACTGGGGTATATACCCAACACTAGGTTCCCAGCCATGCTGCCTCTTTCATCTGGCACCCTTACTATAAAGACCTGGCGGAGGCTTTTTCCCATCAGCGTGCATCTACCTGCAGCCCATCGCCACAAACTCTATCGGAGAACATGACCAAAAGTGCCTCAAGAGGAATGAAAAGCTTTACTGGGGCTGGGGAAGGGAAGCCGGGGAAGAGATCCAAAGACTCTTGGGAGGGAGTTACTGAAGTTCAGATCACAGAAATGAGGGAGGATGCCAAAAATGTCACGAATATGGGACATATCATCTTGTGAATTTGACCTTGTAAAAAAGACAGTGTATGTAGGAAGGAAGTCTTAAGTGACAAAGTGCCAAAGAAAGTGGTCTTAAGAAATGTATGGGAACTTAGAGTAGAGTTTGAATCCCACTCAATGCAAAACTGGATGAAACTAAAGTAAATAGAACAACACAGTTTTGACCTTAACATACCGTTTAGTAATGCCATTTTAAGGAAAAACTACCTTGTATTTAAAAAAAT   ))/.+%('#$'##$&#0.%"(#%&''/('356+6=.($"6:77820;7,++,)%&&$#$%$054:2/+/$$%'*'.8.*%%$'32:@G85/0;778:L@:(*:969(3450*43'*'-3400-./.,*<<300.-(&**-4-87//>B/)/3(08:<988://6587-438B:5(((&&$#&9'%1.&/,+29<:66#&.$2)@=64(.'$&&%'%465:?A4202$<;?715C8AL<53+/9('8*?>4%*@6+$)47>:;:/27,+,+-<>?1146B?@@)416-.19,*+(.2,+357?/94*8BCDA<:;<9=730*'(&&&,%/19>:=,)$3<93*13)+*+(227>3BD-8C@;<0=)--92:@9<=<65/)&&,==>@=2031+)67<B>A95641?73.03:=2=36902?3FCBA:96+((*129..2*(*252=DG876'$78;=A=:97B=::D<;(69+$25'2,5C??=5533%%&)%*::3,:845%$;C>?>=<=3'%$&-73///.01:167=68816676:71/'((.@79714.+=;?947:-4,.,-/'&(&%'&')(%%($)$#$-$$6$141@9;53@B?2=8578.02$(3'-0)&&&%*)$/.666.$&/.2.3<?=660)5''''&$-:4-,*3469=4?7,',,%-2387545%4,A877>=8/9943A@;=;304+$..(%'$01.;A5:,25:;$)3F?>53?63*6AB@:AA@>G?C6BC@K?67<=>0+)%/-/)9>:??704/5?7:;;9*-,+57;?,$*,*.0(%99(%.&8136'$5<,708*063,9073-*(&&/-.),)(-/9=84&(FA12+/)*)<32=:B>ED@40'/((9A?=H=<95:427>35-(%$+;?4-.-92/2(&#+/.;20,)++(*),(?GPI;;//'.;<6;$+-')))'(7468:2)+&)-))@+(&$#%$*5696/)&-))))+0//7=<=E?==431;>5634/-%)3./,8&$.(+,.9:%$#4+#%'-3/2?<<66,*.1;42;144331,(%)'''2,30+1+*+435+=>/304.%&44?52.B>8*;-24%$+(1)*1-,&#3-?7<29H4/5A4380/<?@8$&%%.'%.072+(6&,)3,,0317C3@9>;=<;..4E.+52(2*&6))'78<AB+,'%/)')3=<53=5&'*(%$$(""$('5)./*'',++/$$'<<.*&'$&/:07691;><3=<>?8/-,8A:/%$2:%$7<859D;C>:;9368,77/0AA>;0)%:C?<<;<+151+0@I><327;&66,'9;=7?ED>>GG<)'8;4C2<,%+).,;7%%,-*60-228.1:,848>996+.-%),2?&40(#$55))-#%##$0+(231?90+//056333-5+)&<50-8-++8)&%%&$-+0?8++4-0-$$(),%>=?@;>;661277)%99($&&(%&%)(%$#$"$3+-2('1=B9=&;+%3*.$$-&-1&&,()'7227>.(,)8--*,%&-641D@@A2/$'#)&,683,%0,20A:>=188;?-2.C5;/)%(((3/145-.&6>:A?B441?GC;)0+(0)';=</*&-368,3<9++-100.*$))'5'(03:49%<>8&'$/6735-2/2'#&0,*1/-.*,'%)##$'$-012@222.)5',&%%)$$'+1<2692/&()'.636;&'+0(/.-*@DE=++6(-380%%&//&(*362*.A=:<411,5:1/)--=<>8C57@733*,90,)),+0/1-(''36;03@;..9CCDA&#%-(;:10<BO>?>D<><,6682&0)'7;:F<51%9@A86<:&;@;1+:8:9<0.39@;:89.1143>*%$%)-##/(*.9;889.(4614@B:51/<>@?==69:1*+899D:<=D0*3*/,,,*('%&%-/)$$'(.+-+,439;;<?98857''+'-1767,90++4<<HC<642)('.<:10-++,&$//-5;7-,,2=911066=;8//4+-&&&&4582/-,.$,2=EBA.+<56$.4455(/.'(+*;?6,4+,:7+2=>-047=1*'.7557@DB0A>+<1145/5331*%(&<;9?77.'8896'&*(%%0(&294A?2-0$;FJF>32,+2(%404362A???>B>>   NM:i:209    ms:i:1427   AS:i:1427   nn:i:0  tp:A:P  cm:i:106    s1:i:945    s2:i:918    de:f:0.0835 rl:i:0
5578729c-a0fc-469f-ab15-8f1c8c11341a    272 ENST00000442448.5   121 0   84S28M1I19M1D17M1D25M1D4M2I17M1I4M1I12M1I37M1D114M1D43M1D19M1I15M1D45M1D11M1I11M4D29M1I33M2D6M1D3M2I8M1D10M1I15M1D47M1I32M1I7M1I11M2I26M1D45M1D42M1I4M2D9M1I39M2D12M1I16M2I6M72I18M1I17M1D38M1I23M1I9M1I16M1I10M2D3M1D26M1D16M1D19M1D2M1D5M1D17M1D6M1D29M1D20M1I22M2D30M1D5M2D28M1I6M2I19M2D11M2D2M1D17M1D10M1D33M2D10M1D11M1I41M1I7M1I13M1I17M2I31M1D3M2I5M2I5M1D10M1D15M4D8M1I6M1I8M1D4M2I7M1D18M1I23M3I34M1D24M2D24M1I48M1I17M1I27M1I12M1I6M1I10M2I17M1I4M3D11M1I37M2I4M1D14M1D10M1I5M1I5M1D14M1I6M1D18M1I14M1I16M1I9M1I12M3S    *   0   0   *   *   NM:i:269    ms:i:1315   AS:i:1315   nn:i:0  tp:A:S  cm:i:105    s1:i:918    de:f:0.0825 rl:i:0

As you said, there are two records of that read but only one record has the raw read, and that's the read I used to extract the subsequence by the position 2056-2143.

liuqianhn commented 2 years ago

@zhxiaokang I am a little confused by the output here: the two alignments have significant overlap. I remembered that I filter such alignments for supporting gene fusion. If there is no privacy issue, I am happy to have a bam file and test it to see what's wrong.

zhxiaokang commented 2 years ago

Hi @liuqianhn there is indeed some privacy issue, since it's humans' sequencing data. May I have your email and discuss with you there?

majorkazer commented 2 years ago

@liuqianhn - thanks for writing that script to extract the read ids! It works great!