Inconsistent read assignment with bam file

airichli commented 1 week ago

Hi,

Thanks for developing this nice tool. I have met a problem, the read strand is different between .read_assignments.tsv and bam file, this leads to novel gene identification (antisense), but it seems to be from the known gene rather than novel gene, could you please give me some advice?

My command line is (the data is generate by ONT cDNA sequencing, isoquant version is version 3.5.0): isoquant.py -t 1 --reference mm10.chr19.fa --genedb gencode.vM23.annotation.chr19.gtf --fastq test.fastq.gz --model_construction_strategy default_ont --report_novel_unspliced true --data_type nanopore -o test_out --prefix test_out --complete_genedb

In the below example, one is "+", the other is "16", they are inconsistent. Thanks.

In the test_out.read_assignments.tsv file: ff4b5d98-6398-43ee-8b9b-6416e9083ee7 chr19 + ENSMUST00000113533.2 ENSMUSG00000024790.8 inconsistent intron_alternation_novel:6116267-6118343,tss_match_precise:2,correct_polya_site_left:6115998,antisense 6115998-6116266,6118344-6118561 gene_assignment=inconsistent; PolyA=True; Classification=novel_not_in_catalog;

However, in the bam file in aux folder ff4b5d98-6398-43ee-8b9b-6416e9083ee7 16 chr19 6115998 60 106S88M4I26M1I2M1D52M1D38M1D60M2077N160M1D57M135S 0 0 TCACAAGGTTTGAATCTTTGTCCCTCTCACTTGCCTGTCGCTCTATCTTCAGAGGAGAGTCCGCCGCCCGCAAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTGGAGATCATAATGAGTTCCTTTTATTGTTTGGGAATTATGAAAAAGAAGCCAGGTGCTCTGACCCAGGCCTTGTGGGAGACGAGCAAGACCCTCTCAAGCTGCAGAGCCAGGCTCGTGAATGTCACCTTCCTCAGCCATGACCACATCTTCCAGGGTGTGTCCCTGCACTTATTCCCCACTAATATGTGGTAGGCACCAGGGGGTGGAGTCCTTCCTCAGAGTATTGACCCCTCAGGAATACAACTCTGTCTTTATCCAAGGTCAGTCCATGGGTAGTTTGCAGCCGGACATGGTAGGGGTGTGGGTAGAGGACGCACACTCAGCATCGCCGGGCCAGCGGAAGCGCGGCACTGCATCCTGCCGTGGGCTCGAGGGACAGACTCGAGCCCCAGTCTGGCCAAACGCAGCCTGCCGCGGCTTTCTCATTTGCCGCTCCCGCGGCCTTTAGAAACGCCTCCCATAATCCATCGCGCCTGGCTTCCCCAAATGTGAACTCGAATGTTAATCGTAAAGCAATATCAGCACCAACAGAAAGAGAGGACAAAGGTTTCAACGCTTGTCACGGTTACGCCTTCGCCAATTGCTGGATAAAGACCGGTCTCAAGGGTTACATAGC $$&%&&'$#$$%%)&%%&&'&&%%&&%$%%$####%''&''&%&&''5799:9???BBFKIHEABDEMSHJJMINHGFKASRJLGKKGHHHILFHPSLIEHIC??>>=<=<=>E988+(((1*');:666:CEOSSDCDECEECCBEHHJJSLHSNFDEFIDBFIFGA>>IGGGEHGEIGLKS<<;4435***=>CBDDBDBCIHLHKSGCCBDDCDFBFEDFC==>H@C;3323237899::@9854490--.:FJIEDECDGDMLFGGHGGCDAHK+++-786AA??@@/.013-,,,,CC>:78=::;>MQSSSISGKSCDIEEC==<55;<@=@=:866899DHSSB:::;FHPJJGC>=>CIRSHGEDFHMGLOKNOFB@ABHK?812<=GLIKFDCC>HDMNOPCBCCHDBAAF>B856766;<>FHG@?@A?644335=EFCGDFJKGGIKSSHCAC??8766111142229@B?))),('(%%&$$$224BOGIM??>?>7311,/-,,++$%&$%%)4113-@CFSGQLSFHHFBC@?AAFCCB==JSHEE?IE9645=>=>CDB=EHJIKGLSKIJEIJSLGQLMHIGKMKIJSISGI>FH@AFEIHIHSSSSSSSSSSSSSSSGGFKSSSE;>@96--.2==<<9788ACC?CCA@=DCD9333?CGECB?3222333846541-,,,+)(%)+,.011-,%$ NM:i:10 ms:i:459 AS:i:423 nn:i:0 ts:A:- tp:A:P cm:i:125 s1:i:457 s2:i:0 de:f:0.0143 MD:Z:116^T52^A38^G61C158^G57 rl:i:39

andrewprzh commented 2 days ago

Dear @airichli

The strand indicated in read assignments and BAM files are two different strands and might not be equal in general.

The BAM file indicates mapping strand, i.e. whether a read was reverse-complemented during the alignment. Since ONT cDNA data is non-strand-specific (correct me if you use some specific protocol), strands in BAM files are distributed arbitrary ~50/50. Basically half of the reads are sequenced from cDNA with the same strand as the reference genome, half - from the reverse one.

Read assignments simple indicate the strand of the matching gene / isoform.

However, in your particular case read assignments also tell "antisense". This means the read was assigned to a + stand gene, but other properties, such as polyA or splice junctions tell that the strand is negative. I'll take a look at this case, thanks for the report. Could you also send me the resulting novel antisense gene/transcript?

Best Andrey

airichli commented 2 days ago

Thanks Andrey for your kind assistance. testdata.zip

I attach the testdata.zip, including 1) the resulting novel antisense gene/transcript(mouse_allsamples_chr19.transcript_models.gtf.ENSMUST00000113533.2) 2) read assigned to the transcript from the novel gene (mouse_allsamples_chr19.transcript_model_reads.tsv.transcript19671.chr19.nnic), including the test read ff4b5d98-6398-43ee-8b9b-6416e9083ee7 3) 1 test read fastq and the isoquant result runing with just 1 read (ff4b5d98-6398-43ee-8b9b-6416e9083ee7)

I checked the read ff4b5d98-6398-43ee-8b9b-6416e9083ee7 in IGV, and it should be derived from the known gene rather than antisense novel gene, but isoquant just say it is antisense.

Thanks again and looking forward to hearing from you.

ablab / IsoQuant

Inconsistent read assignment with bam file #250