Understanding novel mode outputs

spvensko commented 2 months ago

I'm trying to better understand the output from the novel mode. The description from the usage states:

The output is a GTF file with a list of genes/isoforms and their relative abundance. However, while it appears to somewhat follow GTF conventions, I'm not sure how to interpret it.

Example output:

head HCC1395-Pt01-ar-711.liqa_novel_isos 
GeneName        IsoformName     ReadPerGene_corrected   RelativeAbundance       infor_ratio
AC090197.1      chr8    +       23336170        23366125        ENST00000519692.1,ENST00000517420.1,
AC090197.1      chr8    +       23336170        23336206        0,1,
AC090197.1      chr8    +       23336207        23337457        1,1,
AC090197.1      chr8    +       23341025        23341149        1,1,
AC090197.1      chr8    +       23341150        23341536        0,1,
AC090197.1      chr8    +       23363125        23366125        1,0,
LINC02774       chr1    +       243917401       244047317       ENST00000440494.1,ENST00000652928.1,
LINC02774       chr1    +       243917401       243917449       1,0,
LINC02774       chr1    +       243917450       243917526       1,1,

Let's consider the AC090197.1 portion specifically -- it's clear the first line is a "header" line for the gene which lists the gene, chromosome, strand, the start position of all possible splice variants, the end position of all possible splice variants, and the canonical transcripts that are observed with these splice variants.

So, for AC090197.1 chr8 + 23336170 23336206 0,1,, I believe it's stating that there is an included region at chr8:23336170-23336206 that is observed in ENST00000517420.1, but not ENST00000519692.1.

My questions:

Am I interpreting the output correctly?
How would situations of missing sequence (e.g. skipped exons) be represented in this format?

spvensko commented 2 months ago

Actually, it appears the contents of the novel mode are the contents of the refgene reference with the header from the quantify mode:

[spvensko@c6145-2-8 d93dc5a2a20de44d772037ba700622]$ head *refgene*
AC090197.1      chr8    +       23336170        23366125        ENST00000519692.1,ENST00000517420.1,
AC090197.1      chr8    +       23336170        23336206        0,1,
AC090197.1      chr8    +       23336207        23337457        1,1,
AC090197.1      chr8    +       23341025        23341149        1,1,
AC090197.1      chr8    +       23341150        23341536        0,1,
AC090197.1      chr8    +       23363125        23366125        1,0,
LINC02774       chr1    +       243917401       244047317       ENST00000440494.1,ENST00000652928.1,
LINC02774       chr1    +       243917401       243917449       1,0,
LINC02774       chr1    +       243917450       243917526       1,1,
LINC02774       chr1    +       243978409       243978519       1,1,
[spvensko@c6145-2-8 d93dc5a2a20de44d772037ba700622]$ head *novel*
GeneName        IsoformName     ReadPerGene_corrected   RelativeAbundance       infor_ratio
AC090197.1      chr8    +       23336170        23366125        ENST00000519692.1,ENST00000517420.1,
AC090197.1      chr8    +       23336170        23336206        0,1,
AC090197.1      chr8    +       23336207        23337457        1,1,
AC090197.1      chr8    +       23341025        23341149        1,1,
AC090197.1      chr8    +       23341150        23341536        0,1,
AC090197.1      chr8    +       23363125        23366125        1,0,
LINC02774       chr1    +       243917401       244047317       ENST00000440494.1,ENST00000652928.1,
LINC02774       chr1    +       243917401       243917449       1,0,
LINC02774       chr1    +       243917450       243917526       1,1,

The tool is still running, so perhaps the file will be further modified prior to completion?

Here's how I'm running novel mode:

liqa -task novel -refgene gencode.v37.annotation.with.hervs.gtf.refgene -bam HCC1395-Pt01-ar-711.sorted.bam -out HCC1395-Pt01-ar-711.liqa_novel_isos -num_cover 20 -num_support_read 10

agouru55 commented 2 months ago

Yes, this is the correct interpretation of the results, and will address the issue with the header from issue #42 - thanks for pointing that out! The output should be different from the refgene file based on coverage/read parameters that you have provided. Are you still having an issue with that?

WGLab / LIQA

Understanding novel mode outputs #43