Interpreting output - Githubissues

ajritter8 commented 2 years ago

Hi there,

I was wondering how to interpret/decode the ids in the count table. For example, here are the first 6 transcripts listed:

ENST00000000233_ENSG00000004059
ENST00000000412_0_aligned_48604537_F_721_2756_1314_ENSG00000003056
ENST00000001008_1454_aligned_25668271_R_32_2176_363_12:2803000
ENST00000001008_2_aligned_48341933_F_84_3729_56_ENSG00000004478
ENST00000001008_535_aligned_33029419_R_12_546_13_ENSG00000004478
ENST00000001008_571_aligned_7373688_F_21_963_357_ENSG00000004478

Is there documentation or an explanation for what each underscore-separated string represents? And for measuring the abundance of a transcript (e.g. ENST00000001008), should I sum the counts for each of the ids beginning with "ENST00000001008"?

Lastly, does FLAIR quantify unannotated transcripts, and if so how are they assigned ids?

Jeltje commented 2 years ago

Hey, happy to help. I'm new to the code base myself, so can you please give me the full set of commands that led to your count table, starting with flair align?

ajritter8 commented 2 years ago

Sure (variables are used but hopefully you get the gist):

python flair.py align -g $genome -r $fastqDir/$sample \
  --secondary=no \
  -t 8 -o $outputDir/$sampleId

python flair.py correct --threads 26 \
  --query $sampleId.bed \
  --genome $genome \
  --gtf $genomeAnnotation \
  --output $sampleId

python flair.py collapse --threads 26 \
  --query ${sampleId}_all_corrected.bed \
  --reads $fastqDir/$sampleId.fastq.gz \
  --genome $genome \
  --gtf $genomeAnnotation \
  --output ${sampleId}_collapsed

python flair.py quantify --threads 26 \
  --reads_manifest ${sampleId}_reads_manifest.tsv \
  --isoforms ${sampleId}_collapsed.firstpass.fa \
  --tpm \
  --output ${sampleId}_flairCounts

Jeltje commented 2 years ago

Thanks, this helps a lot.

Why are you using firstpass.fa as input to quantify instead of isoforms.fa ?

ajritter8 commented 2 years ago

Well when I run Flair using the commands in my last reply, what I get after the “correct” step is $sampleId.firstpass.fa, and nothing else. I assumed “isoforms.fa” was a stand-in for whatever fasta file correct spits out.


Jeltje commented 2 years ago

firstpass.fa is an intermediate output file from flair collapse, not flair correct, but I think that's what you mean. The final output from flair collapse should be <prefix>.isoforms.fa, so that's not a stand-in.

Do you not get a ${sampleId}_collapsed.isoforms.fa?

Are you running with the latest release? (I'm still working on answering your original question, don't worry)
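One quick way to see which of the two files named above exist for a given collapse prefix is a small check like the sketch below. The function name `collapse_outputs_present` is hypothetical, and the only suffixes it looks for are the two mentioned in this thread (`.firstpass.fa` and `.isoforms.fa`); it makes no claim about what else flair collapse writes.

```python
from pathlib import Path


def collapse_outputs_present(prefix, suffixes=(".firstpass.fa", ".isoforms.fa")):
    """Map each expected file name (prefix + suffix) to whether it exists on disk.

    The default suffixes are taken from this thread; pass others if needed.
    """
    return {prefix + suffix: Path(prefix + suffix).is_file() for suffix in suffixes}


if __name__ == "__main__":
    # "sample1_collapsed" is a placeholder for the value passed to --output.
    for file_name, present in collapse_outputs_present("sample1_collapsed").items():
        print(("found    " if present else "missing  ") + file_name)
```

If the firstpass file is present but the isoforms file is missing, that would be consistent with collapse stopping before its final step, though the check alone cannot say why.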

Jeltje commented 2 years ago

Here is the naming convention used by Flair if you give it a GENCODE annotation:

If the isoform matches a transcript, the name of that transcript and its gene are used, e.g. ENST00000290299.7_ENSG00000241837.7
If the isoform has a splice pattern in common with a gene but doesn't match any of the known transcripts, a randomly selected read name is used in combination with the gene ID: 17a55928-3b95-4d82-8c1f-9d8187389cd5_ENSG00000241837.7
Lastly, if an isoform does not match a gene, a randomly selected read name is combined with the genome location: 87ab9469-20d0-44e9-8d61-6182f919aeb9_chr21:33903000
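The three cases above can be told apart programmatically by splitting each ID at its last underscore. The following is a minimal sketch, not FLAIR code: the function name `classify_isoform_id` and the regular expressions are assumptions derived only from the example IDs in this thread (Ensembl-style ENST/ENSG names and a `chrom:position` suffix), so other annotations would need different patterns.

```python
import re

# Patterns inferred from the examples in this thread, not from FLAIR's source.
ENSEMBL_TRANSCRIPT = re.compile(r"^ENST\d+(\.\d+)?$")
ENSEMBL_GENE = re.compile(r"^ENSG\d+(\.\d+)?$")
LOCUS = re.compile(r"^[^_:]+:\d+$")


def classify_isoform_id(isoform_id):
    """Split an ID at its last underscore and label which naming case it fits."""
    name, sep, suffix = isoform_id.rpartition("_")
    if not sep:
        return {"name": isoform_id, "suffix": None, "kind": "unrecognized"}
    if ENSEMBL_GENE.match(suffix):
        # Known gene: annotated if the first part is a transcript ID,
        # otherwise the first part is taken to be a read name.
        if ENSEMBL_TRANSCRIPT.match(name):
            kind = "annotated"
        else:
            kind = "novel_isoform_known_gene"
    elif LOCUS.match(suffix):
        kind = "novel_locus"
    else:
        kind = "unrecognized"
    return {"name": name, "suffix": suffix, "kind": kind}


if __name__ == "__main__":
    for example in (
        "ENST00000290299.7_ENSG00000241837.7",
        "17a55928-3b95-4d82-8c1f-9d8187389cd5_ENSG00000241837.7",
        "87ab9469-20d0-44e9-8d61-6182f919aeb9_chr21:33903000",
    ):
        print(classify_isoform_id(example)["kind"], example)
```

Splitting at the last underscore rather than the first matters because read names can themselves contain underscores.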


It looks like you are using our test set, which for some reason has read names that look like this:

>ENST00000225792_967_aligned_1879_F_1_2289_9
>ENST00000451605_0_aligned_1880_R_6_516_43
>ENST00000540698_734_aligned_1881_R_8_1302_106
>ENST00000581230_1169_aligned_1882_F_0_1734_3

I understand that this is very confusing; I will see if we can do better than that.
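Because the leading part of such an ID is a read name, IDs that start with the same ENST string are not necessarily the same transcript, so grouping by that prefix would be misleading. If gene-level totals are what is wanted, one option is to group rows by the text after the last underscore instead. The sketch below assumes a tab-separated count table with one header row, the ID in the first column, and numeric values in the remaining columns; that layout, the function name `sum_counts_by_suffix`, and the toy data are all illustrative assumptions, and whether summing is appropriate depends on the analysis.

```python
import csv
import io
from collections import defaultdict


def sum_counts_by_suffix(handle):
    """Sum each numeric column over rows sharing the text after the last underscore."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    totals = defaultdict(lambda: [0.0] * (len(header) - 1))
    for row in reader:
        if not row:
            continue  # skip blank lines
        suffix = row[0].rpartition("_")[2]  # gene ID or chrom:position
        for i, value in enumerate(row[1:]):
            totals[suffix][i] += float(value)
    return header[1:], dict(totals)


# Toy table: the column name and the counts are made up for illustration.
example = io.StringIO(
    "ids\tsample1\n"
    "ENST00000000233_ENSG00000004059\t10\n"
    "ENST00000001008_2_aligned_48341933_F_84_3729_56_ENSG00000004478\t3\n"
    "ENST00000001008_535_aligned_33029419_R_12_546_13_ENSG00000004478\t4\n"
)
samples, totals = sum_counts_by_suffix(example)
for suffix, values in totals.items():
    print(suffix, values)
```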

Jeltje commented 2 years ago

I am assuming this answered your question. Please reopen this ticket if you need more information.

ajritter8 commented 1 year ago

Great, thanks for that! Weirdly, I am not getting an “isoforms.fa” file along the way, so I’ll need to look into that. I installed whatever version of Flair I’m using within the last 2 months. Does Flair provide any information other than counts and a randomized ID for unannotated transcripts? I’m hoping to convert novel isoform IDs to intron strings for comparison with novel isoforms from another long-read splicing analysis tool, so I need to know how to get the information about novel isoforms out of Flair.

On Fri, Jul 22, 2022 at 1:35 PM Jeltje @.***> wrote:

I don't think the current version outputs transcripts with the word aligned in them, but let me know if you see otherwise.


Jeltje commented 1 year ago

It should give you .bed file outputs. If you don't see these, please open a new ticket.
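For the intron-string comparison asked about above, a .bed file with block columns is enough to reconstruct each isoform's intron chain. The sketch below assumes the file is in the standard 12-column BED layout (blockSizes in column 11, blockStarts in column 12, starts relative to chromStart); whether FLAIR's .bed output follows that layout exactly should be checked against a real file. The function name `bed12_intron_chain`, the output string format, and the example line are hypothetical.

```python
def bed12_intron_chain(line):
    """Return (name, intron_string) for one BED12 line.

    Intron coordinates are 0-based, half-open, like the BED line itself.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated BED columns")
    chrom, chrom_start, name, strand = fields[0], int(fields[1]), fields[3], fields[5]
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    introns = []
    for i in range(len(block_starts) - 1):
        # An intron runs from the end of one block to the start of the next.
        intron_start = chrom_start + block_starts[i] + block_sizes[i]
        intron_end = chrom_start + block_starts[i + 1]
        introns.append(f"{intron_start}-{intron_end}")
    return name, f"{chrom}:{strand}:" + ",".join(introns)


# Made-up three-exon record, for illustration only.
example_line = "\t".join([
    "chr21", "1000", "2000", "read1_chr21:1000", "0", "+",
    "1000", "2000", "0", "3", "100,200,300,", "0,400,700,",
])
print(bed12_intron_chain(example_line))
```

Two isoforms with the same chromosome, strand, and intron list get the same string, which makes it usable as a key when matching against another tool's novel isoforms, provided both tools use the same coordinate convention.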

BrooksLabUCSC / flair

Interpreting output #209