Closed ajritter8 closed 2 years ago
Hey, happy to help. I'm new to the code base myself, so can you please give me the full set of commands that led to your count table, starting with flair align?
Sure (variables are used but hopefully you get the gist):
python flair.py align -g $genome -r $fastqDir/$sample \
--secondary=no \
-t 8 -o $outputDir/$sampleId
python flair.py correct --threads 26 \
--query $sampleId.bed \
--genome $genome \
--gtf $genomeAnnotation \
--output $sampleId
python flair.py collapse --threads 26 \
--query ${sampleId}_all_corrected.bed \
--reads $fastqDir/$sampleId.fastq.gz \
--genome $genome \
--gtf $genomeAnnotation \
--output ${sampleId}_collapsed
python flair.py quantify --threads 26 \
--reads_manifest ${sampleId}_reads_manifest.tsv \
--isoforms ${sampleId}_collapsed.firstpass.fa \
--tpm \
--output ${sampleId}_flairCounts
Thanks, this helps a lot.
Why are you using firstpass.fa as input to quantify instead of isoforms.fa?
Well when I run Flair using the commands in my last reply, what I get after the “correct” step is $sampleId.firstpass.fa, and nothing else. I assumed “isoforms.fa” was a stand-in for whatever fasta file correct spits out.
firstpass.fa is an intermediate output file from flair collapse, not flair correct, but I think that's what you mean.
The final output from flair collapse should be <prefix>.isoforms.fa, so that's not a stand-in.
Do you not get a ${sampleId}_collapsed.isoforms.fa?
Are you running with the latest release? (I'm still working on answering your original question, don't worry)
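Here is the naming convention used by Flair if you give it a GENCODE annotation:
- If the isoform matches a transcript, the name of that transcript and its gene are used, e.g. ENST00000290299.7_ENSG00000241837.7
- If the isoform has a splice pattern in common with a gene but doesn't match any of the known transcripts, a randomly selected read name is used in combination with the gene ID: 17a55928-3b95-4d82-8c1f-9d8187389cd5_ENSG00000241837.7
- Lastly, if an isoform does not match a gene, a randomly selected read name is combined with the genome location: 87ab9469-20d0-44e9-8d61-6182f919aeb9_chr21:33903000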
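For anyone who wants to sort IDs from the count table into these three groups, the naming convention above can be sketched roughly as follows. This is only an illustration, not FLAIR code: the function name `classify_isoform_id`, the category labels, and both regular expressions are assumptions based on the three example IDs, and contig names that contain underscores would need extra handling.

```python
import re

# Assumption: annotated transcripts carry Ensembl-style IDs (ENST...), as in GENCODE.
TRANSCRIPT_RE = re.compile(r"^ENST\d+(\.\d+)?$")
# Assumption: a genome-location suffix looks like chr21:33903000.
LOCUS_RE = re.compile(r"^[^:]+:\d+$")


def classify_isoform_id(isoform_id):
    """Return (category, source, gene_or_locus) for one isoform ID.

    Hypothetical helper: splits on the LAST underscore, so it assumes the
    gene ID or locus itself contains no underscore.
    """
    source, sep, suffix = isoform_id.rpartition("_")
    if not sep:
        raise ValueError(f"no underscore in ID: {isoform_id!r}")
    if LOCUS_RE.match(suffix):
        category = "novel_locus"                   # read name + genome location
    elif TRANSCRIPT_RE.match(source):
        category = "annotated_transcript"          # transcript ID + gene ID
    else:
        category = "novel_isoform_of_known_gene"   # read name + gene ID
    return category, source, suffix


if __name__ == "__main__":
    for example in (
        "ENST00000290299.7_ENSG00000241837.7",
        "17a55928-3b95-4d82-8c1f-9d8187389cd5_ENSG00000241837.7",
        "87ab9469-20d0-44e9-8d61-6182f919aeb9_chr21:33903000",
    ):
        print(example, "->", classify_isoform_id(example)[0])
```

The test-set style names shown further down (with `aligned` in them) do not follow this convention, so a classifier like this should not be trusted on them.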
It looks like you are using our test set, which for some reason has read names that look like this:
>ENST00000225792_967_aligned_1879_F_1_2289_9
>ENST00000451605_0_aligned_1880_R_6_516_43
>ENST00000540698_734_aligned_1881_R_8_1302_106
>ENST00000581230_1169_aligned_1882_F_0_1734_3
I understand that is very confusing, I will see if we can do better than that.
I am assuming this answered your question. Please reopen this ticket if you need more information.
Great, thanks for that! Weirdly I am not getting an “isoforms.fa” file along the way so I’ll need to look into that. I installed whatever version of Flair I’m using within the last 2 months. Does Flair provide any information other than counts and a randomized ID for unannotated transcripts? I’m hoping to convert novel isoform IDs to intron strings for comparison with novel isoforms from another long read splicing analysis tool, so I need to know how to get the information about novel isoforms out of Flair.
On Fri, Jul 22, 2022 at 1:35 PM Jeltje @.***> wrote:
Here is the naming convention used by Flair if you give it a GENCODE annotation:
- If the isoform matches a transcript, the name of that transcript and its gene are used, e.g. ENST00000290299.7_ENSG00000241837.7
- If the isoform has a splice pattern in common with a gene but doesn't match any of the known transcripts, a randomly selected read name is used in combination with the gene ID: 17a55928-3b95-4d82-8c1f-9d8187389cd5_ENSG00000241837.7
- Lastly, if an isoform does not match a gene, a randomly selected read name is combined with the genome location: 87ab9469-20d0-44e9-8d61-6182f919aeb9_chr21:33903000
I don't think the current version outputs transcripts with the word aligned in them, but let me know if you see different.
It should give you .bed file outputs. If you don't see these, please open a new ticket.
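If the .bed output is standard 12-column BED (an assumption here, the thread does not spell out the format), the intron coordinates of each isoform can be derived from the block columns. The sketch below is not part of FLAIR; the function name `bed12_to_intron_string` and the `chrom:strand:start-end,...` string layout are made up for illustration and should be adapted to whatever the other tool emits.

```python
def bed12_to_intron_string(bed_line):
    """Convert one BED12 line into (isoform name, intron string).

    Introns are the gaps between consecutive blocks (exons). Coordinates are
    kept in BED convention: 0-based start, end-exclusive.
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated BED columns")

    chrom, chrom_start = fields[0], int(fields[1])
    name, strand = fields[3], fields[5]
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]

    introns = []
    for i in range(len(block_starts) - 1):
        intron_start = chrom_start + block_starts[i] + block_sizes[i]
        intron_end = chrom_start + block_starts[i + 1]
        introns.append(f"{intron_start}-{intron_end}")

    # Hypothetical string layout; single-exon isoforms yield an empty intron list.
    return name, f"{chrom}:{strand}:" + ",".join(introns)


if __name__ == "__main__":
    # Made-up example record with three exons, not real FLAIR output.
    line = "chr21\t1000\t2000\tread1_ENSG1\t0\t+\t1000\t2000\t0\t3\t100,200,300,\t0,400,700,"
    print(bed12_to_intron_string(line))
```

Comparing intron strings rather than IDs sidesteps the randomly selected read names, since two tools will not agree on names for the same novel isoform.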
Hi there,
I was wondering how to interpret/decode the ids in the count table. For example, here are the first 6 transcripts listed:
Is there documentation or an explanation for what each underscore-separated string represents? And for measuring the abundance of a transcript (e.g. ENST00000001008), should I sum the counts for each of the ids beginning with "ENST00000001008"?
Lastly, does FLAIR quantify unannotated transcripts, and if so how are they assigned ids?
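For reference, the summing described in the question would look mechanically like the sketch below. Whether summing is the right way to measure abundance is not settled in this thread, so treat this purely as an illustration. It assumes the count table is tab-separated with a header row, the ID in the first column, and one numeric column per sample; the function name `sum_counts_by_prefix` and the example rows are invented.

```python
import csv
import io
from collections import defaultdict


def sum_counts_by_prefix(tsv_text, delimiter="_"):
    """Sum per-sample counts over all IDs that share the same first field.

    Assumes: tab-separated text, header row, ID in column 1, numeric counts after.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    samples = header[1:]
    totals = defaultdict(lambda: [0.0] * len(samples))
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Everything before the first underscore is used as the grouping key.
        key = row[0].split(delimiter, 1)[0]
        for i, value in enumerate(row[1:]):
            totals[key][i] += float(value)
    return samples, dict(totals)


if __name__ == "__main__":
    # Invented rows for illustration only.
    table = (
        "ids\tsampleA\n"
        "ENST00000001008_a\t3\n"
        "ENST00000001008_b\t2\n"
        "ENST00000002_c\t5\n"
    )
    print(sum_counts_by_prefix(table))
```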