BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA
Other
205 stars 71 forks source link

predictProductivity seems incompatible with rvolden/Mandalorion-derived GTF #240

Closed ajritter8 closed 1 year ago

ajritter8 commented 1 year ago

Command: predictProductivity --longestORF -i Data/Mandalorion/Isoforms.filtered.clean.bed -f RESOURCES/GRCh38.primary_assembly.genome.fa -g Data/Mandalorion/Isoforms.filtered.clean.gtf > flair_isoform_productivity_prediction_longestORF.bed

I installed FLAIR using bioconda.

  1. bioconda (e.g. conda create -n flair -c conda-forge -c bioconda flair)
***** WARNING: File /tmp/pybedtools.bxcv0mfq.tmp has inconsistent naming convention for record:
chr1    68488757    68489034    Isoform10016_4  1000    -

***** WARNING: File /tmp/pybedtools.bxcv0mfq.tmp has inconsistent naming convention for record:
chr1    68488757    68489034    Isoform10016_4  1000    -

Other notes: When I examine the output, all isoforms are marked NGO. I also tried using gencode.v41.primary_assembly.gtf in place of the Mandalorion GTF just to see what happened and the output yielded a variety of productivity predictions more in line with what I would expect. However, I doubt the output is correct given that I used an annotation that doesn't reflect the isoforms BED file.

The Mandalorion GTF does look different from a more standard GTF, here are the first 10 lines of it for example: GL000195.1 hg38 transcript 46159 86663 . - . transcript_id "Isoform10_10"; gene_id "Isoform10_10.gene"; gene_name "Isoform10_10" GL000195.1 hg38 exon 46159 46801 . - . transcript_id "Isoform10_10"; gene_id "Isoform10_10.gene"; gene_name "Isoform10_10" GL000195.1 hg38 exon 48955 49119 . - . transcript_id "Isoform10_10"; gene_id "Isoform10_10.gene"; gene_name "Isoform10_10" GL000195.1 hg38 exon 74121 74180 . - . transcript_id "Isoform10_10"; gene_id "Isoform10_10.gene"; gene_name "Isoform10_10" GL000195.1 hg38 exon 86518 86663 . - . transcript_id "Isoform10_10"; gene_id "Isoform10_10.gene"; gene_name "Isoform10_10" GL000195.1 hg38 transcript 45413 86663 . - . transcript_id "Isoform11_15"; gene_id "Isoform11_15.gene"; gene_name "Isoform11_15" GL000195.1 hg38 exon 45413 46801 . - . transcript_id "Isoform11_15"; gene_id "Isoform11_15.gene"; gene_name "Isoform11_15" GL000195.1 hg38 exon 48955 49119 . - . transcript_id "Isoform11_15"; gene_id "Isoform11_15.gene"; gene_name "Isoform11_15" GL000195.1 hg38 exon 86518 86663 . - . transcript_id "Isoform11_15"; gene_id "Isoform11_15.gene"; gene_name "Isoform11_15" GL000195.1 hg38 transcript 45941 86666 . - . transcript_id "Isoform12_38"; gene_id "Isoform12_38.gene"; gene_name "Isoform12_38"

Can we think of anything to do to make the Mandalorion GTF compatible with FLAIR's predictProductivity module?

Jeltje commented 1 year ago

I think you may need to get rid of the gene name part of that GTF, for example like so cut -f1,2 -d';' Isoforms.filtered.clean.gtf > Isoforms.filtered.cleaner.gtf Alternatively you might be able to run gtf_to_bed and then bed_to_gtf.py

The inconsistent naming issue is usually a bedtools version problem, which you shouldn't have if you installed using conda. What do you get from bedtools --version ?

ajritter8 commented 1 year ago

I tried getting rid of the gene name part of the GTF with both methods you suggested and re-ran predictProductivity; unfortunately all isoforms were called NGO just like I encountered initially. My bedtools version is v2.30.0.

Thanks,

Alex

On Tue, Feb 7, 2023 at 12:10 PM Jeltje @.***> wrote:

I think you may need to get rid of the gene name part of that GTF, for example like so cut -f1,2 -d';' Isoforms.filtered.clean.gtf > Isoforms.filtered.cleaner.gtf Alternatively you might be able to run gtf_to_bed and then bed_to_gtf.py https://raw.githubusercontent.com/BrooksLabUCSC/flair/master/src/flair/bed_to_gtf.py

The inconsistent naming issue is usually a bedtools version problem, which you shouldn't have if you installed using conda. What do you get from bedtools --version ?

— Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/240#issuecomment-1421378389, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALT2F36T5D5JQBKGEV3U7WLWWKT5FANCNFSM6AAAAAAUG3Q4N4 . You are receiving this because you authored the thread.Message ID: @.***>

Jeltje commented 1 year ago

I think I see the problem.

PredictProductivity tries to find ORFs in the isoform bed file using start codon locations in the annotation file. It looks like your Mandalorion file does not have annotated coding sequences, so it can't be used in this way.

I'll add a warning to the program if no start codons are found at all in the annotation.

ajritter8 commented 1 year ago

That makes sense. In that case, if I provide an isoform BED file but with a standard GENCODE GTF file that doesn't correspond directly to the isoform BED file, will it assess the productivity of the isoforms according to how they overlap with annotated transcript CDS? I've tried this and did get a variety of productivity classes for the isoforms but I'm not sure whether or not the results are accurate.

Best,

Alex

On Tue, Feb 7, 2023 at 6:39 PM Jeltje @.***> wrote:

I think I see the problem.

PredictProductivity tries to find ORFs in the isoform bed file using start codon locations in the annotation file. It looks like your Mandalorion file does not have annotated coding sequences, so it can't be used in this way.

I'll add a warning to the program if no start codons are found at all in the annotation.

— Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/240#issuecomment-1421896896, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALT2F37XRQAPZRTUPB2P53DWWMBORANCNFSM6AAAAAAUG3Q4N4 . You are receiving this because you authored the thread.Message ID: @.***>

Jeltje commented 1 year ago

From looking at the code the annotation is only used for finding the start codons. For each isoform the program then checks if it overlaps any start codons. If it does not (as in your case), it immediatly returns NGO. If it does find a start codon, it uses that to phase the isoform sequence and checks for in-frame stop codons. If there are multiple starts, it does this for each one of them.

So if your isoforms overlap a start codon in the annotation this should work fine, as long as they're not 5' extensions of existing ORFs or alternative (unknown) first coding exons.

ajritter8 commented 1 year ago

Great, that’s very helpful. Thank you!

On Thu, Feb 9, 2023 at 12:33 PM Jeltje @.***> wrote:

From looking at the code the annotation is only used for finding the start codons. For each isoform the program then checks if it overlaps any start codons. If it does not (as in your case), it immediatly returns NGO. If it does find a start codon, it uses that to phase the isoform sequence and checks for in-frame stop codons. If there are multiple starts, it does this for each one of them.

So if your isoforms overlap a start codon in the annotation this should work fine, as long as they're not 5' extensions of existing ORFs or alternative (unknown) first coding exons.

— Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/240#issuecomment-1424776586, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALT2F3YAYT6BV4OSPARAJ7LWWVIB3ANCNFSM6AAAAAAUG3Q4N4 . You are receiving this because you authored the thread.Message ID: @.***>