Translate predictProductivity output to protein

andreas-wilm commented 9 months ago

I have read the paper (https://doi.org/10.1038/s41467-020-15171-6)
and the manual (https://flair.readthedocs.io/en/latest/) and I still have a question about: predictProductivity

Hi there, thanks for making Flair available!

I would like to translate predictions made by Flair to protein. For this I went through the entire flow, used predictProductivity as "ORF-finding script" (as also mentioned in https://github.com/BrooksLabUCSC/flair/issues/130#issuecomment-689738625) and focused only on the productive (PRO) entries. I'm unsure how to interpret the output, though. When I clip sequences to the relative positions given in columns 7 and 8 ("thickStart" and "thickEnd"), many sequences start with a start codon, but many don't. I also often see cases where the derived start position lies beyond the end of the sequence found in the collapsed isoform fasta. I've also tried to interpret the listed blocks, with no success. So I guess I must be doing something fundamentally wrong.

To summarize this: What's the best way to translate the PRO output from predictProductivity to protein?

Many thanks

Jeltje commented 9 months ago

It's probably some issue with the predictProductivity code. To debug it I'll need some examples of predictions going wrong. Would you be able to attach some? I'll need the annotation, the isoforms bed, and the predictProductivity output file, preferably for only a few example regions.

andreas-wilm commented 9 months ago

Thanks for replying @Jeltje. I picked 10 entries at random and attached the corresponding files here.

While putting this together I started wondering if I interpreted the output correctly. Can I just take the bed output of predictProductivity, extract the corresponding isoforms from the isoforms.fa file and use them as-is, or do I need to apply the bed coordinates (the blocks, for example) to extract the correct blocks from the sequence in isoforms.fa?

Many thanks, Andreas

github_flair_299.tgz

EDIT: This is based on gencode.v41.annotation.gtf and GRCh38's primary_assembly

Jeltje commented 9 months ago

You should be able to use the isoforms.fa file as-is. If you want to translate those sequences into open reading frames, there are a number of tools available online to do so.

The bed files are in genome coordinates. If you take the file you sent me and upload it to the UCSC genome browser (look for My Data at the top, then select User Tracks), you can see that the start codons all match up with either an existing start codon or an ATG in the genome.

All of the genes are single exon and overlap existing annotations, so they are probably artefacts. To avoid these, use the --no_redundant longest setting in flair collapse. This obviously only works well if you have good coverage of the whole gene.

It ought to be trivial to take a bed file and a genome file and extract the peptides indicated by the CDS coordinates, but I can't seem to quickly find a tool that does this.
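For illustration, here is a minimal Python sketch of that idea. It is not part of Flair and makes several assumptions: the predictProductivity output is standard 12-column BED, thickStart/thickEnd (columns 7 and 8) mark the CDS in genome coordinates, and the genome has already been loaded into a dict mapping chromosome name to sequence. The function names `cds_from_bed12` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def cds_from_bed12(bed_line, genome):
    """Return the spliced CDS (5'->3' on the transcript strand) for one BED12 line.

    Assumption: `genome` is a dict {chrom: sequence} and thickStart/thickEnd
    are 0-based, half-open genome coordinates like the rest of the BED line.
    """
    f = bed_line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = f[0], int(f[1]), f[5]
    thick_start, thick_end = int(f[6]), int(f[7])
    # BED block lists often carry a trailing comma, hence the `if x` filter.
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]

    pieces = []
    for size, offset in zip(sizes, offsets):
        # Clip each exon block to the thick (coding) interval.
        start = max(chrom_start + offset, thick_start)
        end = min(chrom_start + offset + size, thick_end)
        if start < end:
            pieces.append(genome[chrom][start:end])

    cds = "".join(pieces).upper()
    if strand == "-":
        cds = cds.translate(COMPLEMENT)[::-1]  # reverse complement
    return cds


def translate(cds):
    """Translate a CDS, stopping at the first stop codon; unknown codons become X."""
    residues = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)
```

Calling `translate(cds_from_bed12(line, genome))` for each PRO line would then give one peptide per isoform. Because the coordinates are applied to the genome, this does not use the positions as offsets into isoforms.fa.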

If this answers your question, please close this ticket.

andreas-wilm commented 9 months ago

Thanks for the answer!

BrooksLabUCSC / flair

Translate predictProductivity output to protein #299