Open dasmoth opened 10 years ago
@dasmoth Thanks for highlighting the issue.
I think both of the solutions you propose are very reasonable. I'm happy to change the code and disable the translation for the transcripts with the cds_start_NF tag if you let me know when you update the GENCODE files.
Filtering out cds_start_NF is done now. There's a suitable human gene set at http://www.biodalliance.org/datasets/GRCh37/gencode.v19.annotation.bb -- I'll make that the default for the public browsers as well.
I'll leave this issue open, though, while we consider using bigBeds with phase information baked in.
The master branch now has experimental support for bigBeds with explicit exon frames, encoded in the same way as the UCSC browser "genePredExt" tables.
There's an example of such a bigBed at http://www.biodalliance.org/datasets/gencode19-explicitFrames.bb for anyone who wants to test this out.
@dasmoth Is there any documentation on how to generate and configure these BED files? Is there a plain BED file corresponding to the .bb file, so I can see how the features are formatted? Is there a CodePen or similar that configures the Dalliance source properly?
Biodalliance should auto-detect these.
The format matches the UCSC bigGenePred format: https://genome.ucsc.edu/goldenpath/help/bigGenePred.html
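For anyone inspecting such a file as plain text first, here is a minimal Python sketch of reading the per-exon frames from one tab-separated line. It assumes the column order given on the bigGenePred page linked above, with `exonFrames` as the 16th column and `-1` marking exons without coding sequence. The example line is made up for illustration and the function name is hypothetical; this is not how Biodalliance itself parses the data.

```python
def parse_exon_frames(bed_line: str, frames_column: int = 15) -> list[int]:
    """Return the per-exon frames from one bigGenePred-style text line.

    Assumes tab-separated fields with exonFrames at zero-based index 15.
    UCSC-style lists often end with a trailing comma, so empty items are dropped.
    """
    fields = bed_line.rstrip("\n").split("\t")
    raw_frames = fields[frames_column]
    return [int(item) for item in raw_frames.split(",") if item != ""]


# Made-up three-exon transcript; the first exon is entirely untranslated.
example_line = "\t".join([
    "chr1", "1000", "2000", "tx1", "0", "+",
    "1400", "1899", "0", "3",
    "100,200,300,", "0,400,700,",
    "gene1", "cmpl", "cmpl",
    "-1,0,2,",
    "coding", "gene1", "none", "none",
])

frames = parse_exon_frames(example_line)
```

With the made-up line above, `frames` holds one integer per exon, in the same order as the block sizes and block starts.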
(CC @ymen. Thanks to @timjph for initially bringing this to my attention).
The new translation code is sometimes generating incorrect (or at least improbable) translations in cases where gene annotation has been truncated. Here's an example:
In this case, ENST00000415100.1 is 5' truncated and the start of the annotated coding sequence isn't in-frame, and therefore the incorrect translation is shown.
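To illustrate the general problem with a toy example (a made-up sequence, not the actual ENST00000415100.1 data): if a 5'-truncated coding sequence is read from its first annotated base as though that base started a codon, every codon is shifted. The Python sketch below uses a hypothetical helper, `codons`, to show the difference.

```python
def codons(cds: str, offset: int = 0) -> list[str]:
    """Split a coding sequence into complete codons after skipping `offset` bases."""
    usable = cds[offset:]
    complete_length = len(usable) - len(usable) % 3
    return [usable[i:i + 3] for i in range(0, complete_length, 3)]


# Made-up full-length CDS: ATG GCC AAA TGA
full_cds = "ATGGCCAAATGA"

# Pretend the annotation is 5' truncated and begins 4 bases into the CDS,
# i.e. in the middle of the second codon.
truncated_cds = full_cds[4:]

# Reading from the first annotated base assumes it starts a codon (wrong here).
naive = codons(truncated_cds)

# Skipping 2 bases realigns the reading frame with the real codon boundaries.
in_frame = codons(truncated_cds, offset=2)
```

Here `naive` starts with `CCA`, which is not a codon of the original sequence, while `in_frame` recovers `AAA` and `TGA`. Without explicit phase information there is no way to know that the offset should be 2.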
Quick solution: for gene bigBed files created by GENCODE, any transcripts with this issue should have the transcript attribute "cds_start_NF". I think it would be worth disabling translations for such transcripts for now. (Related issue: there are some slightly old GENCODE bigBed files on the biodalliance.org site, which don't include transcript attributes. I'll update these soon.)
Longer term solution: prefer formats with explicit per-exon phase information. GFF2/GTF has this, so switching to GTF+Tabix as our preferred gene format would be one option. However, bigBed files are compact and very convenient! So I'm going to suggest adding an extra optional column containing a comma-separated list of per-exon phases. The UCSC GENCODE tables (which are otherwise quite similar in format to BED12 files) include just such a column.
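The quick solution could be sketched roughly as follows. This is a Python illustration rather than the Biodalliance code, and it assumes the transcript's tags have already been extracted into a collection of strings; how a given bigBed file stores them is not covered here. The names `SKIP_TRANSLATION_TAGS` and `should_translate` are hypothetical.

```python
# Tags that mark a transcript whose CDS start is unreliable for translation.
# Only cds_start_NF is discussed in this issue; others could be added later.
SKIP_TRANSLATION_TAGS = {"cds_start_NF"}


def should_translate(transcript_tags) -> bool:
    """Return False if any tag indicates that translation should be disabled."""
    return SKIP_TRANSLATION_TAGS.isdisjoint(transcript_tags)
```

A renderer would call `should_translate(tags)` before drawing amino acids and fall back to showing the untranslated exons when it returns `False`.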