Compress all human gene structures into the size of a smartphone photo

This lets you create a small file with gene structures for all human protein-coding transcripts. That unlocks biological insight, such as uncommon but important transcripts of ACE2, a gene key to SARS-CoV-2 (COVID-19) infection.

Introduction

Previously, Ideogram provided only the single canonical transcript of each protein-coding gene. Among a gene's transcripts, its canonical transcript generally has the highest coverage of conserved exons, the highest expression, and the longest coding sequence, and is represented across key bioinformatics resources. The other, non-canonical transcripts are often called "splice variants" or "isoforms".

The additional transcripts represent how the gene can be alternatively spliced. That is, they represent how one gene can be spliced into any of several different mRNAs -- how one gene can encode many slightly different proteins. These differences in proteins derive from different arrangements of transcript subparts (coding sequences / exons, introns, and UTRs). Different transcripts of the same gene often have different expression patterns across tissues and cell types.

Results

This makes almost 103k transcripts available, up from almost 20k previously. That's about 5 transcripts per protein-coding gene, on average.

Compression

The expanded dataset is heavily compressed, using new and custom space-saving optimizations. Before the efficiency gains in this work, all human protein-coding transcripts were 4.8 MB when compressed via gzip, and 15 MB raw. Now, all such transcripts losslessly compress to 2.5 MB gzipped and 6.0 MB raw. That's almost 2x smaller gzipped, and 2.5x smaller raw.

For comparison, the previous canonical-only transcript dataset was 1.2 MB gzipped, 2.9 MB raw. So this work gives 5x more data for only 2x larger size.
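The ratios quoted above can be checked with a few lines of arithmetic. This is a minimal sketch in Python that uses only the rounded figures stated in this description; the variable names are illustrative, and the gene count is assumed to equal the roughly 20k canonical transcripts (one per protein-coding gene).

```python
# Figures as quoted in this description. "Almost 103k" and "almost 20k"
# are rounded, so every result below is approximate.
transcripts_all = 103_000        # all protein-coding transcripts
transcripts_canonical = 20_000   # assumed: one canonical transcript per gene

# File sizes in MB
old_all = {"gzip": 4.8, "raw": 15.0}    # all transcripts, before this work
new_all = {"gzip": 2.5, "raw": 6.0}     # all transcripts, after this work
canonical = {"gzip": 1.2, "raw": 2.9}   # previous canonical-only dataset


def ratio(numerator, denominator):
    """Return numerator / denominator, rounded to two decimals."""
    return round(numerator / denominator, 2)


# Average transcripts per protein-coding gene
per_gene = ratio(transcripts_all, transcripts_canonical)

# How much smaller the full dataset became
shrink = {fmt: ratio(old_all[fmt], new_all[fmt]) for fmt in old_all}

# How much larger the new full dataset is than the old canonical-only one
growth = {fmt: ratio(new_all[fmt], canonical[fmt]) for fmt in new_all}

print("transcripts per gene:", per_gene)
print("size reduction:", shrink)
print("growth vs. canonical-only:", growth)
```

With these inputs, the sketch gives about 5.15 transcripts per gene, a reduction of about 1.9x gzipped and 2.5x raw, and a dataset about 2.1x the size of the canonical-only one in both formats.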

In other words, this compression packs the full human protein-coding transcriptome into the size of a smartphone photo.

Impact

Uncommon transcripts can be important! This data is needed to see them in Ideogram.

For example, a non-canonical transcript of ACE2 encodes "short ACE2". The canonical transcript of ACE2, known as "long ACE2", encodes a protein that SARS-CoV-2 -- the COVID-19 virus -- uses to enter cells. By contrast, the non-canonical short ACE2 transcript is:

predominantly expressed in differentiated airway epithelial cells, especially in cells of the upper airways, which are the main site of SARS-CoV-2 infection. Our data suggest that the transcript encodes a 52-kDa protein that can be detected in airway epithelial cells, although, because it lacks a signal peptide, it may be a relatively unstable protein. We demonstrate that it is this isoform, rather than full-length ACE2, that is IFN regulated and inducible on RV infection. However, in conditions of IFN suppression, as observed during SARS-CoV-2 infection, or IFN-β deficiency, as in asthma, short ACE2 is not induced to the same degree as normal. Although the function of short ACE2 is unknown, its regulation by IFN suggests that it may play an essential role in innate antiviral defense mechanisms in the airways.

Another example of the importance of non-canonical transcripts is TP53, a gene mutated in most cancers. Which TP53 isoform is expressed helps assess prognosis of different cancer types:

p53 isoforms cannot be categorized into oncogenic or tumor-suppressor classes, since their biological activities and thus their prognostic value are associated with the cell context. Indeed, Δ133p53α expression is associated with cancer formation and progression in cholangiocarcinoma, as well as in colon and gastric cancers, while Δ133p53α is associated with a lower risk of cancer recurrence and death in mutant p53 serous ovarian cancer. Similarly, p53β is associated with clinicopathological markers of good prognosis in colon cancer and AML, while p53β is associated with worse recurrence-free survival in serous ovarian cancer patients exhibiting functionally active p53.
