When aggregating by gene, define_boundaries() has many NA gene_ids

bheavner commented 7 years ago

library(genetable)
library(tidyverse)

path <- "/projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz"

# import the gtf file to a tidy data frame (a tibble)
# this is slow
gtf <- import_gencode(path)

# filter for features == "gene"
genes <- filter_gencode(gtf, featurearg = "gene")

# define the boundaries of the feature of interest
gene_bounds <- define_boundaries(genes, "gene_id")

dim(genes)
#[1] 57820    30

dim(gene_bounds)
[1] 48013    10

glimpse(gene_bounds[gene_bounds$merge_count > 1, ])
# Observations: 25
# Variables: 10
# $ chr             <chr> "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15"...
# $ strand          <chr> "-", "-", "-", "+", "-", "+", "+", "+", "-", "-", "-", "-", "-", "+", "+", "+", "-", "-", "-", "-", "+", "-", "-", "+", "+"
# $ gene_id         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
# $ gene_name       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
# $ agg_start       <int> 134901, 3020991, 324372, 65017, 144202, 1080164, 302918, 1765397, 507258, 200771, 152999, 73185, 19443325, 19119515, 2008886...
# $ agg_end         <int> 249206994, 242524490, 197923720, 190981630, 180878968, 170639932, 158384388, 146078907, 141150148, 135473179, 134856693, 133...
# $ source          <chr> "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "ENSEMBL", "EN...
# $ transcript_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
# $ merge_count     <int> 911, 759, 597, 467, 476, 487, 498, 401, 409, 412, 434, 470, 251, 411, 449, 331, 416, 208, 338, 264, 115, 169, 467, 55, 37
# $ agg_size        <int> 249072093, 239503499, 197599348, 190916613, 180734766, 169559768, 158081470, 144313510, 140642890, 135272408, 134703694, 133...

for comparison, Deepti's table:

genedefs_20170503 <- readr::read_tsv("/projects/topmed/gac_data/aggregation_units/gene_based/gencode_v19/gencode_v19_genes.txt")

dim(genedefs_20170503)
#[1] 57820     7

genedefs_20170503[is.na(genedefs_20170503$gene_id),]
# A tibble: 0 × 7

bheavner commented 7 years ago

Looks like it arises earlier than define_boundaries():

sum(is.na(genes$gene_id))
#[1] 9832

bheavner commented 7 years ago

but it looks like there are na's in the imported gtf file, too:

sum(is.na(gtf$gene_id[gtf$feature == "gene"]))
#[1] 9832

Do they exist in /projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz ?

Need to parse args to figure out.. but..

raw <- genetable:::.read_gtf(path)
intermediate <- genetable:::.pull_required(raw)
sum(is.na(intermediate$gene_id))
#[1] 10863

sum(is.na(intermediate$gene_id[intermediate$feature == "gene"]))
#[1] 9832

which suggests it may be due to the regular expression in genetable:::.pull_required(), or perhaps somehow something in genetable:::.read_gtf() (but that seems less likely)

bheavner commented 7 years ago

idea: look closely at intermediate$start for an entry with is.na(intermediate$gene_id) and intermediate$feature == "gene":

intermediate$start[intermediate$feature == "gene" & is.na(intermediate$gene_id)]
[1]    134901    157784    450820    693613    738532    818043    861264   1102484   1103243   1104385   1340841   1497726   1510355   1515136

So perhaps a problem when there's multiple matches?

raw[raw$start ==  134901, ]
# A tibble: 4 × 9
  seqname  source    feature  start    end score strand frame
    <chr>   <chr>      <chr>  <int>  <int> <chr>  <chr> <chr>
1    chr1 ENSEMBL       gene 134901 139379     .      -     .
2    chr1 ENSEMBL transcript 134901 139379     .      -     .
3    chr1 ENSEMBL       exon 134901 135802     .      -     .
4    chr1 ENSEMBL        UTR 134901 135802     .      -     .
# ... with 1 more variables: attribute <chr>
> View(raw[raw$start ==  134901, ])
> raw$attribute[raw$start ==  134901]
[1] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENSG00000237683.5\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1\"; level 3;"                                                                                                                             
[2] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENST00000423372.3\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1-201\"; level 3; protein_id \"ENSP00000473460.1\"; tag \"basic\"; tag \"appris_principal\";"                                              
[3] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENST00000423372.3\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1-201\"; exon_number 2; exon_id \"ENSE00002314092.1\"; level 3; protein_id \"ENSP00000473460.1\"; tag \"basic\"; tag \"appris_principal\";"
[4] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENST00000423372.3\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1-201\"; level 3; protein_id \"ENSP00000473460.1\"; tag \"basic\"; tag \"appris_principal\";"

Note that raw returns a 4x9 tibble; and intermediate also has a tibble - but with the first gene_ID == NA:

intermediate[intermediate$start == 134901,]
A tibble: 4 × 18
  seqname  source    feature  start    end score strand frame           gene_id     transcript_id      gene_type  gene_name transcript_type
    <chr>   <chr>      <chr>  <int>  <int> <chr>  <chr> <chr>             <chr>             <chr>          <chr>      <chr>           <chr>
1    chr1 ENSEMBL       gene 134901 139379     .      -     .              <NA>              <NA>           <NA>       <NA>            <NA>
2    chr1 ENSEMBL transcript 134901 139379     .      -     . ENSG00000237683.5 ENST00000423372.3 protein_coding AL627309.1  protein_coding
3    chr1 ENSEMBL       exon 134901 135802     .      -     . ENSG00000237683.5 ENST00000423372.3 protein_coding AL627309.1  protein_coding
4    chr1 ENSEMBL        UTR 134901 135802     .      -     . ENSG00000237683.5 ENST00000423372.3 protein_coding AL627309.1  protein_coding

bheavner commented 7 years ago

Looks like the space at the end of "level ([123]); ", # required in gtf on line 86 of https://github.com/UW-GAC/genetable/blob/develop/R/import-gencode.r is causing the mismatch (level 3 is the end of the line, no space after).

I don't see a problem with just deleting that space - I don't think .parse_optional() will be affected.

bheavner commented 7 years ago

troubleshooting:

sample = as_data_frame('gene_id \"ENSG00000237683.5\"; transcript_id \"ENSG00000237683.5\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1\"; level 3;')

rx <- paste('gene_id "(.+?)"; ', # required gene_id field in gtf
                'transcript_id "(.+?)"; ', # required in gtf
                'gene_type "(.+?)";', # required in gtf
                ".*", # gene_status removed from newer releases
                'gene_name "(.+?)"; ', # required in gtf
                'transcript_type "(.+?)";', # required in gtf
                ".*", # transcript_status removed from newer releases
                'transcript_name "(.+?)"; ', # required in gtf
                "(?:exon_number (.+?); )?", # exon_number (not in all lines)
                '(?:exon_id "(.+?)"; )?', # exon_id (not in all lines)
                "level ([123]); ", # required in gtf
                "(.*)$", # additional optional fields
sep = "") # join regex string without spaces

stringr::str_view(sample$value, rx)

UW-GAC / genetable

When aggregating by gene, define_boundaries() has many NA gene_ids #22