bheavner closed this issue 7 years ago
Looks like it arises earlier than define_boundaries():
sum(is.na(genes$gene_id))
#[1] 9832
but it looks like there are NAs in the imported gtf file, too:
sum(is.na(gtf$gene_id[gtf$feature == "gene"]))
#[1] 9832
Do they exist in /projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz?
I'd need to parse the args to figure that out, but:
raw <- genetable:::.read_gtf(path)
intermediate <- genetable:::.pull_required(raw)
sum(is.na(intermediate$gene_id))
#[1] 10863
sum(is.na(intermediate$gene_id[intermediate$feature == "gene"]))
#[1] 9832
which suggests it may be due to the regular expression in genetable:::.pull_required(), or perhaps something in genetable:::.read_gtf() (but that seems less likely).
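One way to answer the earlier question about the raw file directly, without going through genetable at all, might be to scan the GTF lines in base R and count gene rows whose attribute column has no gene_id. This is only a rough sketch: count_gene_rows_without_id() is a hypothetical helper (not part of genetable), and it assumes the standard 9-column, tab-separated GTF layout with "#" comment lines.

```r
# Hypothetical helper (not part of genetable): count "gene" rows in a
# (possibly gzipped) GTF file whose attribute column lacks a gene_id entry.
count_gene_rows_without_id <- function(path) {
  con <- gzfile(path, "rt")            # gzfile() also reads uncompressed files
  on.exit(close(con))
  lines <- readLines(con)
  lines <- lines[!startsWith(lines, "#")]          # drop header/comment lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  feature <- vapply(fields, `[`, character(1), 3)    # column 3: feature
  attribute <- vapply(fields, `[`, character(1), 9)  # column 9: attribute
  has_gene_id <- grepl('gene_id "', attribute, fixed = TRUE)
  sum(feature == "gene" & !has_gene_id, na.rm = TRUE)
}
```

If this returns 0 for the Gencode file, the NAs are being introduced during import rather than being present in the source data.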
idea: look closely at intermediate$start for an entry with is.na(intermediate$gene_id) and intermediate$feature == "gene":
intermediate$start[intermediate$feature == "gene" & is.na(intermediate$gene_id)]
[1] 134901 157784 450820 693613 738532 818043 861264 1102484 1103243 1104385 1340841 1497726 1510355 1515136
So perhaps the problem arises when there are multiple matches?
raw[raw$start == 134901, ]
# A tibble: 4 × 9
seqname source feature start end score strand frame
<chr> <chr> <chr> <int> <int> <chr> <chr> <chr>
1 chr1 ENSEMBL gene 134901 139379 . - .
2 chr1 ENSEMBL transcript 134901 139379 . - .
3 chr1 ENSEMBL exon 134901 135802 . - .
4 chr1 ENSEMBL UTR 134901 135802 . - .
# ... with 1 more variables: attribute <chr>
> View(raw[raw$start == 134901, ])
> raw$attribute[raw$start == 134901]
[1] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENSG00000237683.5\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1\"; level 3;"
[2] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENST00000423372.3\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1-201\"; level 3; protein_id \"ENSP00000473460.1\"; tag \"basic\"; tag \"appris_principal\";"
[3] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENST00000423372.3\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1-201\"; exon_number 2; exon_id \"ENSE00002314092.1\"; level 3; protein_id \"ENSP00000473460.1\"; tag \"basic\"; tag \"appris_principal\";"
[4] "gene_id \"ENSG00000237683.5\"; transcript_id \"ENST00000423372.3\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1-201\"; level 3; protein_id \"ENSP00000473460.1\"; tag \"basic\"; tag \"appris_principal\";"
Note that raw returns a 4 × 9 tibble, and intermediate also returns a 4-row tibble, but with gene_id == NA in the first row:
intermediate[intermediate$start == 134901,]
A tibble: 4 × 18
seqname source feature start end score strand frame gene_id transcript_id gene_type gene_name transcript_type
<chr> <chr> <chr> <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 chr1 ENSEMBL gene 134901 139379 . - . <NA> <NA> <NA> <NA> <NA>
2 chr1 ENSEMBL transcript 134901 139379 . - . ENSG00000237683.5 ENST00000423372.3 protein_coding AL627309.1 protein_coding
3 chr1 ENSEMBL exon 134901 135802 . - . ENSG00000237683.5 ENST00000423372.3 protein_coding AL627309.1 protein_coding
4 chr1 ENSEMBL UTR 134901 135802 . - . ENSG00000237683.5 ENST00000423372.3 protein_coding AL627309.1 protein_coding
Looks like the trailing space in "level ([123]); " (the "# required in gtf" piece on line 86 of https://github.com/UW-GAC/genetable/blob/develop/R/import-gencode.r) is causing the mismatch: in the gene row, "level 3;" is the end of the attribute string, with no space after it.
I don't see a problem with just deleting that space - I don't think .parse_optional() will be affected.
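A quick base R check of that idea, as a sketch rather than the actual fix: build_rx() is a hypothetical helper that assembles the same pattern pieces as in import-gencode.r with only the "level" piece swapped, and the attribute string is the gene row at start 134901 shown above. perl = TRUE is used here for the lazy quantifiers and non-capturing groups; whether genetable applies the pattern the same way is an assumption.

```r
# Attribute string of the gene row at chr1:134901; it ends right after "level 3;".
gene_attr <- paste0(
  'gene_id "ENSG00000237683.5"; transcript_id "ENSG00000237683.5"; ',
  'gene_type "protein_coding"; gene_status "KNOWN"; gene_name "AL627309.1"; ',
  'transcript_type "protein_coding"; transcript_status "KNOWN"; ',
  'transcript_name "AL627309.1"; level 3;'
)

# Hypothetical helper: same pattern pieces, with a swappable "level" piece.
build_rx <- function(level_piece) {
  paste0(
    'gene_id "(.+?)"; ',
    'transcript_id "(.+?)"; ',
    'gene_type "(.+?)";',
    ".*",                          # skips gene_status when present
    'gene_name "(.+?)"; ',
    'transcript_type "(.+?)";',
    ".*",                          # skips transcript_status when present
    'transcript_name "(.+?)"; ',
    "(?:exon_number (.+?); )?",
    '(?:exon_id "(.+?)"; )?',
    level_piece,
    "(.*)$"
  )
}

rx_with_space <- build_rx("level ([123]); ")  # current pattern
rx_no_space   <- build_rx("level ([123]);")   # trailing space removed

matches_with_space <- grepl(rx_with_space, gene_attr, perl = TRUE)
matches_no_space   <- grepl(rx_no_space, gene_attr, perl = TRUE)

# Element 1 is the full match, element 2 is the first capture group (gene_id).
captures <- regmatches(gene_attr, regexec(rx_no_space, gene_attr, perl = TRUE))[[1]]
gene_id <- captures[2]
```

One side effect to keep in mind: for rows that do have fields after the level, the final capture group would then start with a leading space (for example " protein_id ..."), so it may be worth confirming that .parse_optional() tolerates that.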
troubleshooting:
sample <- tibble::tibble(value = 'gene_id \"ENSG00000237683.5\"; transcript_id \"ENSG00000237683.5\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"AL627309.1\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"AL627309.1\"; level 3;')
rx <- paste('gene_id "(.+?)"; ', # required gene_id field in gtf
'transcript_id "(.+?)"; ', # required in gtf
'gene_type "(.+?)";', # required in gtf
".*", # gene_status removed from newer releases
'gene_name "(.+?)"; ', # required in gtf
'transcript_type "(.+?)";', # required in gtf
".*", # transcript_status removed from newer releases
'transcript_name "(.+?)"; ', # required in gtf
"(?:exon_number (.+?); )?", # exon_number (not in all lines)
'(?:exon_id "(.+?)"; )?', # exon_id (not in all lines)
"level ([123]); ", # required in gtf
"(.*)$", # additional optional fields
sep = "") # join regex string without spaces
stringr::str_view(sample$value, rx)
for comparison, Deepti's table: