bheavner closed this issue 7 years ago
Looks like the problem is with my import - there can be multiple tag fields, but I'm using this regular expression:
rx <- paste('gene_id "(.+?)"; transcript_id "(.+?)"; gene_type "(.+?)";',
            '.*; gene_name "(.+?)"; transcript_type "(.+?)";',
            'transcript_status "(.+?)";',
            'transcript_name "(.+?)";( .*; tag "(.+?)";)*', sep = " ")
parsed_gtf <- gtf %>%
  tidyr::extract(attribute,
                 c("gene_id",
                   "transcript_id",
                   "gene_type",
                   "gene_name",
                   "transcript_type",
                   "transcript_status",
                   "transcript_name",
                   "extra",
                   "tag"),
                 rx, remove = FALSE) %>%
  dplyr::select(-attribute, -extra)
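One reason a pattern like this cannot return every tag: a capture group that sits inside a repeated group only keeps the text from its last iteration. Here is a minimal sketch of that behaviour in base R, assuming R >= 3.3 (for perl = TRUE in regexec). The names attribute_tags and rx_repeat are made up for illustration, and the pattern is a simplified stand-in for the one above, not the same regex.

```r
# Tag portion of the sample attribute string (shortened for illustration).
attribute_tags <- 'tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno";'

# The inner capture group sits inside a repeated (non-capturing) group.
rx_repeat <- '^(?:tag "([^"]+)"; ?)*'

# regexec() returns the full match plus one entry per capture group.
m <- regmatches(attribute_tags, regexec(rx_repeat, attribute_tags, perl = TRUE))[[1]]

# Only the last iteration of the repeated group survives in the capture.
m[2]
# [1] "seleno"
```

So even when the repeated group matches all four tags, there is still only one captured value per row to put in a column.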
For troubleshooting:
sample = 'chr2 HAVANA stop_codon 26611969 26611971 . + 0 gene_id "ENSG00000138018.13"; transcript_id "ENST00000260585.7"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "EPT1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "EPT1-001"; exon_number 10; exon_id "ENSE00001535481.2"; level 2; protein_id "ENSP00000260585.7"; tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno"; ccdsid "CCDS46240.1"; havana_gene "OTTHUMG00000151931.3"; havana_transcript "OTTHUMT00000324484.3";'
rx <- paste('gene_id "(.+?)"; transcript_id "(.+?)"; gene_type "(.+?)";',
            '.*; gene_name "(.+?)"; transcript_type "(.+?)";',
            'transcript_status "(.+?)";',
            'transcript_name "(.+?)"; .* (tag "(.+?)";)*', sep = " ")
tidyr::extract(as_data_frame(sample), value,
               c("gene_id",
                 "transcript_id",
                 "gene_type",
                 "gene_name",
                 "transcript_type",
                 "transcript_status",
                 "transcript_name",
                 "extra",
                 "tag"),
               rx)
If I extract multiple tags with the regex, I'm still going to have to figure out how to turn that into a list for tidyr::extract, or how to parse it... :( Maybe I'll just end up matching only the tag "basic" after all. :(
This works on regexr, and gives 4 matches on the sample: (?:tag \\"([^"]+))"; (or in R, '(?:tag "([^"]+)")'), but tidyr::extract(as_data_frame(sample), value, "into", regex = '(?:tag "([^"]+)")') only returns the first match...
Maybe just do an "is basic" column using regex = '(?:tag "basic")'?
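For reference, one way to get every match rather than just the first is base R's gregexpr() with regmatches(), which returns one character vector of hits per input string. This is only a sketch and not what the import code does: attribute_str is a shortened copy of the sample, and tags and is_basic are names made up here.

```r
# Shortened copy of the sample attribute string.
attribute_str <- paste('transcript_name "EPT1-001"; exon_number 10;',
                       'tag "basic"; tag "appris_principal"; tag "CCDS";',
                       'tag "seleno"; ccdsid "CCDS46240.1";')

# gregexpr() finds all matches; regmatches() pulls out the matched text.
hits <- regmatches(attribute_str, gregexpr('tag "[^"]+"', attribute_str))

# Strip the surrounding 'tag "..."' so only the tag values remain.
tags <- lapply(hits, function(x) sub('^tag "([^"]+)"$', "\\1", x))

# A per-row logical flag, in the spirit of the "is basic" column idea.
is_basic <- vapply(tags, function(x) "basic" %in% x, logical(1))
```

The result is a list with one element per row, so it could be stored as a list column, or reduced to a flag like is_basic if only the "basic" tag matters.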
sample = 'chr2 HAVANA stop_codon 26611969 26611971 . + 0 gene_id "ENSG00000138018.13"; transcript_id "ENST00000260585.7"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "EPT1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "EPT1-001"; exon_number 10; exon_id "ENSE00001535481.2"; level 2; protein_id "ENSP00000260585.7"; tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno"; ccdsid "CCDS46240.1"; havana_gene "OTTHUMG00000151931.3"; havana_transcript "OTTHUMT00000324484.3";'
rx <- paste('gene_id "(.+?)"; transcript_id "(.+?)"; gene_type "(.+?)";',
            '.*; gene_name "(.+?)"; transcript_type "(.+?)";',
            'transcript_status "(.+?)";',
            'transcript_name "(.+?)"; .* tag "(basic)"', sep = " ")
tidyr::extract(as_data_frame(sample), value,
               c("gene_id",
                 "transcript_id",
                 "gene_type",
                 "gene_name",
                 "transcript_type",
                 "transcript_status",
                 "transcript_name",
                 "tag"),
               rx)
I'm working on this in ~/script_backup.R
After I do this:
The bounds object has 41614 observations, and the file at
/projects/topmed/gac_data/aggregation_units/gene_based/gencode_v19/gencode.v19.BasicGeneUnits.txt
has 44541 lines (per wc -l gencode.v19.BasicGeneUnits.txt), so there appears to be some difference between my dplyr::filter() approach and the previous grepl approach to matching the tag "basic". Further, if I do
foo <- genetable::summarize_tag(gtf)
I get this:
and
Which differs from the original summary:
and
Note: there is at least one feature with the tag "basic" AND the tag "seleno", so the grepl summary appears more accurate there, at least. For example:
chr2 HAVANA stop_codon 26611969 26611971 . + 0 gene_id "ENSG00000138018.13"; transcript_id "ENST00000260585.7"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "EPT1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "EPT1-001"; exon_number 10; exon_id "ENSE00001535481.2"; level 2; protein_id "ENSP00000260585.7"; tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno"; ccdsid "CCDS46240.1"; havana_gene "OTTHUMG00000151931.3"; havana_transcript "OTTHUMT00000324484.3";
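A grepl-style check counts a feature like this under every tag it carries, because each tag is tested separately against the whole attribute string. Here is a minimal sketch, assuming a character vector of attribute strings: the second entry and its gene_id are made up for contrast, and this is not necessarily how the earlier grepl code was written.

```r
# Stand-in for the attribute column: the first entry is a shortened copy of
# the feature above, the second is a made-up feature without the "basic" tag.
attribute_col <- c(
  'gene_id "ENSG00000138018.13"; tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno";',
  'gene_id "HYPOTHETICAL_GENE"; tag "CCDS";'
)

# fixed = TRUE matches the literal text, so no regex escaping is needed.
is_basic  <- grepl('tag "basic";',  attribute_col, fixed = TRUE)
is_seleno <- grepl('tag "seleno";', attribute_col, fixed = TRUE)

# Cross-tabulate to see how many features carry both tags.
tag_counts <- table(basic = is_basic, seleno = is_seleno)
```

A summary built on a single extracted tag column can only assign one tag per feature, so its counts would differ from this for any feature with more than one tag.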