filtering error - Githubissues

bheavner commented 7 years ago

After I do this:

gtf <- genetable::import_gencode('/projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz')

filtered <- genetable::filter_gencode(gtf)

bounds <- genetable::define_boundaries(filtered)

The bounds object has 41614 observations, and the file at /projects/topmed/gac_data/aggregation_units/gene_based/gencode_v19/gencode.v19.BasicGeneUnits.txt has 44541 lines (per wc -l gencode.v19.BasicGeneUnits.txt), so there appears to be some difference between my dplyr::filter() approach, and the previous grepl approach to matching the tag basic...

Further, if I do foo <- genetable::summarize_tag(gtf)

I get this:

                  FALSE   TRUE
  CDS            548592 140893
  exon           657411 265976
  gene            11557      0
  Selenocysteine    114      0
  start_codon     60695  15116
  stop_codon      52654  15096
  transcript      87376  50542
  UTR            181917  47013

and

colSums(foo)
  FALSE    TRUE
1600316  534636

Which differs from the original summary:

bar <- table(gt$feature,grepl("tag basic",gt$attribute))

                  FALSE   TRUE
  CDS            164244 559540
  exon           471961 724332
  gene            57820      0
  Selenocysteine     24     90
  start_codon     27068  57076
  stop_codon      19163  57033
  transcript     100718  95802
  UTR            125350 159223

and

colSums(bar)
  FALSE    TRUE
 966348 1653096

Note: there is at least one feature with the tag "basic" AND the tag "selo" - so the grepl summary appears more accurate there, at least. e.g.: chr2 HAVANA stop_codon 26611969 26611971 . + 0 gene_id "ENSG00000138018.13"; transcript_id "ENST00000260585.7"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "EPT1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "EPT1-001"; exon_number 10; exon_id "ENSE00001535481.2"; level 2; protein_id "ENSP00000260585.7"; tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno"; ccdsid "CCDS46240.1"; havana_gene "OTTHUMG00000151931.3"; havana_transcript "OTTHUMT00000324484.3";

bheavner commented 7 years ago

Looks like the problem is with my import - there can be multiple tag fields, but I'm using this regular expression:

  rx <- paste('gene_id "(.+?)"; transcript_id "(.+?)"; gene_type "(.+?)";',
              '.*; gene_name "(.+?)"; transcript_type "(.+?)";',
              'transcript_status "(.+?)";',
              'transcript_name "(.+?)";( .*; tag "(.+?)";)*', sep = " ")

  parsed_gtf <- gtf %>%
  tidyr::extract(attribute,
          c("gene_id",
            "transcript_id",
            "gene_type",
            "gene_name",
            "transcript_type",
            "transcript_status",
            "transcript_name",
            "extra",
            "tag"),
          rx, remove = FALSE) %>%
  dplyr::select(-attribute, -extra)

bheavner commented 7 years ago

For troubleshooting:

sample = 'chr2 HAVANA stop_codon 26611969 26611971 . + 0 gene_id "ENSG00000138018.13"; transcript_id "ENST00000260585.7"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "EPT1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "EPT1-001"; exon_number 10; exon_id "ENSE00001535481.2"; level 2; protein_id "ENSP00000260585.7"; tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno"; ccdsid "CCDS46240.1"; havana_gene "OTTHUMG00000151931.3"; havana_transcript "OTTHUMT00000324484.3";'

rx <- paste('gene_id "(.+?)"; transcript_id "(.+?)"; gene_type "(.+?)";',
            '.*; gene_name "(.+?)"; transcript_type "(.+?)";',
            'transcript_status "(.+?)";',
            'transcript_name "(.+?)"; .* (tag "(.+?)";)*', sep = " ")

tidyr::extract(as_data_frame(sample), value, c("gene_id",
  "transcript_id",
  "gene_type",
  "gene_name",
  "transcript_type",
  "transcript_status",
  "transcript_name",
  "extra",
  "tag"),
rx)

bheavner commented 7 years ago

If I extract multiple tags with the regex, I'm still going to have to figure out how to turn that into a list for tidyr::extract, or to parse it... :( Maybe just end up doing the tag basic after all. :(

bheavner commented 7 years ago

This works on regexr, and gives 4 matches on the sample: (?:tag \\"([^"]+))"; (or in R, '(?:tag "([^"]+)")', but tidyr::extract(as_data_frame(sample), value, "into", regex = '(?:tag "([^"]+)")' ) only returns the first match...

bheavner commented 7 years ago

maybe just do a "is basic" column using regex = '(?:tag "basic")'?

bheavner commented 7 years ago

sample = 'chr2 HAVANA stop_codon 26611969 26611971 . + 0 gene_id "ENSG00000138018.13"; transcript_id "ENST00000260585.7"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "EPT1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "EPT1-001"; exon_number 10; exon_id "ENSE00001535481.2"; level 2; protein_id "ENSP00000260585.7"; tag "basic"; tag "appris_principal"; tag "CCDS"; tag "seleno"; ccdsid "CCDS46240.1"; havana_gene "OTTHUMG00000151931.3"; havana_transcript "OTTHUMT00000324484.3";'

rx <- paste('gene_id "(.+?)"; transcript_id "(.+?)"; gene_type "(.+?)";',
            '.*; gene_name "(.+?)"; transcript_type "(.+?)";',
            'transcript_status "(.+?)";',
            'transcript_name "(.+?)"; .* tag "(basic)"', sep = " ")

tidyr::extract(as_data_frame(sample), value, c("gene_id",
  "transcript_id",
  "gene_type",
  "gene_name",
  "transcript_type",
  "transcript_status",
  "transcript_name",
  "tag"),
rx)

bheavner commented 7 years ago

I'm working on this in ~/script_backup.R

UW-GAC / genetable

filtering error #9