summary of imported gtf differs from script results..

bheavner commented 7 years ago

Original script results:

bar <- table(gt$feature,grepl("tag basic",gt$attribute))

                  FALSE   TRUE
  CDS            164244 559540
  exon           471961 724332
  gene            57820      0
  Selenocysteine     24     90
  start_codon     27068  57076
  stop_codon      19163  57033
  transcript     100718  95802
  UTR            125350 159223

colSums(bar)
  FALSE    TRUE
 966348 1653096

Refactored results:

summarize_tag(gtf)

                   FALSE    TRUE
  CDS            1096665  559399
  exon           1300786  715171
  gene             11506       0
  Selenocysteine     330      90
  start_codon     125346   57064
  stop_codon      100672   57022
  transcript      172843   86798
  UTR             377737  159169

colSums(summarize_tag(gtf))
  FALSE    TRUE 
3185885 1634713

bheavner commented 7 years ago

Let's see how much this may affect after I resolve #13.

bheavner commented 7 years ago

Note: gunzip -c /projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz | grep 'tag "basic";' | grep "\tCDS\t" | wc -l 559540

So the summarize_tag results seem wrong.

bheavner commented 7 years ago

Things look good through the trimming step:

> trimmed_gtf %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
         feature   count
           <chr>   <int>
1            CDS  723784
2           exon 1196293
3           gene   57820
4 Selenocysteine     114
5    start_codon   84144
6     stop_codon   76196
7     transcript  196520
8            UTR  284573

(these are the same as the sum of rows in the original summary)

Unnesting (by list columns that aren't all NA) alters these numbers:

trimmed_gtf %>% unnest(tags, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
         feature   count
           <chr>   <int>
1            CDS 1690522
2           exon 2297386
3           gene   57820
4 Selenocysteine     420
5    start_codon  190759
6     stop_codon  166155
7     transcript  326930
8            UTR  592624

trimmed_gtf %>% unnest(ccdsids, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
         feature   count
           <chr>   <int>
1            CDS  723784
2           exon 1196293
3           gene   57820
4 Selenocysteine     114
5    start_codon   84144
6     stop_codon   76196
7     transcript  196520
8            UTR  284573

trimmed_gtf %>% unnest(onts, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
         feature   count
           <chr>   <int>
1            CDS  723784
2           exon 1201412
3           gene   57820
4 Selenocysteine     114
5    start_codon   84144
6     stop_codon   76196
7     transcript  197576
8            UTR  284573

and in combination (this is slow step):

trimmed_gtf %>% unnest(tags, .drop = FALSE) %>% unnest(ccdsids, .drop = FALSE) %>% unnest(onts, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
         feature   count
           <chr>   <int>
1            CDS 1690522
2           exon 2302508
3           gene   57820
4 Selenocysteine     420
5    start_codon  190759
6     stop_codon  166155
7     transcript  327987
8            UTR  592624

That's a total of 5328795 obs... not clear where losing the CDS/tag basic ones...

bheavner commented 7 years ago

Want to diff the grep output with what the import function is making. Here's the results of the input:

View(filter(unnested_gtf, feature == "CDS") %>% group_by(tag) %>% summarize(n()))

Here's the 559k from the grep:

gunzip -c /projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz | grep 'tag "basic";' | grep "\tCDS\t" > basic_cds.txt

grepped <- readr::read_tsv("~/basic_cds.txt",
                       comment = "#",
                       col_names = c("seqname",
                                     "source",
                                     "feature",
                                     "start",
                                     "end",
                                     "score",
                                     "strand",
                                     "frame",
                                     "attribute"))

What's in grepped that's not in filter(unnested_gtf, feature == "CDS") ?

library(compare)
comparison <- compare(unique(grepped$start), unique(filter(unnested_gtf, feature == "CDS", tag == "basic")$start), allowAll=TRUE)

str(comparison)
List of 7
 $ result          : logi FALSE
 $ transform       : chr [1:2] "shortened model" "sorted"
 $ tM              : int [1:211748] 6010 11206 11872 13921 22328 28905 30898 41611 45440 47393 ...
 $ tC              : int [1:211748] 3307 4470 5904 6010 7586 8366 8527 9207 10059 10470 ...
 $ tMpartial       : int [1:211748] 6010 11206 11872 13921 22328 28905 30898 41611 45440 47393 ...
 $ tCpartial       : int [1:211748] 3307 4470 5904 6010 7586 8366 8527 9207 10059 10470 ...
 $ partialTransform: chr [1:2] "shortened model" "sorted"
 - attr(*, "class")= chr "comparison"

(so is comparison.tM what's in one not the other, or comparison.tC?)

View(unnested_gtf[unnested_gtf$start == 3307 & unnested_gtf$feature == "CDS" & unnested_gtf$tag == "basic",])
View(unnested_gtf[unnested_gtf$start == 6010 & unnested_gtf$feature == "CDS" & unnested_gtf$tag == "basic",])

both have an entry with a basic tag... as do

grepped[grepped$start == 3307, ]$attribute
grepped[grepped$start == 6010, ]$attribute

So... not sure how this is different. Compare seems like not the tool?

bheavner commented 7 years ago

setdiff(unique(grepped$start), unique(filter(unnested_gtf, feature == "CDS", tag == "basic")$start))
 [1] 61390194  6713424  6713535  6713781  6713920  6714865  6715247  6715453  6715621  6715460  6714599  6714990  6715404  6715585 75729841 12223506
[17] 12224037 12224336 36212276 30486313 30513676 30525985 30467686 30531569  9325288

ah-HA!

> grepped[grepped$start == 61390194,]
# A tibble: 1 × 9
  seqname  source feature    start      end score strand frame
    <chr>   <chr>   <chr>    <int>    <int> <chr>  <chr> <int>
1    chr2 ENSEMBL     CDS 61390194 61390367     .      +     0
# ... with 1 more variables: attribute <chr>

> unnested_gtf$tag[unnested_gtf$start == 61390194]
[1] NA NA

So there's one that doesn't have the basic tag.

bheavner commented 7 years ago

> grepped$attribute[grepped$start == 61390194]
[1] "gene_id \"ENSG00000237651.2\"; transcript_id \"ENST00000426997.1\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"C2orf74\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"C2orf74-201\"; exon_number 3; exon_id \"ENSE00003547367.1\"; level 3; protein_id \"ENSP00000398725.1\"; tag \"basic\";"

for testing regex. - note that tag is the last field in the string.

bheavner commented 7 years ago

str_view(sample, '(tag "(?:.+?)"; ?)+') seems promising - make final space optional.

UW-GAC / genetable

summary of imported gtf differs from script results.. #16