Closed bheavner closed 7 years ago
Let's see how much of an effect this still has after I resolve #13.
Note:
gunzip -c /projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz | grep 'tag "basic";' | grep "\tCDS\t" | wc -l
559540
So the summarize_tag results seem wrong.
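As a cross-check on the grep count, the same filter can be expressed in R. This is only a sketch on three hypothetical, abbreviated GTF lines (the `gtf_lines` values are made up for illustration), and it uses fixed-string matching so the tab characters are matched literally; how grep itself treats "\t" varies between implementations.

```r
library(stringr)

# Hypothetical, abbreviated GTF lines: nine tab-separated columns each.
gtf_lines <- c(
  "chr1\tHAVANA\tCDS\t100\t200\t.\t+\t0\tgene_id \"G1\"; tag \"basic\";",
  "chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id \"G1\"; tag \"basic\";",
  "chr1\tHAVANA\tCDS\t300\t400\t.\t+\t0\tgene_id \"G2\"; tag \"CCDS\";"
)

# fixed() turns off regex interpretation, so "\t" here is a literal tab.
is_cds <- str_detect(gtf_lines, fixed("\tCDS\t"))
is_basic <- str_detect(gtf_lines, fixed('tag "basic";'))

# Lines that are CDS features AND carry the basic tag.
n_basic_cds <- sum(is_cds & is_basic)
```

For the real file, `gtf_lines` could come from `readr::read_lines()`, which reads .gz files directly.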
Things look good through the trimming step:
> trimmed_gtf %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
feature count
<chr> <int>
1 CDS 723784
2 exon 1196293
3 gene 57820
4 Selenocysteine 114
5 start_codon 84144
6 stop_codon 76196
7 transcript 196520
8 UTR 284573
(these are the same as the sum of rows in the original summary)
Unnesting (by list columns that aren't all NA) alters these numbers:
trimmed_gtf %>% unnest(tags, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
feature count
<chr> <int>
1 CDS 1690522
2 exon 2297386
3 gene 57820
4 Selenocysteine 420
5 start_codon 190759
6 stop_codon 166155
7 transcript 326930
8 UTR 592624
trimmed_gtf %>% unnest(ccdsids, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
feature count
<chr> <int>
1 CDS 723784
2 exon 1196293
3 gene 57820
4 Selenocysteine 114
5 start_codon 84144
6 stop_codon 76196
7 transcript 196520
8 UTR 284573
trimmed_gtf %>% unnest(onts, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
feature count
<chr> <int>
1 CDS 723784
2 exon 1201412
3 gene 57820
4 Selenocysteine 114
5 start_codon 84144
6 stop_codon 76196
7 transcript 197576
8 UTR 284573
And in combination (this is a slow step):
trimmed_gtf %>% unnest(tags, .drop = FALSE) %>% unnest(ccdsids, .drop = FALSE) %>% unnest(onts, .drop = FALSE) %>% group_by(feature) %>% summarize(count = n())
# A tibble: 8 × 2
feature count
<chr> <int>
1 CDS 1690522
2 exon 2302508
3 gene 57820
4 Selenocysteine 420
5 start_codon 190759
6 stop_codon 166155
7 transcript 327987
8 UTR 592624
That's a total of 5328795 observations. It's not clear where the CDS rows tagged basic are getting lost.
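The inflated counts above are what unnesting a list column is expected to do: each row is repeated once per element of its list entry, while a length-one NA entry stays a single row. A minimal sketch on a hypothetical three-row table (`toy_gtf` and its values are made up; the `.drop = FALSE` argument used in this thread belongs to older tidyr versions and is omitted here):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for trimmed_gtf: one CDS with two tags,
# one CDS with one tag, and one gene with no tag (NA).
toy_gtf <- tibble(
  feature = c("CDS", "CDS", "gene"),
  start = c(100L, 300L, 50L),
  tags = list(c("basic", "CCDS"), "basic", NA_character_)
)

# One output row per list element; the NA entry is kept as one row.
unnested_toy <- toy_gtf %>% unnest(tags)

# Per-feature row counts after unnesting (CDS is inflated from 2 to 3).
counts_by_feature <- unnested_toy %>% count(feature)

# Counting distinct source rows recovers the pre-unnest CDS count.
distinct_cds <- unnested_toy %>%
  filter(feature == "CDS") %>%
  distinct(feature, start) %>%
  nrow()
```

So per-feature totals after unnesting count (row, tag) pairs rather than GTF rows, and they are only comparable to the grep count once filtered to a single tag value.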
I want to diff the grep output against what the import function produces. Here are the results of the import:
View(filter(unnested_gtf, feature == "CDS") %>% group_by(tag) %>% summarize(n()))
Here's the 559k from the grep:
gunzip -c /projects/topmed/downloaded_data/Gencode/v19/gencode.v19.annotation.gtf.gz | grep 'tag "basic";' | grep "\tCDS\t" > basic_cds.txt
grepped <- readr::read_tsv("~/basic_cds.txt",
                           comment = "#",
                           col_names = c("seqname",
                                         "source",
                                         "feature",
                                         "start",
                                         "end",
                                         "score",
                                         "strand",
                                         "frame",
                                         "attribute"))
What's in grepped that's not in filter(unnested_gtf, feature == "CDS")?
library(compare)
comparison <- compare(unique(grepped$start), unique(filter(unnested_gtf, feature == "CDS", tag == "basic")$start), allowAll=TRUE)
str(comparison)
List of 7
$ result : logi FALSE
$ transform : chr [1:2] "shortened model" "sorted"
$ tM : int [1:211748] 6010 11206 11872 13921 22328 28905 30898 41611 45440 47393 ...
$ tC : int [1:211748] 3307 4470 5904 6010 7586 8366 8527 9207 10059 10470 ...
$ tMpartial : int [1:211748] 6010 11206 11872 13921 22328 28905 30898 41611 45440 47393 ...
$ tCpartial : int [1:211748] 3307 4470 5904 6010 7586 8366 8527 9207 10059 10470 ...
$ partialTransform: chr [1:2] "shortened model" "sorted"
- attr(*, "class")= chr "comparison"
(So is comparison$tM what's in one but not the other, or is it comparison$tC?)
View(unnested_gtf[unnested_gtf$start == 3307 & unnested_gtf$feature == "CDS" & unnested_gtf$tag == "basic",])
View(unnested_gtf[unnested_gtf$start == 6010 & unnested_gtf$feature == "CDS" & unnested_gtf$tag == "basic",])
both have an entry with a basic tag... as do
grepped[grepped$start == 3307, ]$attribute
grepped[grepped$start == 6010, ]$attribute
So... not sure how these differ. compare() doesn't seem like the right tool for this?
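As far as I can tell from the compare documentation, tM and tC are the transformed model and comparison objects, not the elements that differ, so a set difference asks the question more directly. One caveat is that start positions can repeat across chromosomes, so a difference on start alone may hide rows. A sketch on hypothetical toy tables (`grepped_toy` and `imported_toy` are made up) comparing setdiff() on one column with dplyr::anti_join() on several key columns:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for grepped and the filtered unnested_gtf.
grepped_toy <- tibble(
  seqname = c("chr1", "chr2", "chr2"),
  start = c(100L, 100L, 500L),
  end = c(200L, 200L, 600L)
)
imported_toy <- tibble(
  seqname = "chr1",
  start = 100L,
  end = 200L
)

# setdiff() is asymmetric: values in the first argument missing from the second.
# On start alone, chr2:100 is hidden because chr1:100 has the same start.
missing_starts <- setdiff(grepped_toy$start, imported_toy$start)

# anti_join() keeps rows of grepped_toy with no match on all key columns.
missing_rows <- anti_join(grepped_toy, imported_toy,
                          by = c("seqname", "start", "end"))
```

Here `missing_starts` holds only 500, while `missing_rows` holds both chr2 rows.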
setdiff(unique(grepped$start), unique(filter(unnested_gtf, feature == "CDS", tag == "basic")$start))
[1] 61390194 6713424 6713535 6713781 6713920 6714865 6715247 6715453 6715621 6715460 6714599 6714990 6715404 6715585 75729841 12223506
[17] 12224037 12224336 36212276 30486313 30513676 30525985 30467686 30531569 9325288
ah-HA!
> grepped[grepped$start == 61390194,]
# A tibble: 1 × 9
seqname source feature start end score strand frame
<chr> <chr> <chr> <int> <int> <chr> <chr> <int>
1 chr2 ENSEMBL CDS 61390194 61390367 . + 0
# ... with 1 more variables: attribute <chr>
> unnested_gtf$tag[unnested_gtf$start == 61390194]
[1] NA NA
So there's one where the import doesn't have the basic tag (tag is NA), even though the grep found it:
> grepped$attribute[grepped$start == 61390194]
[1] "gene_id \"ENSG00000237651.2\"; transcript_id \"ENST00000426997.1\"; gene_type \"protein_coding\"; gene_status \"KNOWN\"; gene_name \"C2orf74\"; transcript_type \"protein_coding\"; transcript_status \"KNOWN\"; transcript_name \"C2orf74-201\"; exon_number 3; exon_id \"ENSE00003547367.1\"; level 3; protein_id \"ENSP00000398725.1\"; tag \"basic\";"
That attribute string is useful for testing the regex. Note that tag is the last field in the string.
str_view(sample, '(tag "(?:.+?)"; ?)+')
This seems promising: it makes the final space optional.
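To check the pattern outside of str_view, here is a small sketch with stringr. The `strict` pattern is an assumption about what the earlier regex looked like (a mandatory trailing space); the first test string is abbreviated from the attribute above and the second is hypothetical.

```r
library(stringr)

# Abbreviated from the C2orf74 attribute: tag is the last field, no trailing space.
last_field <- 'gene_id "ENSG00000237651.2"; level 3; protein_id "ENSP00000398725.1"; tag "basic";'

# Hypothetical attribute with two tags in the middle of the string.
multi_tag <- 'gene_id "G1"; level 2; tag "basic"; tag "CCDS"; ccdsid "CCDS1.1";'

strict <- '(tag "(?:.+?)"; )+'    # assumed earlier pattern: space required
relaxed <- '(tag "(?:.+?)"; ?)+'  # pattern from this thread: space optional

# The strict pattern finds nothing when tag ends the string.
strict_hit <- str_extract(last_field, strict)

# The relaxed pattern matches the trailing tag and a run of tags.
relaxed_hit <- str_extract(last_field, relaxed)
relaxed_run <- str_extract(multi_tag, relaxed)

# Pull the individual tag values out of the string.
tag_values <- str_match_all(multi_tag, 'tag "(.+?)";')[[1]][, 2]
```

With the strict pattern the trailing tag comes back as NA, which would be consistent with the NA tags seen above.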
Original script results:
Refactored results: