Closed ccshao closed 6 years ago
Am I right that:
Thanks!
Hi. Sorry for not responding to this sooner; we've all been on strike here and it has been taking time to catch up with things.
count never sees the GTF, so the only way it knows about a gene is if its gene_id is present in a BAM entry. Thus, if no read is aligned to a gene, count won't know about it.
The same is true for cells: if a given CB is never seen, count can't know it exists. The alternative would be to output a column for every possible CB, but for a 16 nt CB (as in something like 10X) that would be a file with about 4.3 billion columns.
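To make the two points above concrete, here is a minimal Python sketch. It is not umi_tools code: the record tuples, barcodes, and gene names are made up for illustration, and UMIs are deduplicated by exact match only, which is a simplification of what count does.

```python
from collections import defaultdict

# Hypothetical (cell barcode, gene, UMI) records standing in for tagged BAM entries.
records = [
    ("AAAC", "GeneA", "TTGA"),
    ("AAAC", "GeneA", "TTGA"),  # same UMI again, so it is counted once
    ("AAAC", "GeneB", "CCGT"),
    ("GGTT", "GeneA", "ACGT"),
]

# Collect distinct UMIs per (gene, cell). Only keys seen in the records ever exist,
# so a gene or barcode with no reads can never appear in the output.
umis = defaultdict(set)
for cell, gene, umi in records:
    umis[(gene, cell)].add(umi)

counts = {key: len(value) for key, value in umis.items()}
observed_genes = sorted({gene for gene, _ in counts})
observed_cells = sorted({cell for _, cell in counts})

# Number of possible 16 nt barcodes over the alphabet A, C, G, T.
possible_barcodes = 4 ** 16

print(observed_genes, observed_cells, possible_barcodes)
```

The last value is 4,294,967,296, which is where the "4.3 billion columns" figure comes from.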
Thanks for the explanation. Actually, it would be nice if count could output all genes even if they are not present in the data, as it is usually the case that there are several batches and UMI_tools is run separately on each of them. Matrices with the same set of genes are easier to combine.
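Until something like that exists, one possible workaround is to pad each batch to a shared gene list after counting. The following is only a sketch in plain Python: the function name pad_to_genes, the nested-dict layout, and the gene and cell names are all hypothetical, and it assumes cell names are unique across batches.

```python
def pad_to_genes(matrix, all_genes):
    """Return a gene -> {cell: count} matrix with one row per gene in all_genes.

    Genes missing from the input get a row of zeros.
    """
    cells = sorted({cell for row in matrix.values() for cell in row})
    return {
        gene: {cell: matrix.get(gene, {}).get(cell, 0) for cell in cells}
        for gene in all_genes
    }


# Two hypothetical batches with different observed gene sets.
batch1 = {"GeneA": {"cell1": 3, "cell2": 0}, "GeneB": {"cell1": 1, "cell2": 2}}
batch2 = {"GeneA": {"cell3": 5}, "GeneC": {"cell3": 4}}

# Hypothetical shared gene list; in practice it could be the gene_id values of the GTF.
all_genes = ["GeneA", "GeneB", "GeneC"]

padded1 = pad_to_genes(batch1, all_genes)
padded2 = pad_to_genes(batch2, all_genes)

# With identical rows in both batches, combining is a column-wise merge.
combined = {gene: {**padded1[gene], **padded2[gene]} for gene in all_genes}
print(combined)
```

Taking the union of the genes observed across batches would also work as the shared list if the GTF is not at hand.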
Hi Shao,
I am actually running umi_tools on several fastq files, each corresponding to a single cell. My workaround for this is to create the wide-format matrix directly in R. You can simply omit --wide-format-cell-counts and do the following in R:
# Get list of UMI counts files (output from umi_tools)
count.files <- list.files(path = ".", pattern = "counts.tsv")
names(count.files) <- count.files

# Read umi_tools count output files from directory and store them in a list
counts.list <- lapply(X = count.files,
                      FUN = function(count.file) {
                        mat <- read.table(file = count.file, header = TRUE)
                        # Print cell name to keep track of the progress
                        cat(paste("Reading umi_tools count output file for cell", unique(mat$cell), sep = " "), sep = "\n")
                        return(mat)
                      })

# Bind list elements by rows
counts <- dplyr::bind_rows(counts.list)

# Get wide format UMI count table
# Fill NA with 0
counts.wf <- transform(
  tidyr::spread(data = counts,
                key = cell,
                value = count,
                fill = 0,
                drop = FALSE),
  row.names = gene,  # use gene IDs as row names
  gene = NULL        # then drop the gene column
)
Best, Leon
Cool, thank you very much for sharing the code, @leonfodoulian.
I tried the gene-level summary with the Drop-seq data (GSE107122), which has 5 batches. However, I got a different number of genes in each batch even though the same GTF annotation was used:
I followed the tutorial, and here is the command for count:
umi_tools count --per-gene --gene-tag=XT --per-cell --wide-format-cell-counts -I examFolder.assigned_sorted.bam -S examFolder.gene.counts.tsv.gz
When R reads the tsv files with fread (in data.table), I get the following log:
Here rows are genes and columns are cells (wide format). I expected the same genes in every output of count; however, many genes are dropped. How does umi_tools count decide which genes to keep and which to discard?
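One way to check whether the differences are simply genes with no reads in a given batch is to compare the gene columns of the per-batch tables. This is a rough Python sketch under assumptions: it treats the first column of each wide-format table as the gene ID and the first row as a header, and it uses small in-memory tables with made-up names instead of the real files. For the real .tsv.gz outputs, a handle from gzip.open(path, "rt") could be passed in the same way.

```python
import csv
import io


def read_gene_ids(handle):
    """Return the set of values in the first column of a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return {row[0] for row in reader if row}


# Hypothetical stand-ins for two per-batch wide-format count tables.
batch_a = io.StringIO("gene\tAAAC\tGGTT\nGeneA\t1\t1\nGeneB\t1\t0\n")
batch_b = io.StringIO("gene\tCCAA\nGeneA\t2\nGeneC\t1\n")

genes = {"batch_a": read_gene_ids(batch_a), "batch_b": read_gene_ids(batch_b)}

# Genes seen in at least one batch, and those absent from each individual batch.
union = set().union(*genes.values())
missing = {name: sorted(union - seen) for name, seen in genes.items()}
print(missing)
```

If every gene missing from one batch is present in another, that is consistent with the explanation earlier in the thread: count only reports genes that appear in the BAM for that batch.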