CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
493 stars 190 forks source link

different total genes outputed by count in batches #239

Closed ccshao closed 6 years ago

ccshao commented 6 years ago

I tried the gene level summary with the dropseq data (GSE107122), which has 5 batches. however, I got different number of genes in batches though the same gtf annotation is used:

I follow the tutorial and here is the command for count umi_tools count --per-gene --gene-tag=XT --per-cell --wide-format-cell-counts -I examFolder.assigned_sorted.bam -S examFolder.gene.counts.tsv.gz

When R read the tsv files by fread (in data.table), I go the following log:

Read 82896 rows and 2001 (of 2001) columns from 0.311 GB file in 00:00:05 Read 86200 rows and 3001 (of 3001) columns from 0.484 GB file in 00:00:08 Read 78210 rows and 2001 (of 2001) columns from 0.293 GB file in 00:00:05 Read 82833 rows and 1993 (of 1993) columns from 0.309 GB file in 00:00:05 Read 79331 rows and 2000 (of 2000) columns from 0.297 GB file in 00:00:05

Where, rows are genes and columns are cells (in wide format). I expect the same genes in the output of count, however, many genes are droped. How umi_tools count keep and discard genes?

ccshao commented 6 years ago

Am I right that:

Thanks!

IanSudbery commented 6 years ago

HI. Sorry for not responding to this quicker, we've all been on strike here and it has been taking time to catch up with things.

count never sees the GTF, so the only way it know about genes is if the gene_id is present in a BAM entry. Thus if no read is a aligned to a gene then count won't know about it.

The same is also true for cells: if a given CB is never seen then count can't know it exists. The alternative would be to output a column for every possible CB, but for a 16nt CB (as in something like 10X) this would be a 4.3 billion column file.

ccshao commented 6 years ago

Thanks for the explaination. Actually it would be nice if the count could output all genes even if they are not presented in the data, as it is usual the case that there are several batches, and UMI_tools are employed separately to each of them. A matrix with same number of genes are easier to combine.

leonfodoulian commented 6 years ago

Hi Shao,

I am actually running umi_tools on several fastq files, each corresponding to a single cell. My workaround for this is to create the wide format matrix directly in R. You can simply not specify --wide-format-cell-counts, and do as follows in R:

# Get list of UMI counts files (output from umi_tools)
count.files <- list.files(path = ".", pattern = "counts.tsv")
names(count.files) <- count.files

# Read umi_tools count output files from direcotry and store files in a list
counts.list <- lapply(X = count.files, 
                      FUN = function(count.file) { 
                        mat <- read.table(file = count.file, header = TRUE) 
                        # Print cell name to keep track of the progress
                        cat(x = paste("Reading umi_tools count output file for cell", unique(mat$cell), sep=" "), sep = "\n")
                        return(mat)
                      })

# Bind list elements by rows
counts <- dplyr::bind_rows(counts.list)

# Get wide format UMI count table
# Fill NA with 0
counts.wf <- transform(
  tidyr::spread(data = counts, 
                key = cell, 
                value = count, 
                fill = 0, 
                drop = FALSE),
  row.names = gene,
  gene = NULL
)

Best, Leon

ccshao commented 6 years ago

cool, thank you very much for sharing the codes @leonfodoulian