getCounts with 10X Cellranger output

AmandaKedaigle commented 5 years ago

I'm trying to use ChromVAR to read in output from the 10X cellranger-atac pipeline. I believe I have it working when reading in the peak_bc_matrix as my own count table, but I'd also like to use chromVAR's getCounts function on the possorted.bam files output by cellranger. (Mainly because I am combining data from several 10X lanes into one analysis, so I've calculated a union set of peaks from all cells and now want to re-calculate a peak-by-cell matrix).

The BAM files output by cellranger don't use the RG tag to indicate different cells, but a "CB" tag. Can I specify this in the "by-rg" flag somehow?

Also, since I am combining several bam files into one analysis, will cells with matching CB tags from different files be separated for me (they'll have different attributes given to colData), or should I edit the cell barcodes to be distinct before combining?

Thanks!

AmandaKedaigle commented 5 years ago

As an update, I've tried a couple different ways to get around this (both reading out my bam file as a sam and changing each CB tag to an RG tag, and by changing the by-rg code to look for CB tags instead), but it looks like there is another issue with the cellranger bams that chromVAR doesn't like as well. When I run it now I get this error:

> fragment_counts = getCounts("RGchanged.bam", peaks, paired = F, by_rg = T, format="bam")
Reading in file: RGchanged.bam
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") : 
  solving row 43: range cannot be determined from the supplied arguments (too many NAs)

I can find this error in a few different places on google, and it seems to be related to the format of the input file not being what is expected, but I can't seem to narrow down what the problem is.

AliciaSchep commented 5 years ago

I'm not familiar with cellranger pipeline so don't know what might be the issue. If in digging into the source code (and/or reading in the files using Bioconductor tools more generally) you find something that seems to work please update as it would be great to make the code work for that use case.

AliciaSchep commented 5 years ago

Some notes from consulting with @jeffmgranja

1) The bams are aligned by bwa and there can be issues with very small fragment sizes -- this might be what is leading to bug, because after offset the size would be less than 0. Could add into code something like which(abs(scanned$isize) >= minFragSize) and have a minFragSize in the inputs like 10 bp. As a work-around now, maybe pre-filtering the bams for fragment size could help?

2) 10x has fragments bed file (already tn5 offset) with barcodes in 4th column. Potentially the function in chromVAR to read fragments from bed file could have an option to treat a column as a barcode.

x811zou commented 5 years ago

Hi! I also have the question in obtaining count table from 10x scATACseq @AmandaKedaigle how do you get the count table? Using getCount function + RG-coded-BAM file?

ayeaton commented 5 years ago

Hi @x811zou,

I also wanted to use chromVAR for 10x scATAC-Seq. I used the 10x fragments bed file and slightly changed some functions in chromVAR involved in reading fragments from bed files to treat a column as a barcode so that I could use chromVAR with our data.

I created a pull request with my method, if you are interested.

caleblareau commented 5 years ago

Here's my attempt at it as well-- https://github.com/caleblareau/scATAC_10X_raw_to_kmers/blob/master/example_kmers.R

AliciaSchep commented 5 years ago

@caleblareau What is the file that your method is using? It looks like it is a different file input from what the PR from @ayeaton would use? Would it be a complementary method to also include?

caleblareau commented 5 years ago

@AliciaSchep the "fragments.tsv.gz" file that is the basis of the fragments is the same. The additional file that I reference is essentially the barcodes that pass filter based on 10X's pipeline. My understanding (@ayeaton please correct me if I'm wrong) is that her function will establish a column for every conceivable barcode observed, which I think could run into problems.

Overall, they do seem complimentary; I can package up a PR if there's enough interest.

AliciaSchep commented 5 years ago

Ok thanks for explanation @caleblareau. Being able to filter out barcodes that don't pass filter based on a prior pipeline seems like it could be useful. Maybe getCounts should have an optional barcodes feature (that might not only be applicable to 10x case?) that limits results to barcodes provided. If that is the main difference then it seems like that could be added into what @ayeaton has already provided.

Although it looks like your approach to @caleblareau deviates a bit more from the previous chromVAR bed file approach... wondering if there are speed/memory improvements that become especially relevant with 10x data. It could be having a separate getCounts10x function would make sense.

caleblareau commented 5 years ago

I think that it's relatively easy to do a post-hoc filtering from the standard getCounts, so I wouldn't worry too much about it in general (unless you are feeling over-ambitious).

Yes, I wrote my function with speed/memory concerns specifically in mind. I haven't tried the other proposed function, but I can say that I'm quite happy with the performance of my function acknowledging the size of the data.

I wrote a modification of getCounts to go from a barcoded bam to a SE using this logic, and it performs much faster than the standard getCounts function (https://github.com/caleblareau/chromVARxx/blob/master/R/getCounts-tweaks.R)

AliciaSchep commented 5 years ago

No I'm not feeling over-ambitious :rofl:

I think for now I'll probably look to incorporate the PR from @ayeaton close to as-is

@caleblareau do you mind if I link to chromVARxx in README? (Would also be happy to incorporate faster version of getCounts into chromVAR if you want to put in a PR, but for now maybe if people can be pointed to versions that might be more performant if they run into issue? happy to also advertise any other capabilities that chromVARxx might add)

pguilha commented 5 years ago

have these approaches been tested/compared? what is your recommendation on loading 10x data in the end?

moonriver2002 commented 5 years ago

Hi @x811zou,

I also wanted to use chromVAR for 10x scATAC-Seq. I used the 10x fragments bed file and slightly changed some functions in chromVAR involved in reading fragments from bed files to treat a column as a barcode so that I could use chromVAR with our data.

I created a pull request with my method, if you are interested.

I am interested because I am trying to use chromVAR to use our 10Xgenomic ATAC data. Basically, I can read the "fragments.tsv" bed files for each sample as as fragment counts library and I have peakcall calculated for each sample/combined samples for each group. I like to know how to get the fragment_counts object into chromVAR from here.

Thanks.

Yongqing Zhang

GreenleafLab / chromVAR

getCounts with 10X Cellranger output #44