Non-equivalence of BAM and fragment file input

rcorces commented 3 years ago

Describe the bug I am unable to get exactly the same Arrow files when using BAM and fragment file inputs from the same data set. For example, using the possorted_bam output from 10x and the fragments.tsv.gz output from 10x give different pass-filter cells and different fragment counts. This could be caused by:

Not knowing the precise parameters to pass to createArrow() when reading in a BAM file in order to recapitulate the behavior of CellRanger when it creates the fragment file.
Inconsistencies between how CellRanger obtains fragments and how ArchR obtains fragments from a BAM file.

To Reproduce First, using bash, download example data from 10x and tidy it up:

wget https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_possorted_bam.bam.bai
wget https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_pbmc_500_nextgem/atac_pbmc_500_nextgem_possorted_bam.bam

#filter reads that dont have an annotated barcode in CB: otherwise these break createArrowFiles
#Honestly, the error thrown is super cryptic and should be improved
samtools view -h atac_pbmc_500_nextgem_possorted_bam.bam | grep 'CB:Z:\|@' | samtools view -h -b - >atac_pbmc_500_nextgem_possorted_bam_filt.bam

Then, switch to R and run:

bamfile_10x <- "./atac_pbmc_500_nextgem_possorted_bam_filt.bam"
fragfile_10x <- "./atac_pbmc_500_nextgem_fragments.tsv.gz"
addArchRGenome("hg19")
ArrowBam10x <- createArrowFiles(
  inputFiles = bamfile_10x,
  sampleNames = "bam10x",
  minTSS = 4, #Dont set this too high because you can always increase later
  minFrags = 1000, 
  addTileMat = FALSE,
  addGeneScoreMat = FALSE,
  bcTag = "CB",
  bamFlag = list(isMinusStrand = FALSE, isProperPair = TRUE, isDuplicate = FALSE),
  force = TRUE
)
ArrowFrag10x <- createArrowFiles(
  inputFiles = fragfile_10x,
  sampleNames = "frag10x",
  minTSS = 4, #Dont set this too high because you can always increase later
  minFrags = 1000, 
  addTileMat = FALSE,
  addGeneScoreMat = FALSE,
  force = TRUE
)
frag10x <- getFragmentsFromArrow(ArrowFile = ArrowFrag10x)
bam10x <- getFragmentsFromArrow(ArrowFile = ArrowBam10x)

frag10x_barcodes <- unique(gsub(pattern = "frag10x#", replacement = "", x = unique(frag10x$RG)))
bam10x_barcodes <- unique(gsub(pattern = "bam10x#", replacement = "", x = unique(bam10x$RG)))

The number of cells, number of fragments, etc are all different:

> length(unique(bam10x$RG))
[1] 525
> length(unique(frag10x$RG))
[1] 524
> length(frag10x)
[1] 8587941
> length(bam10x)
[1] 9163379
> which(frag10x_barcodes %ni% bam10x_barcodes)
[1] 1
> which(bam10x_barcodes %ni% frag10x_barcodes)
[1] 524 525

Expected behavior In an ideal world, the arrow files resulting from BAM and fragment input should be identical. Its possible that there are input parameters that are not optimized on my end but if this is the case then we should improve this documentation.

Session Info

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 8 (Core)

Matrix products: default
BLAS:   /opt/R/3.6.3/lib64/R/lib/libRblas.so
LAPACK: /opt/R/3.6.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm10_1.4.0 presto_1.0.0                       Rcpp_1.0.5                        
 [4] nabor_0.5.0                        seqLogo_1.52.0                     chromVARmotifs_0.2.0              
 [7] motifmatchr_1.8.0                  circlize_0.4.10                    ComplexHeatmap_2.2.0              
[10] ggrastr_0.2.2                      BSgenome.Hsapiens.UCSC.hg19_1.4.0  gridExtra_2.3                     
[13] BSgenome.Hsapiens.UCSC.hg38_1.4.1  BSgenome_1.54.0                    rtracklayer_1.46.0                
[16] ArchR_1.0.1                        magrittr_1.5                       rhdf5_2.30.1                      
[19] Matrix_1.2-18                      data.table_1.13.0                  SummarizedExperiment_1.16.1       
[22] DelayedArray_0.12.3                BiocParallel_1.20.1                matrixStats_0.57.0                
[25] Biobase_2.46.0                     ggplot2_3.3.2                      Rsamtools_2.2.3                   
[28] Biostrings_2.54.0                  XVector_0.26.0                     GenomicRanges_1.38.0              
[31] GenomeInfoDb_1.22.1                IRanges_2.20.2                     S4Vectors_0.24.4                  
[34] BiocGenerics_0.32.0               

loaded via a namespace (and not attached):
 [1] ggbeeswarm_0.6.0            colorspace_1.4-1            rjson_0.2.20                ellipsis_0.3.1             
 [5] rprojroot_1.3-2             GlobalOptions_0.1.2         fs_1.5.0                    clue_0.3-57                
 [9] rstudioapi_0.11             farver_2.0.3                remotes_2.2.0               ggrepel_0.8.2              
[13] bit64_4.0.5                 AnnotationDbi_1.48.0        fansi_0.4.1                 codetools_0.2-16           
[17] R.methodsS3_1.8.1           pkgload_1.1.0               Cairo_1.5-12.2              annotate_1.64.0            
[21] GO.db_3.10.0                cluster_2.1.0               R.oo_1.24.0                 png_0.1-7                  
[25] httr_1.4.2                  BiocManager_1.30.10         readr_1.3.1                 compiler_3.6.3             
[29] backports_1.1.10            assertthat_0.2.1            cli_2.0.2                   prettyunits_1.1.1          
[33] tools_3.6.3                 gtable_0.3.0                glue_1.4.2                  TFMPvalue_0.0.8            
[37] GenomeInfoDbData_1.2.2      reshape2_1.4.4              dplyr_1.0.2                 vctrs_0.3.4                
[41] CNEr_1.22.0                 stringr_1.4.0               ps_1.3.4                    testthat_2.3.2             
[45] lifecycle_0.2.0             poweRlaw_0.70.6             gtools_3.8.2                devtools_2.3.2             
[49] XML_3.99-0.3                zlibbioc_1.32.0             MASS_7.3-51.5               scales_1.1.1               
[53] hms_0.5.3                   RColorBrewer_1.1-2          curl_4.3                    memoise_1.1.0              
[57] stringi_1.5.3               RSQLite_2.2.0               desc_1.2.0                  caTools_1.18.0             
[61] pkgbuild_1.1.0              shape_1.4.5                 rlang_0.4.7                 pkgconfig_2.0.3            
[65] bitops_1.0-6                pracma_2.2.9                lattice_0.20-38             purrr_0.3.4                
[69] Rhdf5lib_1.8.0              GenomicAlignments_1.22.1    labeling_0.3                bit_4.0.4                  
[73] processx_3.4.4              tidyselect_1.1.0            plyr_1.8.6                  R6_2.4.1                   
[77] generics_0.0.2              DBI_1.1.0                   pillar_1.4.6                withr_2.3.0                
[81] KEGGREST_1.26.1             RCurl_1.98-1.2              tibble_3.0.3                crayon_1.3.4               
[85] GetoptLong_1.0.2            TFBSTools_1.24.0            usethis_1.6.3               blob_1.2.1                 
[89] callr_3.4.4                 digest_0.6.25               xtable_1.8-4                R.utils_2.10.1             
[93] munsell_0.5.0               DirichletMultinomial_1.28.0 beeswarm_0.2.3              vipor_0.4.5                
[97] sessioninfo_1.1.1

trinevd commented 3 years ago

Hi. Thanks for developing this great and user-friendly tool.

Reading this issue, I thought I would post a related question in the same thread, as it may explain why Arrow files end up being different depending on whether you use the Fragment file or the BAM file as input. Alternatively, it may be a very naive question.

I use Cellranger’s fragment files as input into my ArchR analysis. From reading your publication + other issues I get the impression that this is the primary input file type, although BAM files also can be used.

My question: • The default parameters in the createArrowFile() command adjust the fragment position with +4/-5 bp to take into account how tn5 cuts chromatin (offsetPlus and offsetMinus parameters).

• When I read about the cellranger fragment file here, it seems like the same fragment position adjustment is performed when this file is generated.

• Does this mean, by using the ArchR default parameters on a fragment file input, each fragment ends up being double adjusted in the .arrow file? (i.e. offsetting +8/-10 bp)

I’m aware that an easy solution here is to edit the offsetPlus/offsetMinus parameters in the createArrowFile() command. However, I can see from above that you are using default +4/-5 bp parameters, also when the input file is the Cellranger fragment file, so maybe there is a detail I am missing.

Did you consider that the fragment position has already been adjusted in the Cellranger fragment file and thus should not be adjusted again when creating arrow files?

Best regards, Trine

Session info

sessionInfo() R version 4.0.2 (2020-06-22) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS

Matrix products: default BLAS/LAPACK: /data/home/trinevd/.conda/envs/ArchR_r4/lib/libopenblasp-r0.3.12.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] ArchR_1.0.1 magrittr_2.0.1 [3] rhdf5_2.34.0 Matrix_1.3-2 [5] data.table_1.14.0 SummarizedExperiment_1.20.0 [7] Biobase_2.50.0 GenomicRanges_1.42.0 [9] GenomeInfoDb_1.26.2 IRanges_2.24.1 [11] S4Vectors_0.28.1 BiocGenerics_0.36.0 [13] MatrixGenerics_1.2.1 matrixStats_0.58.0 [15] ggplot2_3.3.3

loaded via a namespace (and not attached): [1] tidyselect_1.1.0 purrr_0.3.4 [3] lattice_0.20-41 colorspace_2.0-0 [5] vctrs_0.3.6 generics_0.1.0 [7] rtracklayer_1.50.0 XML_3.99-0.5 [9] utf8_1.1.4 rlang_0.4.10 [11] pillar_1.5.0 glue_1.4.2 [13] withr_2.4.1 DBI_1.1.1 [15] BiocParallel_1.24.1 GenomeInfoDbData_1.2.4 [17] lifecycle_1.0.0 zlibbioc_1.36.0 [19] Biostrings_2.58.0 munsell_0.5.0 [21] gtable_0.3.0 Cairo_1.5-12.2 [23] fansi_0.4.2 Rcpp_1.0.6 [25] BSgenome.Mmusculus.UCSC.mm10_1.4.0 scales_1.1.1 [27] BSgenome_1.58.0 DelayedArray_0.16.1 [29] XVector_0.30.0 Rsamtools_2.6.0 [31] dplyr_1.0.4 grid_4.0.2 [33] bitops_1.0-6 tools_4.0.2 [35] rhdf5filters_1.2.0 RCurl_1.98-1.2 [37] tibble_3.0.6 crayon_1.4.1 [39] pkgconfig_2.0.3 ellipsis_0.3.1 [41] assertthat_0.2.1 rstudioapi_0.13 [43] Rhdf5lib_1.12.1 R6_2.5.0 [45] GenomicAlignments_1.26.0 compiler_4.0.2

rcorces commented 3 years ago

@trinevd - the documentation could be better on this. offsetPlus and offsetMinus are only used reading in a bam file, not when reading in a fragment file.

https://github.com/GreenleafLab/ArchR/blob/ecfe12a6b4df80beec1bf16be3b398433a7742bd/R/CreateArrow.R#L389

We should update this because if someone had an unshifted fragment file they might expect createArrowFiles() to apply the offset and it will not (though by our definition and 10x's definition, a fragment file has already been offset - https://www.archrproject.com/bookdown/a-brief-primer-on-atac-seq-terminology.html).

I have updated this on the dev branch via https://github.com/GreenleafLab/ArchR/commit/9f008ad2da9fd0a273ac5927c41eb4b0db4ff32a

trinevd commented 3 years ago

Hi again,

Thanks a lot for the clarification and your quick reply.

Best, Trine.

cflerin commented 3 years ago

Related to this, Cell Ranger performs an adjustment to the positions of the fragments when there is soft clipping present in the reads. (it adds the soft-clipped portion of the read back). Technically this is not how the positions in the SAM format are reported according to the spec. Maybe this could help explain a part of the differences in bam/fragments input (although you don't compare the positions in the example above). You can see a description of this in this issue: https://github.com/timoast/sinto/issues/29 (it's for a different tool, but I found that Cell Ranger does this soft clipping adjustment as well).

Something else that I noticed is how Cell Ranger collapses fragments with different barcodes in the same interval, and chooses one of the barcodes to represent that interval. I believe that it makes each genomic interval in the fragments file unique.

rcorces commented 3 years ago

@cflerin - thanks for posting this. I had not realized the soft clipping.

With regards to this:

Something else that I noticed is how Cell Ranger collapses fragments with different barcodes in the same interval, and chooses one of the barcodes to represent that interval. I believe that it makes each genomic interval in the fragments file unique.

I believe that 10x fixed this in their v2 software. Now duplicates must have the same barcode, start, and end, rather than just start and end. https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/release-notes

cflerin commented 3 years ago

Ah, you're right, I must be thinking of the older version. A quick check of a fragments file from Cell Ranger ATAC 2.0.0 seems to confirm this. I did spend some time trying to replicate the Cell Ranger fragments file generation, but could never quite figure out why I was getting a slightly different set of fragments in the end.

rcorces commented 2 years ago

I'm closing this issue as I dont think there is an easy solution for this.

dagarfield commented 1 year ago

Probably solved and old, but we were pondering this today and noted a difference in how shifts are described in ArchR and 10x. From the 10x description

The BED interval of the fragment is obtained by adjusting the BAM alignment interval of the sequenced read-pair. The start of the interval is moved forward by 4 bp from a left-most alignment position and backward 5 bp from the right-most alignment position.

Note the reference to left and right-most rather than + and minus.

From ArchR (in this case the description of createArrowFiles, which closely mirrors the description here)

*offsetPlus | The numeric offset to apply to a "+" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013.

*offsetMinus | The numeric offset to apply to a "-" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013.

I've also used the +/- convention in my own code, but does the wording here suggest that ArchR (when working from BAM files) shifts fragments in a strand-aware way while 10x does not?

The context here is trying to figure out base-pair level modeling of pseudo-bulked scATAC data with, eg. ChromBPNet.

rcorces commented 1 year ago

@dagarfield - the wording in ArchR isnt the most accurate. We shift the same way 10x shifts (left side vs right side).

For posterity, the code that does the offset is here: https://github.com/GreenleafLab/ArchR/blob/472fbb00b7377aeab05fa3d66d546cf4277555b2/R/CreateArrow.R#L1540-L1544

And I've changed the parameter description on dev via https://github.com/GreenleafLab/ArchR/commit/40a8ac8b32bbcc97757679b31d5d061776e112f9

GreenleafLab / ArchR

Non-equivalence of BAM and fragment file input #589