Closed rcorces closed 2 years ago
Hi. Thanks for developing this great and user-friendly tool.
Reading this issue, I thought I would post a related question in the same thread, as it may explain why Arrow files end up being different depending on whether you use the Fragment file or the BAM file as input. Alternatively, it may be a very naive question.
I use Cellranger’s fragment files as input into my ArchR analysis. From reading your publication + other issues I get the impression that this is the primary input file type, although BAM files also can be used.
My question: • The default parameters in the createArrowFile() command adjust the fragment position with +4/-5 bp to take into account how tn5 cuts chromatin (offsetPlus and offsetMinus parameters).
• When I read about the cellranger fragment file here, it seems like the same fragment position adjustment is performed when this file is generated.
• Does this mean, by using the ArchR default parameters on a fragment file input, each fragment ends up being double adjusted in the .arrow file? (i.e. offsetting +8/-10 bp)
I’m aware that an easy solution here is to edit the offsetPlus/offsetMinus parameters in the createArrowFile() command. However, I can see from above that you are using default +4/-5 bp parameters, also when the input file is the Cellranger fragment file, so maybe there is a detail I am missing.
Did you consider that the fragment position has already been adjusted in the Cellranger fragment file and thus should not be adjusted again when creating arrow files?
Best regards, Trine
Session info
sessionInfo() R version 4.0.2 (2020-06-22) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS
Matrix products: default BLAS/LAPACK: /data/home/trinevd/.conda/envs/ArchR_r4/lib/libopenblasp-r0.3.12.so
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base
other attached packages: [1] ArchR_1.0.1 magrittr_2.0.1 [3] rhdf5_2.34.0 Matrix_1.3-2 [5] data.table_1.14.0 SummarizedExperiment_1.20.0 [7] Biobase_2.50.0 GenomicRanges_1.42.0 [9] GenomeInfoDb_1.26.2 IRanges_2.24.1 [11] S4Vectors_0.28.1 BiocGenerics_0.36.0 [13] MatrixGenerics_1.2.1 matrixStats_0.58.0 [15] ggplot2_3.3.3
loaded via a namespace (and not attached): [1] tidyselect_1.1.0 purrr_0.3.4 [3] lattice_0.20-41 colorspace_2.0-0 [5] vctrs_0.3.6 generics_0.1.0 [7] rtracklayer_1.50.0 XML_3.99-0.5 [9] utf8_1.1.4 rlang_0.4.10 [11] pillar_1.5.0 glue_1.4.2 [13] withr_2.4.1 DBI_1.1.1 [15] BiocParallel_1.24.1 GenomeInfoDbData_1.2.4 [17] lifecycle_1.0.0 zlibbioc_1.36.0 [19] Biostrings_2.58.0 munsell_0.5.0 [21] gtable_0.3.0 Cairo_1.5-12.2 [23] fansi_0.4.2 Rcpp_1.0.6 [25] BSgenome.Mmusculus.UCSC.mm10_1.4.0 scales_1.1.1 [27] BSgenome_1.58.0 DelayedArray_0.16.1 [29] XVector_0.30.0 Rsamtools_2.6.0 [31] dplyr_1.0.4 grid_4.0.2 [33] bitops_1.0-6 tools_4.0.2 [35] rhdf5filters_1.2.0 RCurl_1.98-1.2 [37] tibble_3.0.6 crayon_1.4.1 [39] pkgconfig_2.0.3 ellipsis_0.3.1 [41] assertthat_0.2.1 rstudioapi_0.13 [43] Rhdf5lib_1.12.1 R6_2.5.0 [45] GenomicAlignments_1.26.0 compiler_4.0.2
@trinevd - the documentation could be better on this. offsetPlus
and offsetMinus
are only used reading in a bam file, not when reading in a fragment file.
We should update this because if someone had an unshifted fragment file they might expect createArrowFiles()
to apply the offset and it will not (though by our definition and 10x's definition, a fragment file has already been offset - https://www.archrproject.com/bookdown/a-brief-primer-on-atac-seq-terminology.html).
I have updated this on the dev branch via https://github.com/GreenleafLab/ArchR/commit/9f008ad2da9fd0a273ac5927c41eb4b0db4ff32a
Hi again,
Thanks a lot for the clarification and your quick reply.
Best, Trine.
Related to this, Cell Ranger performs an adjustment to the positions of the fragments when there is soft clipping present in the reads. (it adds the soft-clipped portion of the read back). Technically this is not how the positions in the SAM format are reported according to the spec. Maybe this could help explain a part of the differences in bam/fragments input (although you don't compare the positions in the example above). You can see a description of this in this issue: https://github.com/timoast/sinto/issues/29 (it's for a different tool, but I found that Cell Ranger does this soft clipping adjustment as well).
Something else that I noticed is how Cell Ranger collapses fragments with different barcodes in the same interval, and chooses one of the barcodes to represent that interval. I believe that it makes each genomic interval in the fragments file unique.
@cflerin - thanks for posting this. I had not realized the soft clipping.
With regards to this:
Something else that I noticed is how Cell Ranger collapses fragments with different barcodes in the same interval, and chooses one of the barcodes to represent that interval. I believe that it makes each genomic interval in the fragments file unique.
I believe that 10x fixed this in their v2
software. Now duplicates must have the same barcode, start, and end, rather than just start and end.
https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/release-notes
Ah, you're right, I must be thinking of the older version. A quick check of a fragments file from Cell Ranger ATAC 2.0.0 seems to confirm this. I did spend some time trying to replicate the Cell Ranger fragments file generation, but could never quite figure out why I was getting a slightly different set of fragments in the end.
I'm closing this issue as I dont think there is an easy solution for this.
Probably solved and old, but we were pondering this today and noted a difference in how shifts are described in ArchR and 10x. From the 10x description
The BED interval of the fragment is obtained by adjusting the BAM alignment interval of the sequenced read-pair. The start of the interval is moved forward by 4 bp from a left-most alignment position and backward 5 bp from the right-most alignment position.
Note the reference to left and right-most rather than + and minus.
From ArchR (in this case the description of createArrowFiles
, which closely mirrors the description here)
*offsetPlus | The numeric offset to apply to a "+" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013.
*offsetMinus | The numeric offset to apply to a "-" stranded Tn5 insertion to account for the precise Tn5 binding site. See Buenrostro et al. Nature Methods 2013.
I've also used the +/- convention in my own code, but does the wording here suggest that ArchR (when working from BAM files) shifts fragments in a strand-aware way while 10x does not?
The context here is trying to figure out base-pair level modeling of pseudo-bulked scATAC data with, eg. ChromBPNet.
@dagarfield - the wording in ArchR isnt the most accurate. We shift the same way 10x shifts (left side vs right side).
For posterity, the code that does the offset is here: https://github.com/GreenleafLab/ArchR/blob/472fbb00b7377aeab05fa3d66d546cf4277555b2/R/CreateArrow.R#L1540-L1544
And I've changed the parameter description on dev
via https://github.com/GreenleafLab/ArchR/commit/40a8ac8b32bbcc97757679b31d5d061776e112f9
Describe the bug I am unable to get exactly the same Arrow files when using BAM and fragment file inputs from the same data set. For example, using the possorted_bam output from 10x and the fragments.tsv.gz output from 10x give different pass-filter cells and different fragment counts. This could be caused by:
createArrow()
when reading in a BAM file in order to recapitulate the behavior of CellRanger when it creates the fragment file.To Reproduce First, using bash, download example data from 10x and tidy it up:
Then, switch to R and run:
The number of cells, number of fragments, etc are all different:
Expected behavior In an ideal world, the arrow files resulting from BAM and fragment input should be identical. Its possible that there are input parameters that are not optimized on my end but if this is the case then we should improve this documentation.
Session Info