GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
384 stars 137 forks source link

Inconsistency between input bed(tabix) and getFragmentsFromArrow results #481

Closed xiaosuyu1997 closed 3 years ago

xiaosuyu1997 commented 3 years ago

Attach your log file ArchR-createArrows-417c778276a-Date-2021-01-02_Time-13-42-18.log

Describe the bug While I modify the tutorial data a little bit and change the first fragment in scATAC_BMMC_R1 to: image Results from getFragmentsFromArrow("test.arrow", cellNames = "test#TAGGAGGGTTAACCGT-1") give me: image It should make more sense to me for these two results to be consistent (at least negative width should not appear, if I made a silly mistake, so sorry for that).

Code leading to this (take tabix input as an example) should be here: While saving width info to arrow file, (end - start + 1), which is the exact width, is not used: https://github.com/GreenleafLab/ArchR/blob/2f022a448d8248a0f9afb33419bcbaeafe7731c0/R/CreateArrow.R#L1281 However, while reconstructing GRange object from that, (end - start) is directly used as width, leading to one bp lose in the final results: https://github.com/GreenleafLab/ArchR/blob/2f022a448d8248a0f9afb33419bcbaeafe7731c0/R/ArrowRead.R#L187

And I just don't quite understand why it's needed here to add 1 to start(grange).

Expected behavior Consistency between input bed(tabix) and getFragmentsFromArrow results

Many thanks!

rcorces commented 3 years ago

I believe the "inconsistency" you are referring to is because BED-format data is 0-indexed and GRanges are 1-indexed. This is exquisitely well described in this blog post - https://tidyomics.com/blog/2018/12/09/2018-12-09-the-devil-0-and-1-coordinate-system-in-genomics/ I do not believe there is any problem with the ArchR code for this. When creating an Arrow, you are convering from 0-indexed to 1-indexed, thus you need to adjust the start. But when reading an Arrow file, you're already in 1-indexed format so you do not need to adjust anything.

If you think that this does not answer your question, feel free to comment here and I will re-open the issue.

xiaosuyu1997 commented 3 years ago

Great help and many thanks. New to this field and miss some details.

One more question:

Suppose input is in bed(0-based open interval): chr1 0 1000 chr1 1000 2000

rtracklayer::import should give me(1-based closed interval): chr1 1 1000 chr1 1001 2000

But now in ArchR, this is what I get(1-based closed interval): chr1 1 999 chr1 1001 1999

Just wonder if that's the wanted feature?

rcorces commented 3 years ago

Hmmm. Sorry if I missed this point in your original post. @jgranja24 - thoughts?

jgranja24 commented 3 years ago

Hi @xiaosuyu1997,

I think you are correct it should be (start - end + 1). That is my bad, I switched the storage from start and stop to start and width and made this error. It is now fixed in the release_1.0.1 branch.

ex

> IRanges(79, 81)
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        79        81         3
rcorces commented 3 years ago

@xiaosuyu1997 - thanks for reporting this and explaining the bug clearly. My apologies for not catching this in your original post. Closing this.