GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License

Allowing Replicates to Be Annotated by a Variable Other than Sample in addGroupCoverages #514

Closed imk1 closed 3 years ago

imk1 commented 3 years ago

Describe the problem that your feature request would address.
I am working with a dataset that has multiple technical replicates for every biological replicate. As a result, a collaborator gave me a separate arrow file for each technical replicate and a metadata file indicating the technical and biological replicates for each cell. I would like to identify IDR reproducible peaks across the 2 biological replicates and combine all of the technical replicates for each cluster.

Describe the solution you'd like
Adding an option to addGroupCoverages that allows me to select a column other than Sample to annotate biological replicates would allow me to do this. Sample can be the default.

Describe alternatives you've considered
I could ask the collaborator to combine the bam files and re-create the arrow files, but this would require substantial additional storage space and compute time. I would not be surprised if other researchers also have a bam file from each of multiple technical replicates of each biological replicate and would prefer not to have to combine the files for each biological replicate before using ArchR.

badoi commented 3 years ago

Hi Irene,

Here's a test of what would probably solve your problem. I think the errors were put in place for good reason, to prevent accidental problems, so this is a workaround that should be used with caution.

```r
# just in case we try something stupid: back up the original Sample column
proj$Sample_old = proj$Sample

# throws error, probably good to halt any insanity
proj$Sample = paste(proj$Sample, 'silly_things', sep = '_')

# ignore error, force overwrite via the cellColData slot, oh my that worked
proj@cellColData$Sample = paste(proj$Sample, 'silly_things', sep = '_')

# control-z, undo undo! restore the original values and drop the backup
proj@cellColData$Sample = proj$Sample_old
proj@cellColData$Sample_old = NULL
```

rcorces commented 3 years ago

@imk1 - thanks for this suggestion. We haven't run into this in our typical workflow but I see the utility. I don't think it will be hard to add but it might still take time. In the meantime, the suggestion from @badoi seems like a good stopgap.

imk1 commented 3 years ago

@rcorces Thanks! The suggestion from @badoi seems to work. To clarify for other users, you can do this: proj@cellColData$Sample = proj$[column indicating biological replicate]

However, this fix seems to lead to the following error when running addGroupCoverages: Error in h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) : Error in h5checktypeOrOpenLoc(). Cannot open file. File 'NA' does not exist.
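One possible reading of this error, offered as an assumption rather than something confirmed in this thread: if Arrow file paths are kept in a vector named by the original sample names, looking them up with the overwritten Sample labels finds no match, and base R returns NA for an unmatched name. The sketch below reproduces only that base-R behavior. The file names and labels are hypothetical, and no ArchR code is run.

```r
# Hypothetical Arrow file paths, named by the original (technical replicate) sample names.
arrow_files <- c("bio1_techA" = "bio1_techA.arrow",
                 "bio1_techB" = "bio1_techB.arrow",
                 "bio2_techA" = "bio2_techA.arrow",
                 "bio2_techB" = "bio2_techB.arrow")

# Hypothetical labels after overwriting Sample with the biological replicate.
new_labels <- c("bio1", "bio2")

# Indexing a named vector by names it does not contain yields NA,
# which would then be passed on as a file path of NA.
looked_up <- unname(arrow_files[new_labels])
print(looked_up)

# Comparing the labels against the known names catches the mismatch early.
missing_labels <- setdiff(new_labels, names(arrow_files))
print(missing_labels)
```

If this is what happens, any overwritten label that is not also one of the original sample names would trigger the same failure.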

jgranja24 commented 3 years ago

Hi @imk1, sorry for the delay. I am still working on trying to implement more stability with this. I still don't exactly follow why you wouldn't just want to treat each sample separately, but I am hoping to have a fix soon.

imk1 commented 3 years ago

No worries about the delay, as one can always merge bam files of technical replicates (that just takes up a lot of space).

To give an example, let's say you have data from 2 mouse livers -- mouse liver 1 and mouse liver 2. Each mouse liver was split into 2 pieces. As a result, you have 4 samples -- mouse liver 1 piece A (abbreviated as 1A), mouse liver 1 piece B (abbreviated as 1B), mouse liver 2 piece A (abbreviated as 2A), and mouse liver 2 piece B (abbreviated as 2B). You sequenced each of these separately and mapped all 4 samples in parallel to speed up read mapping, so you now have 4 bam files. However, these 4 bam files represent 2 mice. If I were to make an arrow file out of each of these and run ArchR, these would be treated as 4 biological replicates even though they represent 2 mice. To prevent this from happening, I currently would merge 1A and 1B into a large bam file, merge 2A and 2B into a large bam file, make arrows out of those, and then run ArchR on those arrows. However, this leads me to have bam files taking up twice as much space as my original bam files, and some labs have limited storage space.
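The grouping described above can be sketched in base R as a lookup from each technical replicate (sample) to its biological replicate (mouse). This is only an illustration under stated assumptions: the per-cell sample vector and the labels `liver1` and `liver2` are hypothetical, and no ArchR functions are called.

```r
# Hypothetical per-cell sample labels, as a Sample column of cell metadata might hold them.
cell_sample <- c("1A", "1A", "1B", "2A", "2B", "2B")

# Lookup table from technical replicate (sample) to biological replicate (mouse).
sample_to_mouse <- c("1A" = "liver1", "1B" = "liver1",
                     "2A" = "liver2", "2B" = "liver2")

# Index the lookup by sample name to get one biological-replicate label per cell.
cell_mouse <- unname(sample_to_mouse[cell_sample])

# Fail early if any sample is missing from the lookup table.
stopifnot(!anyNA(cell_mouse))

# Cross-tabulate to confirm that 4 samples collapse into 2 mice.
print(table(cell_sample, cell_mouse))
```

A vector like `cell_mouse` is the kind of per-cell annotation that the requested option would let addGroupCoverages use in place of Sample, without merging any bam files.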

Implementing this feature is not that big of a deal, as not having it requires only 1 additional pre-processing step, but it occurred to me that others might also find it useful.