Closed immanuelazn closed 2 months ago
Here are some initial comments I came up with from a read-through. Overall, it looks like a pretty good structure and approach. I'm focusing on some higher-level comments here.
There are also several small things that I think could be tweaked around choices of naming, docs, default arguments, organization etc. Rather than dictate solutions to those while you're still drafting, I'll let you finish up and do your own polishing pass before we discuss those.
C++ comments:
- writeInsertionBedByPseudobulk should write out duplicate coordinates rather than explicitly skipping printing repeats
- gzprintf is taking up a bunch of time (and comparing with/without a .gz file ending to isolate if the compression or the printf part is slow)
- macs3 can support any other kinds of compressed input that would be faster to write
- InsertionIterator as well
R comments:
- step, I'm a bit concerned about errors that might happen if users edit the input folder between steps, or try calling a later step with arguments that are mismatched/incompatible with their earlier step
- cell_names non-nullable, and maybe even fragments, though that might not be worth it given the usability trade-off.
- prep-inputs should probably write the macs command to a .sh file (one-per-cluster). Also returning the commands as a vector is fine, but later steps could then read from the .sh files to simplify
- --shift. You might want to check the ArchR source a bit more for that as I think arguments are listed in at least two separate places
- mclapply we probably also want to always use the argument mc.preschedule=FALSE
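To make the duplicate-coordinates comment from the C++ list concrete, here is a minimal sketch of a writer that emits one BED line per insertion, repeats included. It is not the writeInsertionBedByPseudobulk implementation: the Insertion struct, the function names, and the choice to write each site as the interval [coord, coord + 1) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for illustration only; not the BPCells data layout.
struct Insertion {
    std::string chr;
    uint32_t coord;  // 0-based insertion site (assumed convention)
};

// Write one BED line per insertion and return the number of lines written.
// Input is assumed to be sorted by coordinate within each chromosome.
// Repeated coordinates produce repeated lines; nothing is skipped.
inline std::size_t write_insertion_bed(std::ostream &out,
                                       const std::vector<Insertion> &insertions) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < insertions.size(); i++) {
        const Insertion &ins = insertions[i];
        if (i > 0 && insertions[i - 1].chr == ins.chr) {
            // Sorted input is a precondition of this sketch.
            assert(insertions[i - 1].coord <= ins.coord);
        }
        // Assumed output convention: a 1 bp interval [coord, coord + 1).
        out << ins.chr << '\t' << ins.coord << '\t' << (ins.coord + 1) << '\n';
        n++;
    }
    return n;
}

// Convenience wrapper that returns the BED text as a string.
inline std::string insertion_bed_string(const std::vector<Insertion> &insertions) {
    std::ostringstream os;
    write_insertion_bed(os, insertions);
    return os.str();
}
```

With this shape, two insertions at the same site come out as two identical lines, so a downstream tool that counts lines sees both of them.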
There are a few more tests to be added that directly compare writing insertions from fragments in ArchR vs BPCells. I might just copy the output files to eliminate the dependency on building an entire ArchR project.
Otherwise, I think I've addressed most of your comments, and the feature is in a relatively polished state.
Looking pretty good! Most remaining comments are style-related, though I did end up with a large number of those in call_macs_peaks(). Trying out a bulleted list here in rough order of the code rather than using GitHub's inline comment feature since it's a bit easier for me to prep.
Good job setting up the priority queue InsertionIterator. I'm fine leaving it as a stopgap for this PR but I'd really like to get the radix sort algorithm fixed up and running soon after we merge this.
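For anyone following the thread, the priority-queue idea can be sketched roughly as below. This is an illustration under stated assumptions, not the InsertionIterator code in this PR: the Fragment and Insertion structs and the sorted_insertions function are hypothetical, both fragment endpoints are treated as insertion sites as given, and the result is collected into a vector where a real iterator would yield lazily. Fragments arrive sorted by start, so starts can be emitted directly, while ends wait in a min-heap until no later fragment can start before them.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical types for illustration; not the BPCells structures.
struct Fragment {
    uint32_t start, end, cell;
};
struct Insertion {
    uint32_t coord, cell;
};

// Comparator that turns std::priority_queue into a min-heap on coord.
struct LaterCoord {
    bool operator()(const Insertion &a, const Insertion &b) const {
        return a.coord > b.coord;
    }
};

// Given fragments sorted by start, return all insertions sorted by coordinate.
std::vector<Insertion> sorted_insertions(const std::vector<Fragment> &frags) {
    std::priority_queue<Insertion, std::vector<Insertion>, LaterCoord> pending;
    std::vector<Insertion> out;
    out.reserve(frags.size() * 2);
    uint32_t prev_start = 0;
    for (const Fragment &f : frags) {
        assert(f.start >= prev_start);  // input must be sorted by start
        assert(f.end >= f.start);
        prev_start = f.start;
        // Any pending end at or before this start can no longer be preceded
        // by a later fragment, so it is emitted and dropped from memory.
        while (!pending.empty() && pending.top().coord <= f.start) {
            out.push_back(pending.top());
            pending.pop();
        }
        out.push_back(Insertion{f.start, f.cell});
        pending.push(Insertion{f.end, f.cell});
    }
    // Drain the ends that are still waiting after the last fragment.
    while (!pending.empty()) {
        out.push_back(pending.top());
        pending.pop();
    }
    return out;
}
```

The heap only ever holds ends of fragments that overlap the current position, which is where the memory bound comes from. Each push and pop costs O(log k) for k pending ends, and that per-insertion overhead is what a radix sort over batches would aim to avoid.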
I think I'm happy with the current state, but please let me know if you see any more glaring changes I should be making. It will be a bit of a journey to extract insertion beds from ArchR while removing the hardwired pseudobulk replicates and Tn5 bias steps. To make sure I can continue progressing, I suggest that I table creating a test case that compares outputs against ArchR. Getting ArchR to skip those steps will require a decent amount of software engineering time by itself. Would that be reasonable?
Ok, it should be ready! I added the test comparing ArchR and BPCells results. The only change required is setting devtools::load_all() to target your bpcells/archr install.
Overall, I like how this is going and fundamentally it's quite close to being ready. A lot of the details I asked for from the last round are fixed. I like the ordering you chose for the call_macs_peaks parameters, and making write_insertion_bed have a matching interface to write_insertion_bedgraph. The performance testing and comparison against ArchR outputs I imagine were tricky and time-consuming, but good job on them.
There are several more detail-oriented issues remaining though, several from my last round of suggestions which were either forgotten or only partially addressed. I've also included a few new ones that came up as I was able to try out the function more.
Sorry for having taken so much of the week to get back to you on this, I hope it hasn't been too much of a blocker. I'd be happy to do a call early next week to discuss any questions/clarification about this round of comments.
New comments:
Description
Add macs2/macs3 peak calling. Utilize a multi-threaded approach separated by cluster within prep_macs_inputs(). Allow users to choose a specific step to run (i.e. "prep-inputs", "run-macs", "read-outputs") to allow for executing macs calls on an external Slurm cluster.
Also change behaviour in InsertionIterator to not keep insertions in memory after moving to the next fragment.
Tests