immanuelazn commented 3 months ago

Description

Add macs2/macs3 peak calling Utilize a multi-threaded approach separated by cluster within prep_macs_inputs(). Allow users to choose a specific step to run (ie "prep-inputs", "run-macs", "read-outputs") to allow for executing macs calls in external slurm cluster.
Also change behaviour in InsertionIterator to not keep insertions in memory after moving to next fragment.

Tests

Added test cases for prep_macs() parameterized by cluster existence, threading, and insertion_mode

bnprks commented 3 months ago

Here's some initial comments I came up with from a read-through. Overall looks like a pretty good structure and approach. I'm focusing on some higher-level comments here.

There are also several small things that I think could be tweaked around choices of naming, docs, default arguments, organization etc. Rather than dictate solutions to those while you're still drafting, I'll let you finish up and do your own polishing pass before we discuss those.

C++ comments:

I'm pretty sure that writeInsertionBedByPseudobulk should write out duplicate coordinates rather than explicitly skipping printing repeats
Once we get a chance to do a performance profile, we should check if gzprintf is taking up a bunch of time (and comparing with/without a .gz file ending to isolate if the compression or the printf part is slow)
- We might also want to take a quick look to see if macs3 can support any other kinds of compressed input that would be faster to write
- We will definitely need to debug the memory issue you identified in InsertionIterator as well

R comments

With the option to split up execution using step, I'm a bit concerned about errors that might happen if users edit the input folder between steps, or try calling a later step with arguments that are mismatched/incompatible with their earlier step
- This might mean making cell_names non-nullable, and maybe even fragments, though that might not be worth it given the usability trade-off.
prep-inputs should probably write the macs command to a .sh file (one-per-cluster). Also returning the commands is as a vector is fine, but later steps could then read from the .sh files to simplify
- Also we usually have a few more arguments present, including e.g. --shift. You might want to check the ArchR source a bit more for that as I think arguments are listed in at least two separate places
For mclapply we probably also want to always use the argument mc.preschedule=FALSE

immanuelazn commented 3 months ago

There are a few more tests to be added that directly compare writing insertions from fragments in ArchR vs BPCells. I might just copy the output files to eliminate the dependence of building an entire ArchR project.
Otherwise, I think I've addressed most of your comments, and the feature is in a relatively polished state.

bnprks commented 3 months ago

Looking pretty good! Most remaining comments are style-related, though I did end up with a large number of those in call_macs_peaks(). Trying out a bulleted list here in rough order of the code rather than using Github's inline comment feature since it's a bit easier for me to prep.

Good job setting up the priority queue InsertionIterator -- I'm fine leaving it as a stopgap for this PR but I'd really like to get the radix sort algorithm fixed up and running soon after we merge this.

InsertionIterator

- Very cool that you were able to compare performance between the priority queue and the sorting version! - Minor style note but `*(starts + i)` is the same as `starts[i]`

writeInsertionBedByPseudobulk()

- Given that this writes out just one pseudobulk, I'm not sure why both the `cells` and `group_of_interest` arguments are needed. Arguably, you could even just pass in `fragments` to already be subset to just the cells of interest from the R side with `select_cells()` - At least filtering prior to `InsertionIterator` will save some sorting work - Unclear if `keep_dups` argument is needed (more discussion below). This also might eliminate the need for the `cleanup at end of chromosome` section - Error messages shouldn't be prefixed with `writeBedgraph:` anymore - I'd suggest just calling this function just `writeInsertionBed` for simplicity. As a cleanup task I can edit `writeInsertionBedgraph` to just write one pseudobulk at a time anyhow as I think that will make the most sense - Either this file name will need an update or this function should go into its own file because it doesn't handle `Bedgraph` files. (I'd be happy to rename to maybe `BedWriter` since bedgraph is kinda just another type of bed file)

prep_macs_inputs()

- Does this need to be a separate exported function? Seems like it could mostly slot in to `call_macs_peaks` or at least be private

call_macs_peaks()

1. Make docs and parameter names consistent with `call_peaks_tile` (cell_groups and effective_genome_size arguments). Can use `@inheritParams` to save on re-typing docs if the text will be identical. - You might also consider copying the default initialization of `cell_groups`, though I'm not sure how that would work if we allow `fragments` to be NULL 2. `path` docs. Put directory structure in `@details` section. Specify inputs are `/input/.bed(.gz)` and shell commands are `/input/.sh` 3. `insertion_mode` docs don't quite read naturally. If you want consistency with `write_insertion_bedgraph` I'd update both to say maybe "for coverage calculation" rather than "for insertion counts calculation" 4. Do we need the `keep_dups` option? I was just looking through the ArchR code and I don't think it would ever de-duplicate (and as mentioned previously I think it's likely undesirable to remove duplicate insertion sites in a peak calling context). Though if your testing indicates otherwise we can discuss 5. `step` argument should default to `all` 6. `macs_version` could probably be `macs_executable` instead, and allow the user to pass a full path if desired. (And I'd say default to macs3 at this point, or make the default NULL and try to auto-detect macs2 vs macs3 with test `system()` calls) - Either way, we probably want to check if macs can be run before prepping any inputs in the case of `step = all` 7. `additionalParams` should be snake case like you have it in the docs. It also might be nice if the name made it clear that these are parameters getting passed to MACS - Still don't have a `--shift` argument being passed, which might make sense to add here 8. `quiet`: If we multi-thread `run-macs`, then we might want to write the macs stdout/stderr to a log file in `outputs/` and have this just print when we start writing each input file and when we start running each copy of macs (maybe also printing when each finishes?). Also, for consistency this should flip and be named `verbose` 9. `use_gz` docs would be helpful if they could give some absolute file sizes + time to write with /without compression for a small-ish example dataset (e.g. one of the public 10x pbmc datasets) 10. `threads` argument in `call_macs_peaks` should probably also be used in the `run-macs` section 11. Return value should be a single data frame, with an extra column specifying the group that the cluster was called in. Try to place the group column early in the ordering (probably either first or right before name?). A list of data frames is a little awkward to deal with as output 12. Will the `system2` call work properly when there are spaces in cluster names? Might want to add some quoting to `macs_call_template`. Also maybe get rid of any `/` character in a cluster names? 13. In docs, `bedfile` should be split up as two words for style

tests

- Looks reasonable to me. I like your tests that the insertion bed file has the right contents. If `prep_macs_inputs` stops being a public/separate function, you could adjust to just call `call_macs_peaks(step="prep-inputs")` in that test

immanuelazn commented 3 months ago

I think I'm happy with the current state, but please let me know if you see any more glaring changes I should be making. It will be a little bit of a journey attempting to extract insertion beds from ArchR, while removing the hardwired pseudobulk replicates and tn5 bias steps. To make sure I can continue progressing, I suggest that I table creating a testcase from comparing outputs from ArchR. Getting ArchR to skip those steps will require a decent amount of software engineering time by itself. Would that be reasonable?

immanuelazn commented 3 months ago

Ok it should be ready! I added in the test comparing Archr and BPCells results. The only change required is setting devtools::load_all() to target your bpcells/archr install.

bnprks commented 3 months ago

Overall, I like how this is going and fundamentally it's quite close to being ready. A lot of the details I asked for from the last round are fixed. I like the ordering you chose for the call_macs_peaks parameters, and making write_insertion_bed have a matching interface to write_insertion_bedgraph. The performance testing and comparison against ArchR outputs I imagine were tricky and time-consuming, but good job on them.

There are several more detail-oriented issues remaining though, several from my last round of suggestions which were either forgotten or only partially addressed. I've also included a few new ones that came up as I was able to try out the function more.

Sorry for having taken so much of the week to get back to you on this, I hope it hasn't been too much of a blocker. I'd be happy to do a call early next week to discuss any questions/clarification about this round of comments.

Follow-ups still to be addressed from last comment round

`writeInsertionBed` - After removing the `keep_dups` argument, there's no reason to have the variable `last_base` and hence the end-of-chromosome cleanup step. `call_macs_peaks()` - `path` docs in details section. - For input files, I'd describe contents/purpose of the files. E.g just "shell commands" doesn't say what they are for. "Bed files" doesn't explain what the contents are. - For outputs, for the files BPCells doesn't use, we can just link to the macs3 documentation on output files. - Very useful to say we only read the `narrowPeaks` file, but you need an extra line break so it doesn't get swallowed into the last bullet when rendered as a webpage - In the docs for the `path` argument, you should add a (see Details for directory structure) or equivalent message - `macs_version` was not renamed in the function declaration, only in the docs. Also, using `match.arg` means that it's not possible for the user to pass a custom path for a macs executable - `use_gz` docs would be helpful to give numbers for the time trade-off as well. Most straightforward would be to give both time + input folder size for with and without compression. - `threads` argument in `call_macs_peaks` should probably also be used in the `run-macs` section - Return value "cluster" column would ideally not be the last column - Will the `system2` call work properly when there are spaces in cluster names? Might want to add some quoting to `macs_call_template`. Also maybe get rid of any `/` character in a cluster names?

New comments:

Verbose logging setup:

- `log_message` variable will not be defined in "run-macs" section if run standalone (and I also don't think it's necessary to include that information in the log file) - In `write_insertion_bed()`, the logging should be inside the `mclapply` call so it prints at the correct time. (Despite the theoretical race condition, I think this basically always works out fine even in very tight loops) - The style of log you use in `run-macs` is better also, rather than printing out cluster names on individual lines - Example of the current (sub-optimal) output ``` [1] "2024-08-23 16:10:15.508477 Writing bed file for clusters:" [1] "consonant" [1] "vowel" [1] "2024-08-23 16:10:15.532154 Finished writing bed files" ``` - I'd suggest `format(Sys.time())` to avoid having the fractional seconds get printed - I think using `message()` rather than `print()` will be better to avoid the `[1]` prefixes getting printed

run-macs and error checking:

- `macs_executable` now has a slightly weird combination of both auto-detect and a default value. I'd say the best option is to default NULL, and in the case of NULL we can do your auto-detect procedure. If not null, we just check that `macs` can be found else we error out - Also, the existing `stop_run` and `log_message` variables can probably be eliminated since I don't think there's a need / easy way to hang on to `log_message` for use in later steps. - The existing check on `macs_executable` should check the output of `macs3 --version` to make sure it contains the text `macs` -- otherwise the check will pass for almost any executable even if it's not macs - In `run-macs`, it's probably better for `system2` to write stdout/stderr directly to the log file so that logs are written as the command is running rather than all at the end. It's OK if this prevents writing the command as it appears MACS already prints the input command at the top of the log file - Would be good to check in `run-macs` that the command ran without error (had a 0 exit code)

Other:

- The test for `macs_e2e_works` now introduces a system dependency on having `macs2` installed. Can we make it skip instead if `macs` is not installed? - `--call-summits` probably can go in `--additional-params`, as it seems to just affect the peak calling method used by macs while leaving the output file format consistent - in `write_insertion_bed`, `cell_groups` should probably come after `path` for consistency with `write_insertion_bedgraph`, and `cell_groups` should maybe default to NULL for consistency.

Questions/discussion areas (can do via call/slack)

- Is there a better name possible for `use_gz`? I'm just realizing the name doesn't really give any hints about what is getting compressed, so it would be nice to have a name that would be more specific to the macs bed insertion inputs. Let me know if you have a suggestion here - What do you think of the adjusting column names/values in the output? Reading the [[UCSC narrowPeak format docs](https://genome.ucsc.edu/FAQ/FAQformat.html#format12)](https://genome.ucsc.edu/FAQ/FAQformat.html#format12), I think it's kinda weird that `pValue` is given as -log10(p), and peak/pointSource is given relative to `start`. On the one hand it's nice to maintain a direct correspondence with the macs output, but some parts of it are kind of non-intuitive

bnprks / BPCells

Macs peak calling #118

Description

Tests