broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Azurize CellBender to run on ToA #367

Closed. aawdeh closed this 1 week ago.

aawdeh commented 3 weeks ago

We don't have GPU access in Terra on Azure (ToA), and we want to run Multiome (with CellBender) on Azure.

cellbender_remove_background_azure.wdl is a copy of cellbender_remove_background.wdl without the CUDA options for CellBender.

This PR Azurizes CellBender so that it runs in Multiome by (1) removing the --cuda option from the CellBender input parameters and (2) removing the runtime variables for GPU access when running on Azure.
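A minimal Python sketch of the two changes, treating the WDL runtime block as a plain dictionary and the CellBender command as an argument list. The function names are hypothetical, and the set of GPU-only keys is an assumption based on Cromwell's GCP runtime attributes (gpuType, gpuCount, nvidiaDriverVersion); it should be checked against the actual runtime block of cellbender_remove_background.wdl.

```python
from typing import Any, Dict, List

# Assumed GPU-only Cromwell runtime attributes (hypothetical selection;
# verify against the real WDL runtime block).
GPU_ONLY_KEYS = {"gpuType", "gpuCount", "nvidiaDriverVersion"}


def azure_runtime(gcp_runtime: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the runtime attributes without the GPU-only entries."""
    return {k: v for k, v in gcp_runtime.items() if k not in GPU_ONLY_KEYS}


def azure_command(args: List[str]) -> List[str]:
    """Drop the --cuda flag so that CellBender trains on CPU."""
    return [a for a in args if a != "--cuda"]


if __name__ == "__main__":
    gcp = {"cpu": 4, "memory": "32G", "gpuType": "nvidia-tesla-t4",
           "gpuCount": 1, "preemptible": 2}
    print(azure_runtime(gcp))
    print(azure_command(["cellbender", "remove-background",
                         "--input", "raw.h5", "--output", "out.h5", "--cuda"]))
```

Keeping the Azure variant as a filtered copy, rather than a separately maintained list, means any non-GPU setting added to the GCP runtime later carries over automatically.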

sjfleming commented 1 week ago

Looks good to me! Thanks.

sjfleming commented 1 week ago

One other thing to mention is that, per Janet Gainer-Dewar, the use of preemptible machines on Azure might not make sense currently. It sounds like there is no support for "checkpointFile" on Azure yet.

So if you run with hardware_preemptible_tries > 0, then you risk never actually finishing. CellBender will not pick up from its last checkpoint (since Terra on Azure doesn't support "checkpointFile"), and Janet said that they do not retry on a non-preemptible machine for the last run. So it might just fail if it continues to get preempted. I just worry about it a little because the run is bound to take a long time on CPU.
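To make this failure mode concrete, here is a toy Python model. It is hypothetical: simulate_task and its parameters are invented for illustration, and preemption is modeled at whole-epoch granularity. It assumes GCP-style behavior, where Cromwell makes one last attempt on a non-preemptible VM after the preemptible tries are used up, versus the Azure behavior described above, where there is no such fallback and no checkpoint to resume from.

```python
from typing import Optional, Sequence, Tuple


def simulate_task(
    total_epochs: int,
    preempted_after: Sequence[Optional[int]],
    checkpointing: bool,
    nonpreemptible_fallback: bool,
) -> Tuple[bool, int]:
    """Return (finished, epochs_of_compute_spent) for a toy training task.

    preempted_after[i] is how many epochs preemptible attempt i completes
    before being preempted (None means it is not preempted).
    """
    progress = 0  # epochs saved in a checkpoint and resumable
    computed = 0  # total epochs of compute spent across attempts
    for done in preempted_after:
        remaining = total_epochs - progress
        if done is None or done >= remaining:
            return True, computed + remaining
        computed += done
        if checkpointing:
            progress += done  # next attempt resumes from the checkpoint
        # without checkpointing, the next attempt restarts from epoch 0
    # Zero preemptible tries means the task simply runs on a regular VM.
    if nonpreemptible_fallback or not preempted_after:
        return True, computed + (total_epochs - progress)
    return False, computed


if __name__ == "__main__":
    # Two preemptions, no checkpoint, no fallback: the run never finishes.
    print(simulate_task(150, [100, 100], checkpointing=False,
                        nonpreemptible_fallback=False))
```

With checkpointing the second attempt only needs the remaining epochs; without it, every preemption throws away all work, and without a non-preemptible last attempt nothing guarantees completion. Setting the number of preemptible tries to 0 (an empty list here) sidesteps the problem.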

aawdeh commented 1 week ago

That makes sense. I could set hardware_preemptible_tries to 0 in the Azure run. What do you think? I can go ahead and make that change.

I was also wondering whether it would make sense to remove maxRetries from the runtime variables as well.

sjfleming commented 1 week ago

Setting hardware_preemptible_tries to 0 by default sounds like a good, conservative idea. Yeah, maxRetries > 0 is usually used on GCP to work around PAPI error code 2, which occurs when it (inexplicably) fails to install the GPU drivers. Retrying makes sense in that one limited case, since a retry can succeed. Most of the time, though, a retry does not make sense: if it was a CellBender failure, it will fail again on retry. So I think it makes sense to keep that argument but give it a default of 0.
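The point about maxRetries can be sketched as a generic retry loop that, assuming it behaves like Cromwell's maxRetries, retries on any failure without inspecting its cause. The function and the two example tasks are hypothetical; they only illustrate that a retry can rescue a transient failure but merely repeats a deterministic one.

```python
from typing import Callable


def run_with_retries(task: Callable[[int], str], max_retries: int = 0) -> str:
    """Run task(attempt) up to 1 + max_retries times; re-raise the last error."""
    for attempt in range(max_retries + 1):
        try:
            return task(attempt)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the final failure
    raise ValueError("max_retries must be >= 0")


def flaky_driver_install(attempt: int) -> str:
    # Transient failure: only the first attempt fails, like a one-off
    # GPU driver installation error.
    if attempt == 0:
        raise RuntimeError("PAPI error code 2")
    return "done"


def bad_input(attempt: int) -> str:
    # Deterministic failure: same inputs, same error on every attempt.
    raise ValueError("invalid input file")
```

With max_retries=1, flaky_driver_install succeeds on the second attempt, while bad_input fails identically however many retries are allowed, which is why a default of 0 is the safer choice.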

aawdeh commented 1 week ago

Looks good, I will merge.

Thank you!