AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.
BSD 3-Clause "New" or "Revised" License

Add script to update checkpoint directories for scpca-nf #171

Closed jashapiro closed 1 year ago

jashapiro commented 1 year ago

This PR, currently in draft form, is designed to complete https://github.com/AlexsLemonade/ScPCA-admin/issues/408. To do so, the script here performs two tasks: moving checkpoint files from the old location to the new one (as we changed the default directory from internal to checkpoints), and adding scpca-meta.json files to the checkpoint directories as needed.

In its current form, it processes scRNAseq samples (rad files) and vireo results (which already have scpca-meta.json files in all versions).

The basic idea is to generate the checkpoint directories in the same way scpca-nf does (as of the future version 0.4, as I am calling it), then copy file contents from the previous location to the current one. I used the aws command line for this part, as boto3 doesn't have a built-in recursive copy or sync (and I didn't trust myself to write one), though I use boto3 for more atomic operations like checking if files exist and writing the json file.
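As a rough illustration of the copy step described above, the following sketch builds an `aws s3 cp --recursive` command and optionally runs it with `subprocess`. It is not the script from this PR: the function names, bucket name, and prefix layout are all assumptions made for the example.

```python
import subprocess


def build_copy_command(bucket, old_prefix, new_prefix):
    """Build an `aws s3 cp` command that recursively copies one S3 prefix to another.

    The bucket and prefix values are hypothetical; the real script derives them
    from its own arguments.
    """
    source = f"s3://{bucket}/{old_prefix.strip('/')}/"
    dest = f"s3://{bucket}/{new_prefix.strip('/')}/"
    return ["aws", "s3", "cp", source, dest, "--recursive"]


def copy_checkpoint(bucket, old_prefix, new_prefix, dry_run=True):
    """Return the copy command, running it only when dry_run is False."""
    cmd = build_copy_command(bucket, old_prefix, new_prefix)
    if not dry_run:
        # check=True raises CalledProcessError if the aws CLI exits non-zero
        subprocess.run(cmd, check=True)
    return cmd
```

Keeping the command construction separate from its execution makes it possible to inspect what would be copied before touching any real data.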

For files that require the scpca-meta.json file, we generate all of the fields that would be created by the workflow, then write that file to the appropriate checkpoint directory.
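A minimal sketch of writing such a file might look like the following. The field names here are placeholders and not necessarily the fields scpca-nf writes; the client is passed in so that any object with a boto3-style `put_object` method can be used.

```python
import json


def build_meta(library_id, sample_id, technology, workflow_version):
    """Assemble a metadata dictionary (field names are hypothetical)."""
    return {
        "library_id": library_id,
        "sample_id": sample_id,
        "technology": technology,
        "workflow_version": workflow_version,
    }


def write_meta(s3_client, bucket, checkpoint_prefix, meta):
    """Write scpca-meta.json under a checkpoint prefix and return its key.

    s3_client is expected to behave like boto3.client("s3").
    """
    key = f"{checkpoint_prefix.rstrip('/')}/scpca-meta.json"
    body = json.dumps(meta, indent=2).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key
```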

Before doing either, it checks if there are files in the new location, and will not overwrite unless explicitly told to do so by an option.
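The overwrite check could be sketched roughly as below, assuming a boto3-style S3 client. The function names and the `overwrite` flag are illustrative and not taken from the script itself.

```python
def prefix_has_files(s3_client, bucket, prefix):
    """Return True if at least one object exists under the given prefix.

    s3_client is expected to behave like boto3.client("s3"); asking for a
    single key is enough to know whether the location is empty.
    """
    response = s3_client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix.rstrip("/") + "/",
        MaxKeys=1,
    )
    return response.get("KeyCount", 0) > 0


def should_copy(s3_client, bucket, new_prefix, overwrite=False):
    """Copy when the destination is empty, or when overwriting is explicitly allowed."""
    return overwrite or not prefix_has_files(s3_client, bucket, new_prefix)
```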

Speaking of options, there are many, because there are a lot of things that are relatively constant, but could potentially change. I ended up putting a lot of these into options partly just so I could pass them around as part of the args dictionary.
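For illustration, one way to collect options into a dictionary that can be passed between functions is `argparse` combined with `vars()`. The option names and defaults below are invented for the example and are not the script's actual options.

```python
import argparse


def parse_options(argv=None):
    """Parse command line options and return them as a plain dictionary.

    All option names and default values here are hypothetical.
    """
    parser = argparse.ArgumentParser(
        description="Update checkpoint directories (illustrative sketch)"
    )
    parser.add_argument("--bucket", default="example-bucket")
    parser.add_argument("--prefix", default="example-prefix")
    parser.add_argument("--old-dir", default="internal")
    parser.add_argument("--new-dir", default="checkpoints")
    parser.add_argument("--overwrite", action="store_true")
    # vars() converts the Namespace into a dict keyed by option name
    return vars(parser.parse_args(argv))
```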

One thing I don't do at the moment is really check which version of the workflow/nextflow was used at each stage, so everything at the moment gets the . We don't really have that in a file that can be extracted within the checkpoint directory itself, but I could look at the publish directory and pull it from the output json file. This is probably worth doing, and will be the next thing I work on.

After that, I will return to the bulk and spatial functions, which exist now mostly as stubs.

Note also that the current prefix argument sends us to our sort of working directory. When run for real, this will be changed to scpca-prod.

Please do let me know how you think this looks as a general direction, and if you see any other issues or places where the function of the code is unclear.

jashapiro commented 1 year ago

> One thing I don't do at the moment is really check which version of the workflow/nextflow was used at each stage, so everything at the moment gets the . We don't really have that in a file that can be extracted within the checkpoint directory itself, but I could look at the publish directory and pull it from the output json file. This is probably worth doing, and will be the next thing I work on.

Update: this is now present for the scRNAseq data. It turns out the only thing we seem to have in the output is the workflow version (not nextflow), so that is all I am grabbing.
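Pulling the version from an output json file might be sketched as follows. The key name `workflow_version` and the fallback behavior are assumptions, since the thread does not show the layout of the output file.

```python
import json


def workflow_version_from_output(json_text, default=None):
    """Return the workflow version recorded in an output json document.

    The "workflow_version" key is a hypothetical name; the default is
    returned when the key is absent.
    """
    metadata = json.loads(json_text)
    return metadata.get("workflow_version", default)
```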

jashapiro commented 1 year ago

This script is now done, tested and run!