CCBR / RENEE

A comprehensive quality-control and quantification RNA-seq pipeline
https://CCBR.github.io/RENEE/
MIT License

CONFIG.JSON does not allow for re-run of pipeline with altered sample list #108

Closed · slsevilla closed this 9 months ago

slsevilla commented 9 months ago

Problem

There was a corrupted sample in my project. I removed that sample from the input dir I had been giving to RENEE, then attempted to resume the pipeline without it, and the pipeline is failing because the FQ file is missing.

The reason for this is that config.json has my sample info baked in. Because that file is never re-generated, the project will only ever run the samples that were present at initialization.
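
For illustration, a minimal sketch of why resume fails, assuming config.json carries the sample list captured at init (the `samples` key is an assumption, not necessarily RENEE's actual schema):

```python
import json

# Sketch: the sample list is frozen into config.json at init time, so
# later changes to the input directory are never seen on resume.
with open("config.json") as fh:
    config = json.load(fh)

# "samples" is an assumed key name for illustration; the real file may
# list samples in several places.
print(config["samples"])  # still includes the corrupted, now-deleted sample
```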

Solution

Without a major re-write, I'm thinking the easiest solution is to add an option where we can call a re-run command or something similar (see the sketch below). The alternative would be to create a new json on each deployment, but I have concerns with that, since users could alter portions of that file (another 'wish list' item would be to get rid of this file altogether).
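
For discussion's sake, a hypothetical `rerun` subcommand might look like this: rescan the input dir, refresh only the sample-bearing key, and leave the rest of the config untouched. Every name here (the subcommand, the `samples` key, the FASTQ suffix) is invented for illustration, not an existing RENEE interface.

```python
import argparse
import glob
import json
import os

def rerun(args: argparse.Namespace) -> None:
    """Hypothetical: rebuild the sample list from the FASTQs currently
    in the input dir, preserving everything else in config.json."""
    with open(args.config) as fh:
        config = json.load(fh)
    fqs = glob.glob(os.path.join(args.input_dir, "*.R1.fastq.gz"))
    config["samples"] = sorted(  # assumed key name, for illustration
        os.path.basename(f).removesuffix(".R1.fastq.gz") for f in fqs
    )
    with open(args.config, "w") as fh:
        json.dump(config, fh, indent=4)

parser = argparse.ArgumentParser(prog="renee")
sub = parser.add_subparsers(dest="command", required=True)
p = sub.add_parser("rerun", help="hypothetical: refresh samples in config.json")
p.add_argument("--config", default="config.json")
p.add_argument("--input-dir", required=True)
p.set_defaults(func=rerun)

if __name__ == "__main__":
    args = parser.parse_args()
    args.func(args)
```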

Thoughts @kelly-sovacool @kopardev

kelly-sovacool commented 9 months ago

Maybe we can figure out how to fix this without adding a new subcommand?

Could we just document somewhere that if you run renee again from the same directory after removing a sample, you'll also need to manually edit the config.json?

slsevilla commented 9 months ago

In order of preference:

1. Ideally: you do nothing other than change the inputDIR or the input manifest file (this is how CARLISLE and ASPEN are set up).
2. Next ideal: you tell it to recreate the json file with a subcommand.
3. Next ideal: you edit the json yourself.

The json file would require users to edit three locations (albeit close to one another) for every sample addition or removal. I'm worried that if a user makes an error in the json file, they won't be able to recover from it, since there is no way to recreate the file without re-initializing. What I imagine will happen is that users just re-run the whole pipeline, which seems wasteful, but probably easier?
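
If the manual route is what gets documented, a small helper script could at least make it less error-prone. A minimal sketch, assuming samples appear as entries in JSON lists (the real config may nest them differently):

```python
import json
import sys

def drop_sample(config_path: str, sample: str) -> None:
    """Remove `sample` from every list-valued field in config.json.

    Sketch only: covers all three (or more) locations in one pass so a
    partial hand-edit can't leave the file inconsistent.
    """
    with open(config_path) as fh:
        config = json.load(fh)

    def scrub(node):
        # Recursively walk dicts and lists, dropping exact matches.
        if isinstance(node, dict):
            return {k: scrub(v) for k, v in node.items()}
        if isinstance(node, list):
            return [scrub(v) for v in node if v != sample]
        return node

    with open(config_path, "w") as fh:
        json.dump(scrub(config), fh, indent=4)

if __name__ == "__main__":
    drop_sample(sys.argv[1], sys.argv[2])
```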

kelly-sovacool commented 9 months ago

Oof, this really underscores how much we need to get rid of config.json in our future redesign.

Another question: does resuming the snakemake workflow from this stage have any benefits? e.g. would it prevent rerunning certain steps of the pipeline that run for each sample (e.g. adapter trimming), or would it have to rerun those steps anyway?

slsevilla commented 9 months ago

Yes, in an ideal world.

Ideal: You tell Snakemake to resume. Snakemake looks at the file list (taken from the updated json file), thinks "I forgot this sample", and deploys only the rules needed to recreate the final outputs. It would run trimming on SampleNew, but not on SampleOld. It would, however, re-run any downstream step that takes both SampleNew AND SampleOld as input, like the RNA Report, which is what we want.

Current (add a new sample): You tell Snakemake to resume. Snakemake looks at the file list (still the old json file) and it thinks "I'm done", so nothing happens.

Current (delete a sample): You tell Snakemake to resume. Snakemake looks at the file list (still the old json file) and it thinks "there's a sample missing here - ERROR!" which is what happened in my actual use case. A sample was corrupted, I deleted that sample and re-ran, it errored because the sample it expected wasn't there. In my case about 50 jobs had already completed and I didn't want to start from scratch.
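
To make the mechanics concrete: all three scenarios above fall out of how Snakemake builds its DAG from the configured sample list. A minimal Snakefile sketch (rule names, paths, and the `samples` key are illustrative, not RENEE's actual rules):

```python
# Snakefile sketch: the sample list drives the DAG, so a stale
# config.json means a stale DAG on resume.
configfile: "config.json"

SAMPLES = config["samples"]  # frozen at init unless the file is regenerated

rule all:
    input:
        "report/rna_report.html"

# Per-sample rule: on resume, Snakemake skips samples whose outputs
# already exist and only runs the missing ones.
rule trim:
    input:
        "fastq/{sample}.R1.fastq.gz"
    output:
        "trim/{sample}.R1.trim.fastq.gz"
    shell:
        "touch {output}"  # placeholder for the real trimming command

# Aggregate rule: any change to SAMPLES changes its inputs, forcing a
# re-run of this step (or an error if a listed input no longer exists).
rule report:
    input:
        expand("trim/{sample}.R1.trim.fastq.gz", sample=SAMPLES)
    output:
        "report/rna_report.html"
    shell:
        "touch {output}"  # placeholder for the real report command
```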

slsevilla commented 9 months ago

Also a note: I'm not sure where in the script this happens, but the pyparser is still somehow grabbing the deleted sample and passing it to the final output file.

The pipeline failed at the RNA Report because the values for the deleted sample are zero, so it has trouble determining variance. This feels like a larger issue, since the python script just grabs anything in the output directories rather than having that information controlled by Snakemake. Might be a separate issue, depending on how we address this one...
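
To illustrate the difference (function and path names here are assumptions, not the actual pyparser code): gathering inputs by globbing the output directory picks up leftovers from deleted samples, while taking the list from the Snakemake-controlled config does not.

```python
import glob
import os

def gather_by_glob(outdir: str) -> list[str]:
    """Fragile: picks up stale per-sample files left behind by deleted
    samples, which is the failure mode described above."""
    return sorted(glob.glob(os.path.join(outdir, "*", "metrics.txt")))

def gather_from_config(outdir: str, samples: list[str]) -> list[str]:
    """Safer: only samples the workflow currently knows about are read,
    so leftovers from removed samples are ignored."""
    return [os.path.join(outdir, s, "metrics.txt") for s in samples]
```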

kopardev commented 9 months ago

@slsevilla, I believe Skyler's logic is too convoluted and this problem is encountered rarely. Plus, this will be fixed with the eventual transition to NF, where we will use a manifest approach rather than grabbing all FQs willy-nilly. IMO, we should:

kelly-sovacool commented 9 months ago

@kopardev I created a label wait_for_nxf just now so we can keep track of issues that are specifically on the back-burner because we will fix them when we reimplement the pipeline in nextflow. What do you think?