SystemsGenetics / GEMmaker

A workflow for construction of Gene Expression count Matrices (GEMs), useful for Differential Gene Expression (DGE) analysis and Gene Co-Expression Network (GCN) construction.
https://gemmaker.readthedocs.io/en/latest/
MIT License

What is the deal with the GEMmaker staging directory #190

Closed: bentsherman closed this issue 3 years ago

bentsherman commented 3 years ago

I've been testing GEMmaker a lot on Palmetto recently, and one thing that keeps tripping me up with the main pipeline script is how it is affected by the special staging directory located in $NXF_WORK/GEMmaker. I understand how GEMmaker uses this directory to control how many samples are being processed at one time, but how is this directory supposed to behave across runs?
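A quick way to see what state is sitting in that staging area between runs is just to list it. The path is taken from this thread; NXF_WORK falls back to ./work when it isn't set:

```bash
# List whatever staging state GEMmaker has left between runs.
# NXF_WORK defaults to ./work when it is not set explicitly.
ls -R "${NXF_WORK:-work}/GEMmaker"
```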

On Palmetto I have found that main.nf tends to behave weirdly on resumed runs, and I think this staging directory is the root cause. Sometimes on a regular run, GEMmaker will finish all tasks and then just hang, never actually finishing. And then, on resumed runs, GEMmaker will mark the first few staging tasks as cached and then skip the entire workflow, producing no output or empty output files.

However, I've found that I can just delete this staging directory and everything seems to work as expected. GEMmaker will complete without hanging, and on resumed runs it will still cache results from previous runs. So maybe we should add something about this to the troubleshooting docs? Or perhaps just have GEMmaker delete this directory on every run if it's not trying to cache it anyway. Thoughts @spficklin ?
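For anyone hitting the same thing, the workaround boils down to something like this rough sketch (add whatever profiles and parameters you normally pass):

```bash
# Clear GEMmaker's staging state, then resume the run.
# The Nextflow task cache lives under the work directory itself, so previously
# completed tasks should still be reused on -resume.
rm -rf "${NXF_WORK:-work}/GEMmaker"
nextflow run main.nf -resume
```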

spficklin commented 3 years ago

Hey @bentsherman. I've not experienced this behavior before except, I think, when I have accidentally forgotten to remove the demo data that comes with GEMmaker. I do think we need to change GEMmaker so that it doesn't use the demo data by default, because that causes issues if someone forgets to do the cleanup.

bentsherman commented 3 years ago

@spficklin I think this problem usually happens for me when I change the input data. For example, if I run the example data, stop it halfway through (or it fails), and then start a new run with the Ath26k data, I think that messes things up because there would still be lock files from the example run sitting in these staging directories, which might cause GEMmaker to hang indefinitely. I don't know if this is the only cause though.
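A rough sanity check before switching datasets could look like the snippet below; the exact file patterns are assumptions based on the lock files and *.sample.csv files mentioned in this thread, so adjust as needed:

```bash
# Look for leftover lock/sample files from a previous input set.
find "${NXF_WORK:-work}/GEMmaker" -name '*.lock' -o -name '*.sample.csv'
```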

spficklin commented 3 years ago

Yes, I believe that would do it.

bentsherman commented 3 years ago

I think I have found another cause for staging directory weirdness. There is a retry_ignore profile that I use with the e2e pipeline so that I can get through a workflow run even if a few samples fail, because a few always do for one reason or another. But now I am also trying it out with the regular pipeline, and I think the staging logic does not handle this situation seamlessly.
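For context, running with that profile looks roughly like the sketch below; retry_ignore is assumed to be a profile defined in the pipeline's nextflow.config, and in Nextflow such a profile typically just sets process.errorStrategy so that failed tasks are retried and then ignored rather than killing the run:

```bash
# Hypothetical invocation; the retry_ignore profile name comes from this thread.
nextflow run main.nf -profile retry_ignore -resume
```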

So right now I'm processing the first 1000 Arabidopsis runs. Five samples failed at the download_runs step, and all of the other samples went all the way through. So now the workflow is basically done, but it's just sitting there doing nothing (see the code block below). Meanwhile, there are five sample files sitting in work/GEMmaker/process. I think GEMmaker is hanging because it will not finish until the process directory is empty, but it doesn't know what to do with those sample files since they failed.

Obviously I'm not very familiar with all of the scaffolding code that manages this directory, but I do think it's useful to have this retry_ignore profile because otherwise you kinda have to babysit large runs and that defeats the whole point of using Nextflow in the first place. @spficklin do you see an easy way to modify the staging logic so that failed samples are handled properly when Nextflow is configured to ignore them?

```
[bc/d0853a] process > retrieve_sra_metadata (1)               [100%] 1 of 1 ✔
[97/50762e] process > write_stage_files (ERX1659590)          [100%] 757 of 757 ✔
[fd/2f47bc] process > start_first_batch                       [100%] 1 of 1 ✔
[46/1a5e8a] process > read_sample_file (DRX092373.sample.csv) [100%] 757 of 757
[a4/f90288] process > next_sample (752)                       [100%] 752 of 752
[82/ae15ab] process > download_runs (DRX092373)               [100%] 757 of 757, cached: 197, failed: 5
[32/02c56a] process > fastq_dump (DRX092560)                  [100%] 752 of 752, cached: 193
[c6/9ad1bc] process > fastq_merge (DRX092560)                 [100%] 752 of 752, cached: 191
[c8/51767e] process > fastqc_1 (DRX092560)                    [100%] 752 of 752, cached: 187
[-        ] process > kallisto                                -
[-        ] process > kallisto_tpm                            -
[-        ] process > salmon                                  -
[-        ] process > salmon_tpm                              -
[10/660096] process > trimmomatic (DRX092563)                 [100%] 752 of 752, cached: 158
[81/550975] process > fastqc_2 (DRX092563)                    [100%] 752 of 752, cached: 155
[e1/c69806] process > hisat2 (DRX092563)                      [100%] 752 of 752, cached: 146
[a1/82992a] process > samtools_sort (DRX092563)               [100%] 752 of 752, cached: 144
[eb/6b6b70] process > samtools_index (DRX092563)              [100%] 752 of 752, cached: 144
[11/03e319] process > stringtie (DRX092563)                   [100%] 752 of 752, cached: 137
[10/962bc9] process > hisat2_fpkm_tpm (DRX092563)             [100%] 752 of 752, cached: 135
[-        ] process > multiqc                                 -
[-        ] process > create_gem                              -
[05/979eca] process > clean_sra (DRX092560)                   [100%] 752 of 752, cached: 193
[61/769989] process > clean_fastq (DRX092560)                 [100%] 752 of 752, cached: 191
[80/754b83] process > clean_merged_fastq (DRX092563)          [100%] 752 of 752, cached: 146
[f3/1f048c] process > clean_trimmed_fastq (DRX092563)         [100%] 752 of 752, cached: 146
[b3/fc05d3] process > clean_sam (DRX092563)                   [100%] 752 of 752, cached: 144
[9d/97264f] process > clean_bam (DRX092563)                   [100%] 752 of 752, cached: 135
[-        ] process > clean_kallisto_ga                       -
[-        ] process > clean_salmon_ga                         -
[89/3602b9] process > clean_stringtie_ga (DRX092563)          [100%] 752 of 752, cached: 134
```

bentsherman commented 3 years ago

I guess it hangs because multiqc and create_gem are waiting on all of the samples to finish before they can run.
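If it helps to confirm that, the stuck samples should be visible directly in the staging area (a hedged sketch; the process path comes from the comment above):

```bash
# Show which samples are still sitting in the staging "process" directory;
# per the run above, these should be the five samples that failed at download_runs.
ls "${NXF_WORK:-work}/GEMmaker/process"
```

From there, the clear-the-staging-directory-and-resume workaround from earlier in the thread is one way to get unblocked.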

spficklin commented 3 years ago

I believe GEMmaker was originally set up to fail when a sample fails. The idea was that if a sample failed, the user was expected to figure out what went wrong, and if it was something they could resolve, then the workflow could be restarted. On restart, GEMmaker will clean out the stage directory each time. I guess if people don't care whether every sample makes it all the way through, then a retry/ignore strategy would be fine, but you're right that the code that works with the staging directory can't handle that.

spficklin commented 3 years ago

Hey @bentsherman, we just merged pull request #204. I think the fixes in that PR will resolve some of the issues you had on Palmetto. Give it a try and let us know.

bentsherman commented 3 years ago

Actually, I meant to close this one earlier, since the troubleshooting docs were updated at some point with the fix that worked for me. Thanks for letting me know.