ebecht / infinityFlow

Running multiple imputations with same temp folder appends files #2

Closed NKInstinct closed 3 years ago

NKInstinct commented 3 years ago

Hi Etienne - I think I've found a bug in the infinity flow pipeline and I wanted to bring it to your attention.

The problem arises when I run multiple imputations with infinity_flow (on different data each time) without specifying the path_to_intermediary_results argument. It looks like all of the runs use the same temp folder for their intermediary results, and instead of overwriting a previous run's temp results, the pipeline appends to them.

This results in each imputation having all the cells from all the previous imputations saved to it as well. For example, if I run three imputations and have it set to down-sample to 10,000 cells for each imputation, the first resulting fcs file has 10,000 cells (as expected), but the second has 20,000 and the third has 30,000.

This problem happens whether I run the multiple imputations using purrr, a for loop, or even manually copying and pasting each run, so I don't think it's a problem with how I'm running multiple imputations.

The easy work-around is to always specify a path_to_intermediary_results, which completely fixes the problem (and highlights the suspected culprit), so it's a very minor problem. I thought I should let you know in case others are wondering why their runs are taking progressively longer and their files are getting progressively bigger if they run more than one imputation in a session.
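A minimal sketch of that workaround, assuming nothing about the package beyond the path_to_intermediary_results argument mentioned above. The helper name new_run_dir and the dataset labels are hypothetical, and the infinity_flow call is left as a comment because its other arguments depend on the experiment:

```r
# Hypothetical helper: create one fresh intermediary folder per run.
new_run_dir <- function(label, root = tempdir()) {
  # tempfile() returns a path that does not currently exist under `root`
  run_dir <- tempfile(pattern = paste0(label, "_"), tmpdir = root)
  dir.create(run_dir, recursive = TRUE)
  run_dir
}

datasets <- c("sample_A", "sample_B", "sample_C")  # placeholder labels
run_dirs <- vapply(datasets, new_run_dir, character(1))

# Each run would then get its own folder, for example:
# infinity_flow(..., path_to_intermediary_results = run_dirs[["sample_A"]])
```

Because every run writes to a folder no earlier run has touched, there are no leftover intermediary files to append to.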

Thanks for making such a great tool!

-Andrew

ebecht commented 3 years ago

Hi Andrew,

Many thanks for taking the time to investigate the bug and report it.

I think the easiest way to fix this would be to check the subdirectories within tempdir() with a while loop until we find a folder name that doesn't exist yet, then create that first non-existing directory. I believe that would solve the issue you mention.
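The idea could be sketched roughly as follows. This is only an illustration of the while-loop approach, not the code that ended up in the package; the function name first_free_subdir and the "infinity_flow_" prefix are hypothetical:

```r
# Hypothetical helper: find and create the first subdirectory of `root`
# that does not exist yet (prefix1, prefix2, ...).
first_free_subdir <- function(root = tempdir(), prefix = "infinity_flow_") {
  i <- 1
  candidate <- file.path(root, paste0(prefix, i))
  while (dir.exists(candidate)) {
    i <- i + 1
    candidate <- file.path(root, paste0(prefix, i))
  }
  dir.create(candidate, recursive = TRUE)
  candidate
}

first <- first_free_subdir()
second <- first_free_subdir()  # the previous folder now exists, so this is a new one
```

Calling the helper once per run gives each run its own default folder, even within a single R session.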

As you noticed, in the meantime you can specify this argument yourself as a temporary fix, but I'll make sure to fix this for others as well. Also note that restarting R should give you a new tempdir(), so running the pipeline in a dedicated session for each dataset should also work.

Many thanks and I hope this is useful otherwise! Best, Etienne

NKInstinct commented 3 years ago

Thanks for looking into it, and yes - everything is great otherwise!

Best,

-Andrew

ebecht commented 3 years ago

Hi Andrew,

Thanks again for reporting. This should now be fixed in the GitHub version and will reach Bioconductor later. Feel free to let me know if you run into other issues!

Best, Etienne