JinghaoLu / MIN1PIPE

A MINiscope 1-photon-based Calcium Imaging Signal Extraction PIPEline.
GNU General Public License v3.0

HPC support #12

Closed. ivosonntag closed this issue 5 years ago

ivosonntag commented 5 years ago

Dear Jinghao, we started running min1pipe on our 'local' HPC (Heidelberg) and came across a few issues that I wanted to share now. I'm not sure if the Issues section is the right forum for this or if the individual problems should be split up eventually.

First there were some minor changes that needed to be made in the code.

  1. The min1pipe main function now takes two additional input arguments, the path and the filename, so they no longer have to be selected manually (which is not possible with remote execution). The data_info function had to be changed accordingly: the uigetfile call was removed and path and filename were added as input variables.

  2. The isgpu variable set in the logdemons_unit function needs to be set to false, at least if the HPC node that will be used does not have a GPU.
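A minimal sketch of how these two changes could be made to work on both a workstation and an HPC node. The helper names `resolve_input` and `detect_gpu` are hypothetical and not part of MIN1PIPE; the sketch only assumes that the caller passes a path and filename on the cluster and nothing on a workstation, and that `gpuDeviceCount` (Parallel Computing Toolbox) is the way to probe for a GPU.

```matlab
function [path_name, file_name] = resolve_input(path_name, file_name)
% Hypothetical helper (not part of MIN1PIPE): use the supplied path and
% file name when given, and fall back to the GUI picker only otherwise.
    if nargin < 2 || isempty(path_name) || isempty(file_name)
        % Interactive session: let the user pick the file.
        [file_name, path_name] = uigetfile('*.*', 'Select data file');
        if isequal(file_name, 0)
            error('resolve_input:noFile', 'No data file selected.');
        end
    end
    if ~exist(fullfile(path_name, file_name), 'file')
        error('resolve_input:missing', 'File not found: %s', ...
              fullfile(path_name, file_name));
    end
end

function isgpu = detect_gpu()
% Hypothetical helper: true only if MATLAB can see at least one GPU.
    try
        isgpu = gpuDeviceCount > 0;   % requires Parallel Computing Toolbox
    catch
        isgpu = false;                % toolbox or driver missing: use the CPU
    end
end
```

With something like this, the same script could run unchanged on a CPU-only node, since `detect_gpu()` would return false there instead of requiring a manual edit.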

Apart from those two points, only CVX needs to be compiled for the appropriate system running on the HPC.

The code usually runs fine; however, I noticed that min1pipe sometimes crashed, each time with a different error message, but always in the frame_reg function (line 102 in the GitHub min1pipe version). In one case, re-running the script on a different node type (more RAM) fixed the issue, and I'm now re-running all datasets that caused a crash. On first inspection I can't see any obvious flaws in the datasets that could throw off the analysis. I also noticed that the nonstable_section function (especially nonstable-LogDemons) took a lot longer to compute in the datasets that crashed than in those that did not.

Here are the error messages:

1.

Done nonstable-section, 10215.5737 seconds
Begin inter-section ...
Done data prep
Done loop #1/6
Done loop #2/6
Done loop #3/6
Done loop #4/6
Done loop #5/6
Done loop #6/6
Done inter-section, 16805.9769 seconds
{Error using frame_stab (line 30)
Cannot index into 'reg' because indices cannot be empty.

Error in frame_reg (line 107)
m = frame_stab(m);

Error in min1pipe_HPC (line 106)
[m, corr_score, raw_score, scl] = frame_reg(m, imaxy, se, Fsi_new, pixs, scl, sigma_x, sigma_f, sigma_d);

Error in min1pipe_params (line 20)
[fname, frawname, fregname] = min1pipe_HPC(Fsi, Fsi_new, spatialr, se, ismc, flag,file_name,pathname);
}

2.

Done nonstable-section, 8245.2843 seconds
Begin inter-section ...
Done data prep
Done loop #1/6
Done loop #2/6
Done loop #3/6
Done loop #4/6
Done loop #5/6
Done loop #6/6
[Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.]
[> In parallel_function (line 599)
In inter_section (line 167)
In frame_reg (line 102)
In min1pipe_HPC (line 106)
In min1pipe_params (line 20)]
[Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.]
[> In parallel_function (line 599)
In inter_section (line 167)
In frame_reg (line 102)
In min1pipe_HPC (line 106)
In min1pipe_params (line 20)]
[Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.]
[> In parallel_function (line 599)
In inter_section (line 167)
In frame_reg (line 102)
In min1pipe_HPC (line 106)
In min1pipe_params (line 20)]
{Error using parallel.internal.pool.serialize (line 10)
Out of memory attempting to serialize data for transmission.

Error in distcomp.remoteparfor/addInterval (line 200)
data = parallel.internal.pool.serialize(varargin);

Error in inter_section (line 167)
parfor ip = 1: length(Yuse)

Error in frame_reg (line 102)
m = inter_section(m, sttn, se, pixs, scl, sigma_x, sigma_f, sigma_d);

Error in min1pipe_HPC (line 106)
[m, corr_score, raw_score, scl] = frame_reg(m, imaxy, se, Fsi_new, pixs, scl, sigma_x, sigma_f, sigma_d);

Error in min1pipe_params (line 20)
[fname, frawname, fregname] = min1pipe_HPC(Fsi, Fsi_new, spatialr, se, ismc, flag,file_name,pathname);
}

3.

Done nonstable-section, 7016.6009 seconds
Begin inter-section ...
Done data prep
Done loop #1/6
Done loop #2/6
Done loop #3/6
Done loop #4/6
Done loop #5/6
Done loop #6/6
{Error using /
Matrix dimensions must agree.

Error in batch_compute (line 9)
nbatch = ceil(nsize / memo);

Error in inter_section (line 143)
nbatch = batch_compute(nsize);

Error in frame_reg (line 102)
m = inter_section(m, sttn, se, pixs, scl, sigma_x, sigma_f, sigma_d);

Error in min1pipe_HPC (line 106)
[m, corr_score, raw_score, scl] = frame_reg(m, imaxy, se, Fsi_new, pixs, scl, sigma_x, sigma_f, sigma_d);

Error in min1pipe_params (line 20)
[fname, frawname, fregname] = min1pipe_HPC(Fsi, Fsi_new, spatialr, se, ismc, flag,file_name,pathname);
}
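The third trace fails on `nsize / memo`, which MATLAB only rejects with "Matrix dimensions must agree" when at least one operand is not a scalar. How `memo` is obtained inside batch_compute is not shown here, so the following is only a sketch of a defensive variant under that assumption. The function name `safe_batch_count` and the `fallback_memo` parameter are hypothetical and not part of MIN1PIPE.

```matlab
function nbatch = safe_batch_count(nsize, memo, fallback_memo)
% Hypothetical guard, not MIN1PIPE's batch_compute. nsize is the data size
% and memo the available memory, both in the same unit (e.g. bytes).
    if ~(isnumeric(nsize) && isscalar(nsize))
        error('safe_batch_count:badSize', 'nsize must be a numeric scalar.');
    end
    if ~(isnumeric(memo) && isscalar(memo) && isfinite(memo) && memo > 0)
        % The memory query failed or returned several values:
        % fall back to a fixed, user-chosen memory budget.
        memo = fallback_memo;
    end
    % Always process the data in at least one batch.
    nbatch = max(1, ceil(double(nsize) / double(memo)));
end
```

For example, `safe_batch_count(10, [], 8)` falls back to the budget of 8 and yields 2 batches instead of raising an error.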

In addition, it would be nice to include a timeout for the prompts asking whether to overwrite the old data or not. This way one wouldn't have to manually delete the old data before re-submission in case a job didn't run through. I will update as soon as I know whether re-running the analysis on a different node type fixed the errors, or maybe produced different ones.
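MATLAB's `input` blocks until it receives an answer, so one alternative to a timeout is to skip the prompt entirely in non-interactive sessions. The sketch below shows that alternative rather than a real timeout. The helper name `should_overwrite` and the `default_choice` argument are hypothetical and not part of MIN1PIPE, and `usejava('desktop')` is used here as an assumed proxy for "a user is present".

```matlab
function tf = should_overwrite(fname, default_choice)
% Hypothetical helper: decide whether an existing output file may be
% overwritten without ever blocking a batch job on a prompt.
    if ~exist(fname, 'file')
        tf = true;                  % nothing to overwrite
    elseif usejava('desktop')
        % Interactive desktop session: ask the user.
        reply = input(sprintf('%s exists. Overwrite? (y/n) ', fname), 's');
        tf = strcmpi(strtrim(reply), 'y');
    else
        tf = default_choice;        % batch job: use the preset answer
    end
end
```

On a cluster, a job script would call something like `should_overwrite(fname, true)` so that a re-submitted job replaces the output of a failed run.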

Best, Ivo

JinghaoLu commented 5 years ago

Hi Ivo,

Thanks for all the suggestions. MIN1PIPE was initially not optimized for HPC usage, so sorry for the issues you came across. The changes you made for HPC version all look good, and you are welcome to contribute to the package.

As for the crashes during the run, my overall guess is that they are caused by a miscalculation of the memory usage for the logdemons computation in the frame_reg part, assuming you are using the CPU for the movement correction section. The overflow then caused all kinds of other issues.

More info is still needed to reach a conclusion, so let me know when you have updates on the re-running.

Thanks,

Jinghao

ivosonntag commented 5 years ago

Hi Jinghao, the code ran fine in all cases on nodes with more RAM.

I also forked the repository and committed the changes noted above. I tested it on the demo dataset, both on a regular workstation with the GUI prompt to pick the data and on an HPC with the path specified in the demo script. Both worked.

Best, Ivo

JinghaoLu commented 5 years ago

Ivo,

I will take a look at the CPU version of frame_reg and find a better estimate of the RAM needed. Meanwhile, you are welcome to create a pull request.

Jinghao

ivosonntag commented 5 years ago

I created the pull request.