Parallel processing errors when reading large bin_widths

naomicxia commented 5 years ago

Hello, I was wondering if you have tried to test the upper limit of how much you can increase bin_widths?

I was able to do motion correction on a 512x512x80000 .mat file using bin_widths of 10,000. I usually run this on a 40-core machine with 500 GB of RAM. It takes about 2-3 hours with 12 parallel workers, which I start at the beginning with the gcp command. I do not enable memory mapping, and I pass the path to the file rather than loading the input .mat file into memory.

In general, the output seems motion corrected; however, my videos have significant slow z-drift throughout. When I attempted to use bin_widths of 30,000, I ran into several problems:

1) If I monitor the progress of the script, it appears to call parpool several times (it starts 12 workers at the beginning and then calls parpool again in the middle, sometimes twice more). However, the process is extremely slow, and towards the end only a few cores seem to have a >0% load, even though the script technically called for more.

2) If I leave the process running overnight, I get the following error:

Offset 4.0e-01 pixels due to bidirectional scanning detected.
Registering the first 10000 frames just to obtain a good template......done.
Template initialization complete. Now registering all the frames with new template.
The file was saved successfully.
Elapsed time : 1140.249 s.
30000 out of 82521 frames registered, iteration 1 out of 1..
Starting parallel pool (parpool) using the 'local' profile ... connected to 12 workers.
Starting parallel pool (parpool) using the 'local' profile ...

Error using parfeval (line 58)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.

Error in normcorre_batch_even (line 255)
future_results(i) = parfeval(@register_frame, 2, Yt,fftTempMat,fftTemp,patches,options);

Error in twostep_moco_applyshifts_clusterMAT (line 102)
[~,shifts,template] = normcorre_batch_even(fullname_R,options_mc,template);

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 676)
Failed to initialize the interactive session.
Error using parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus (line 790)
The interactive communicating job failed with no message.

Is it possible that there is no upper limit on how many times parpool gets called, so it keeps being called until it runs out and errors? Any tips would be greatly appreciated. If there is a better way to deal with drift, that would be great too.

Thank you!

epnev commented 5 years ago

@naomicxia This looks like a cluster problem. Have you tried searching online for this error? For example here?

On the other hand, may I ask why you use such a large value for bin_width? It is possible (although unlikely) that the cluster shuts down due to inactivity because you register so many frames at a time.
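One way to test the inactivity hypothesis is to create the pool once, reuse it, and raise its idle timeout before starting the registration. The following is a minimal sketch, assuming the Parallel Computing Toolbox and the 'local' profile with 12 workers reported above. The 600-minute timeout is an arbitrary illustrative value, and nothing in this thread confirms that idle shutdown is the actual cause.

```matlab
% Reuse an existing pool if there is one; otherwise start one explicitly.
nWorkers = 12;                      % worker count reported in this thread
pool = gcp('nocreate');             % returns [] instead of auto-starting a pool
if isempty(pool)
    pool = parpool('local', nWorkers);
end

% IdleTimeout is expressed in minutes. Raising it keeps the pool alive
% through long stretches in which no parallel work is submitted.
pool.IdleTimeout = 600;             % arbitrary value, chosen for illustration

fprintf('Pool has %d workers, idle timeout %d min\n', ...
        pool.NumWorkers, pool.IdleTimeout);
```

If the pool still restarts partway through the run with this in place, the workers are more likely crashing (for example from memory pressure) than timing out.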

Also, what file format is the output of your motion correction? Tiff, hdf5, memory mapped file? Memory mapped will be very inefficient in this case, and I would highly recommend hdf5.

naomicxia commented 5 years ago

@epnev I was hoping that increasing bin_width would help me get a more static/stable template, because I have z-drift in my videos such that the FOV is slightly warped when comparing early frames to later frames. But I'm not sure if that's the best solution for z-drift... how do you typically deal with drift?

I use tif as an output format. The segmentation script only takes tif so I haven't tried with hdf5 yet...

Thanks, I will look into the cluster problem more!

epnev commented 5 years ago

@naomicxia I would suggest first decreasing bin_width and saving as an hdf5 file to make sure that the code works. Then you can try saving as a TIFF. I don't see much value in such a large bin_width.

Alternatively, you can split your file into multiple smaller files, e.g., 10k frames each, and correct them one by one, using the template produced by file N as the initial template for file N+1. Lines ~24-40 here show an example of how to do this.
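The chaining idea can be sketched roughly as follows. This is not the linked example but an illustration under assumptions: the chunk file names are hypothetical, the option names ('d1', 'd2', 'bin_width', 'max_shift', 'output_type', 'h5_filename') are the ones I understand NoRMCorreSetParms to accept, and normcorre_batch is assumed to share the (input, options, template) call pattern of the normcorre_batch_even call in the stack trace above. Please check both against your installed version.

```matlab
% Hypothetical chunk files, each holding roughly 10k frames of the recording.
files = {'chunk_01.mat', 'chunk_02.mat', 'chunk_03.mat'};

% Moderate bin_width; d1/d2 match the 512x512 frames described in this thread.
% Option names are assumptions; verify them in your NoRMCorreSetParms.
options = NoRMCorreSetParms('d1', 512, 'd2', 512, ...
                            'bin_width', 200, 'max_shift', 15, ...
                            'output_type', 'hdf5');

template = [];                        % no template yet for the first chunk
shifts   = cell(numel(files), 1);
for k = 1:numel(files)
    [~, name] = fileparts(files{k});
    options.h5_filename = [name '_mc.h5'];    % one output file per chunk

    if isempty(template)
        % First chunk: let the registration build its own template.
        [~, shifts{k}, template] = normcorre_batch(files{k}, options);
    else
        % Later chunks: seed with the template returned by the previous chunk.
        [~, shifts{k}, template] = normcorre_batch(files{k}, options, template);
    end
end
```

Because each chunk starts from the previous chunk's final template, the reference can follow slow changes in the field of view without registering tens of thousands of frames per bin.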

epnev commented 5 years ago

@naomicxia any updates on this?

flatironinstitute / NoRMCorre

Parallel processing errors when reading large bin_widths #26