canlab / CanlabCore

Core tools required for running Canlab Matlab toolboxes. The heart of this toolbox is object-oriented tools that enable interactive analysis of neuroimaging data and simple scripts using high-level commands tailored to neuroimaging analysis.
https://canlab.github.io/CanlabCore/
Other
149 stars 80 forks source link

Rezipping in fmri_data may be unnecessary #60

Open jcf2 opened 1 month ago

jcf2 commented 1 month ago

Currently fmri_data will uncompress .nii.gz inputs automatically using gunzip_image_names_if_gz:

[image_names, was_gzipped] = gunzip_image_names_if_gz(image_names);

This allows it to read .nii files. Depending on MATLAB gunzip and whether it deletes the original, this would result in either two files (.nii and .nii.gz) or just the .nii version. As housekeeping, fmri_data apparently assumes the latter as the worst case and rezips:

    % re-zip images if they were originally zipped
            % add .gz back to file names.
            if any(was_gzipped)

                image_names = re_zip_images(image_names, was_gzipped);

            end

where re_zip_images is a subfunction of fmri_data that calls MATLAB's gzip. This gzip call overwrites a .nii.gz if it exists, and re_zip_images then removes the .nii:

            gzip(deblank(image_names(i, :)));
            delete(deblank(image_names(i, :)))

This approach is unfortunately extremely costly, as gzip may take around 10x as much time as gunzip, and would seem to be totally unnecessary if the .nii.gz was in fact not deleted by gunzip. Also, it would seem that a replacement of .nii.gz files may invalidate cryptographic hashes calculated on the original.

So, one simple improvement would be to make the "housekeeping" step conditional, and if a .nii.gz exists remove the .nii instead.

In fact, the documentation for MATLAB's gzip (R2024a) says

gunzip does not delete the original GNU zip files.

and this seems to be true as well for past versions, so while OK to have that conditional "just in case" it seems rezip should be 100% avoidable here?

jcf2 commented 1 month ago

Note: the original approach was apparently to use the host system's gunzip

% [status, result] = system(['gunzip ' my_image]);

and there deletion of the original .gz would be the default (overridable by --keep). So that explains where the "worst case assumption" above comes from...

jcf2 commented 1 month ago

(oops, didn't mean to close this prematurely above... doh)

The above PR is designed to address efficiency related to zip with a minimal change -- just avoid an expensive and unnecessary call to gzip in fmri_data if the target already exists.

A separate issue that could be fixed by a deeper change is that fmri_data will fail on a readable .nii.gz if the directory is not writable (with a an obscure "internal error has occurred" originating from MATLAB gunzip). A fix that covers that as well could be to change the unzip structure:

  1. create a temporary directory: gunzip_temp = tempname;
  2. unzip there instead of in .gzip directory: gunzip(<target>, gunzip_temp);
  3. when the time comes for housecleaning, remove the temporary directory

Though in a way conceptually simpler, this approach would affect gunzip_image_names_if_gz as well as fmri_data (thus possibly affecting code beyond fmri_data), and would require passing the gunzip_temp location. Presumably in fmri_data it would also call for some renaming/restructuring, as re_zip_images would no longer be an accurate name.