aoles / EBImage

:art: Image processing toolbox for R

Using on-disk array library for @.Data slot? #58

Open yeyuan98 opened 2 years ago

yeyuan98 commented 2 years ago

Hi,

Thanks a lot for this great package. I noticed that most of the memory footprint comes from the .Data slot, which is a base R array.

Every image operation applied directly to an Image or AnnotatedImage makes a full copy of the data and doubles memory usage. For example, given an Image object img, an operation like img * 50 doubles memory usage (even without assigning the result to anything!).
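To make the doubling concrete, here is a minimal sketch in base R. It uses a plain numeric array as a stand-in for the .Data slot (the 512 x 512 x 3 size is an assumption picked for illustration, not something from EBImage itself):

```r
# Stand-in for the array held in an Image object's .Data slot
# (hypothetical dimensions, chosen only for illustration).
img_data <- array(runif(512 * 512 * 3), dim = c(512, 512, 3))

# Size of the original array in bytes.
size_in <- as.numeric(object.size(img_data))

# Arithmetic on a base R array allocates a brand-new array for the result;
# the input is left untouched, so both copies are in memory at once.
result <- img_data * 50
size_out <- as.numeric(object.size(result))

cat("input :", size_in, "bytes\n")
cat("result:", size_out, "bytes\n")
```

The result is as large as the input, so while both exist the footprint is roughly twice that of the original array. When the result is not assigned, the same allocation still happens and is only released at the next garbage collection.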

Interestingly, the display function uses even more than double the memory. For example, display(img) typically gives me a memory bump of around 200%.

I assume this behavior is caused by the use of a base R array, which may already have been touched on in Issue #40.

Could we use on-disk array libraries to replace the current base R array?

For example, HDF5Array seems to be a reasonable replacement. I did a bit of research and it seems that 1) it is well supported by the Bioconductor core team, 2) it allows easy conversion of a base R array to an on-disk temporary HDF5 array (as simple as hdf5array <- HDF5Array(base.R.array)), and 3) it chunks the array into small pieces for fast access and only loads the relevant chunks into memory when needed.
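A rough sketch of what that could look like, assuming the Bioconductor HDF5Array package is installed. The array size and variable names are made up for illustration. This sketch uses writeHDF5Array() for the conversion, since the HDF5Array() constructor itself expects the path to an existing HDF5 file; nothing here is wired into EBImage.

```r
library(HDF5Array)  # Bioconductor package; assumed to be installed

# Hypothetical in-memory image data (size chosen only for illustration).
base_array <- array(runif(256 * 256 * 3), dim = c(256, 256, 3))

# Write the array to a temporary HDF5 file and get back an on-disk
# HDF5Array object that points at it.
h5_array <- writeHDF5Array(base_array)

# Arithmetic on the on-disk array is delayed: no full copy is made here,
# the operation is only recorded.
scaled <- h5_array * 50

# Data is read from disk only when a piece is actually realized in memory.
corner <- as.array(scaled[1:64, 1:64, 1])
```

Only the requested block ends up in memory as an ordinary array; the full scaled array is never materialized unless as.array(scaled) is called on the whole object.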