Autosave of the last method result into the file

dkazanc commented 2 months ago

Having some doubts that we should save the result of the last method automatically into the hdf5 file.

1. The user might not need the hdf5 file and might prefer tiffs and/or a binary file (TODO) instead.
2. The reconstruction result (being the last method) is always saved as a floating point array into hdf5, while it can be uint16 after rescale_to_int has been performed on it. Float32 is usually redundant precision for further analysis anyway (e.g. segmentation). Users rescale the data themselves to run some Otsu stuff on it.

More on the second point. I think it would be more sensible if we auto-generated a template for reconstruction methods with the flag save_result: False added to it and let the user decide whether saving into hdf5 is needed.

With respect to precision, it is worth thinking about. Basically, we don't need the result of the reconstruction to be saved, but the result of rescale_to_int instead. This makes me wonder whether we need an additional HTTomo parameter, e.g. saved_data_type: uint16, which would trigger global stats collection and then rescaling of the given array.
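To make the "global stats collection, then rescaling" idea concrete, here is a minimal NumPy sketch. The helper names global_min_max and rescale_chunk_to_uint16 are hypothetical and are not HTTomo's rescale_to_int; the point is only that the min/max must be gathered over all chunks before any chunk is rescaled, so that every chunk shares one intensity scale.

```python
import numpy as np


def global_min_max(chunks):
    """Collect the global minimum and maximum over all chunks of a volume."""
    lo = min(float(np.min(c)) for c in chunks)
    hi = max(float(np.max(c)) for c in chunks)
    return lo, hi


def rescale_chunk_to_uint16(chunk, lo, hi):
    """Map [lo, hi] linearly onto [0, 65535] and cast to uint16."""
    if hi <= lo:
        # Degenerate (constant) data: there is no range to stretch.
        return np.zeros(chunk.shape, dtype=np.uint16)
    scaled = (chunk.astype(np.float64) - lo) / (hi - lo) * 65535.0
    return np.rint(np.clip(scaled, 0.0, 65535.0)).astype(np.uint16)


# Two small chunks standing in for blocks of a reconstructed float32 volume.
chunks = [
    np.array([[-1.0, 0.0], [0.5, 1.0]], dtype=np.float32),
    np.array([[2.0, 3.0], [1.5, -0.5]], dtype=np.float32),
]

# First pass: global statistics. Second pass: rescale with the shared range.
lo, hi = global_min_max(chunks)
rescaled = [rescale_chunk_to_uint16(c, lo, hi) for c in chunks]
```

Using per-chunk min/max instead would give each chunk a different scale, which is why the stats pass has to complete before the rescaling pass starts.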

The logic behind it is the following: if we're saving the rescaled data into tiffs and the resulting hdf5 should be in uint16 as well, then we can save the time and space spent on saving floating point arrays (if that is needed at all). Secondly, the reconstructed hdf5 files in uint16 can be compressed well, I believe. Blosc should work quite well on them as they are possibly sparser than the projection data, especially after denoising!
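A quick way to sanity-check the compressibility claim is sketched below. It makes several assumptions: the data is synthetic (random values with most voxels zeroed to imitate a sparse, denoised volume), and zlib from the Python standard library stands in for Blosc, so the absolute numbers will not match what Blosc-compressed hdf5 chunks would give. It only compares the compressed size of the same data stored as float32 and as uint16.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a denoised reconstruction: mostly background zeros.
volume = rng.random((64, 64, 64), dtype=np.float32)
volume[volume < 0.8] = 0.0

# Rescale the same data to the full uint16 range.
as_uint16 = np.rint(volume / volume.max() * 65535.0).astype(np.uint16)

# Compress the raw bytes of both representations with the same codec.
float_bytes = len(zlib.compress(volume.tobytes()))
uint16_bytes = len(zlib.compress(as_uint16.tobytes()))

print(f"float32 compressed: {float_bytes} bytes")
print(f"uint16 compressed:  {uint16_bytes} bytes")
```

Running the same comparison on a real reconstructed volume, with the compressor that would actually be used, would be the meaningful test.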

Ideas? @yousefmoazzam

yousefmoazzam commented 2 months ago

Thanks for @'ting me :slightly_smiling_face:, here are some of my initial thoughts:

Any place we can avoid writing to a file is nice, so I'm onboard with that if it makes sense. We started with the assumption that users would more often than not want the final method's output written as an hdf5 file, but if that assumption doesn't hold, then I see no reason for the default behaviour to be to write an hdf5 file after the final method.
If it's indeed the case that 32-bit floating point is excessive precision for what users would want to do more often than not in "post-processing", then it sounds like it makes sense to favour the most commonly used lower precision.
I'm not overly familiar with the situation where precision is reduced from 32-bit to 16-bit and the values are also cast from floating point to unsigned integer:
- so it may well be a good idea for us to confirm nothing unexpected can happen regarding weird numerical errors due to precision loss + casting both happening
- or equally, if weird things can occur if done incorrectly, but it can also be avoided by doing it "the correct way", then it'd be good for us to be aware of how to do it the correct way!
I haven't really thought through how the parameter saved_data_type would work, but something which does spring to mind is that we might need to be mindful of how this would be implemented. Currently, I think global stats calculation + wrapper creation is triggered by something after the intermediate dataset saving wrapper (i.e. save_to_images has a glob_stats: True parameter, and this comes after the intermediate dataset saving wrapper that is automatically inserted on the user's behalf by httomo)
- With saved_data_type, the description of it given sounds like it would be on the method before the intermediate dataset saving wrapper, is that right? But then this parameter can also trigger global stats calculation, which is something that is currently done after the intermediate dataset saving wrapper, in save_to_images
  - So, saved_data_type goes on the method before the intermediate dataset saving wrapper, but global_stats can come on methods after the intermediate dataset saving wrapper -> at first glance it sounds confusing (maybe it's the right design or maybe it's not, I'm just pointing out it sounds a bit odd at a first glance)
- Not to mention also that triggering of global stats calculation typically requires generation of a side output (otherwise it cannot be referenced/used by another method, which would make the calculation of it pointless), and referring to a side output causes a "section cut".
- I'm sure there's a way to design it nicely (eg, insertion of stats calculating wrapper being triggered by one thing/param, same for intermediate dataset saving wrapper), my point is only that I think we need to put a bit of thought into it first, because it does sound like it could get fiddly.
With the mention of compressing hdf5 files that contain the recon: does this mean compressing the entire hdf5 file after it's been written, or compressing the individual chunks within the file when writing the chunks?
- If it's the latter, just a note that we'd then need to investigate the segfault currently occurring in the improve-intermediate-file-performance branch when compressing the chunks when saving the recon data for parallel runs.
- Maybe the precision loss and/or casting from float to int would magically get rid of the issue, but maybe not. If not, that'd need to be fixed before the compression could be done.
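On the earlier question about precision loss plus casting: a minimal NumPy sketch of one way to do the float to uint16 conversion defensively. The function name to_uint16_safely is hypothetical, and mapping NaN to the lower bound is an arbitrary choice made for illustration. The three guards address the usual failure modes: non-finite values, out-of-range values, and truncation instead of rounding.

```python
import numpy as np


def to_uint16_safely(data, lo, hi):
    """Rescale float data from [lo, hi] to uint16, guarding common pitfalls."""
    if not hi > lo:
        raise ValueError("hi must be greater than lo")
    # Replace NaN/inf first: casting them to an integer type is not well defined.
    clean = np.nan_to_num(data.astype(np.float64), nan=lo, posinf=hi, neginf=lo)
    # Clip so values outside [lo, hi] saturate instead of wrapping around.
    clean = np.clip(clean, lo, hi)
    # Round to nearest; a bare astype() would truncate towards zero.
    return np.rint((clean - lo) / (hi - lo) * 65535.0).astype(np.uint16)


data = np.array([-5.0, 0.0, 0.25, 1.0, 7.0, np.nan, np.inf], dtype=np.float32)
out = to_uint16_safely(data, lo=0.0, hi=1.0)
print(out)
```

For in-range values the quantisation error after this conversion is at most half of one step, i.e. (hi - lo) / 65535 / 2.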

dkazanc commented 2 months ago

Thanks Yousef, I think it needs more thinking/discussion. I just realised that we actually have the parameter save_result_default in the library file, which can control whether the result of reconstruction is saved, for instance. This is in addition to saving the last result of the pipeline, which makes things pretty dubious. I think we need to look into that before the release. Just do this:

- Switch off the last method saving rule
- check that the library's save_result_default controls the saving
- and switch off the default saving of hdf5 files for recon methods. We can try it for the release and get feedback from people.

dkazanc commented 4 weeks ago

So I looked into this:

The result of the last method is actually NOT saved automatically. I was confused because the result of reconstruction is always saved by default, driven by save_result_default: True in the library file.
save_result_default works as expected and save_result: False cancels it when used in the pipeline.

This all means that nothing needs to be done here, as one can cancel the saving of the reconstructed volume if needed. I'm also OK if the result of the recon is saved by default for now. If there is a request to turn it off, it can be done easily using save_result_default: False.
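The override behaviour described above can be summarised in a few lines. This is only an illustrative sketch: should_save is a hypothetical helper, not HTTomo's actual code, and it assumes that an absent save_result in the pipeline is represented as None.

```python
from typing import Optional


def should_save(save_result_default: bool, save_result: Optional[bool] = None) -> bool:
    """Use the pipeline's explicit save_result if given, else the library default."""
    return save_result_default if save_result is None else save_result


# Recon method with save_result_default: True and no override in the pipeline.
print(should_save(True))
# Same method, but the pipeline sets save_result: False.
print(should_save(True, save_result=False))
```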

DiamondLightSource / httomo

Autosave of the last method result into the file #283