Fix hdf5 tile data after animation stops

pford commented 7 months ago

Description

What is implemented or fixed? Mention the linked issue(s), if any. Fixes #1368
How does this PR solve the issue? Give a brief summary. By default, image quality is decreased during animation for better speed. For each channel, tile data for HDF5 images is requested from the tile cache, encoded so that NaN values may be restored, and then compressed. However, the encoding step changes the cached tile data (passed with a shared_ptr) by replacing a NaN values in a block with the average of finite values in that block. When the cached tile data is requested again after animation stops, for image data with a higher compression quality, the image data is distorted since some values are no longer NaN. To fix this, the tile data is assigned to a new vector for encoding and compression, leaving the cached data unchanged.
Are there any companion PRs (frontend, protobuf)? No
Is there anything else that testers should know (e.g. exactly how to reproduce the issue)? Zoom in an HDF5 image to an area with NaN values, then start and stop animation. There should be no defect in the image after the animation is stopped.

Checklist

[x] changelog updated / ~no changelog update needed~
[x] e2e test passing / ~corresponding fix added~ / ~new e2e test created~
[ ] ICD test passing / corresponding fix added / new ICD test created
[x] ~protobuf updated to the latest dev commit~ / no protobuf update needed
[x] ~protobuf version bumped~ / protobuf version not bumped
[x] added reviewers and assignee
[x] added ZenHub estimate, milestone, and release

pford commented 7 months ago

Not sure if this should be code reviewed or tested first since it involves copying data, which we try to avoid.

confluence commented 6 months ago

In this case I think copying is unavoidable -- the compression function should not be modifying data in the tile cache!

I do remember adding a tile pool to the tile cache to avoid constantly creating and destroying vector objects, since this was causing quite a significant slowdown. I wonder if a tile pool would be appropriate here as well (I think that we should test the performance impact of this change first).

If yes, then we could either create a new tile pool to be used by this function, or add a GetCopy function to the tile cache which performs the copy internally and uses the tile cache's own tile pool (we may then need to adjust the tile cache pool size when we start and stop the animation, but maybe the tile requirements aren't going to exceed the default size, which IIRC is enough to fit one row and one column of 2x2 tile chunks).

confluence commented 6 months ago

Actually, now I see that the current change is in FillRasterTileData, which means that the copy is being performed regardless of how the data was obtained in GetRasterTileData (which means that it will also affect cases where the data does not need to be copied). It's GetRasterTileData which should be modified -- there's already a new vector which is created there (for the cases which don't use the tile cache), which could be used to copy the cached data.

Perhaps adding a new tile pool in GetRasterTileData, and using it in all cases instead of creating a new tile vector, would be a good idea.

pford commented 6 months ago

@confluence I added TileCache::GetCopy. If I added a TilePool to Frame, wouldn't it have the same problem as the TileCache pool?

confluence commented 6 months ago

@confluence I added TileCache::GetCopy.

Currently GetCopy is creating a new vector. I was originally suggesting using the tile cache's pool to recycle a vector object to use for the copy (i.e. using _pool->Pull() to get a vector pointer).

If I added a TilePool to Frame, wouldn't it have the same problem as the TileCache pool?

I'm not sure what problem you mean here.

I think I've confused the issue by conflating multiple suggestions -- for clarity, I'm talking about two separate things:

In GetRasterTileData we already create a new vector to read data into, in the cases that don't use the tile cache. I would suggest that we use this vector as well in the tile cache case, for consistency, instead of creating a vector in a different place. Perhaps the GetCopy function should be called CopyInto and take that vector as an input/output parameter -- but maybe this function isn't necessary at all, and the copy can just be inlined in GetRasterTileData. (I originally suggested the tile cache function to enable the use of the tile cache's pool, before I realised that the tile cache case could be integrated more generically with the other cases.)
Additionally, as a performance improvement, I think it would be a good idea to use a tile pool for that new vector in GetRasterTileData, instead of repeatedly creating and destroying it. My updated suggestion for this is to add a separate tile pool object to the frame, to be used here in all the cases, not just the tile cache case. I haven't checked if there are other places in the frame where a new tile vector is created (which could make use of the same pool).

confluence commented 6 months ago

@pford I was suggesting to use only a tile pool in the frame for all the tiles, not the tile cache -- we only added the full-resolution tile cache to HDF5 files because in the HDF5 schema the main dataset is chunked, which makes it efficient to read a chunk (which is 2x2 tiles) at a time. For FITS files and other formats, reading a chunk at a time is not efficient, which is why we still use the full channel cache for those formats. I don't think it makes sense to copy tiles read from the full channel cache into the tile cache (this is currently bypassing the tile cache's tile pool, and I believe that downsampled tiles are also being saved -- this cache is only written to handle full-resolution tiles).

I am suggesting only that the frame gets its own tile pool object (separate to the tile cache's tile pool), and uses that pool to recycle the tile pointers (instead of creating new pointers with newly allocated data). This wouldn't be caching any tile data; just keeping a buffer of reusable objects allocated in memory. They would be overwritten with new data every time they are used (so it would be fine to use them for all tiles in GetRasterTileData, whether they're downsampled or not).

So the frame would have a std::shared_ptr<TilePool> _pool, with some reasonable capacity (how many tiles are likely to be allocated at the same time during an animation?), and in GetRasterTileData instead of creating a new tile_data vector and later a new pointer with make_shared, we'd get a pointer from the pool with _pool->Pull(), and write to that pointer's data. The pool class handles the recycling internally (with a custom deleter).

So it would be just this small change, plus the copying in the tile cache case.

pford commented 6 months ago

I added a TilePool to Frame with the capacity set to the number of omp threads, since Session fills the tiles in parallel using omp. The TilePool is created the first time tiles are requested, since tiles are not used when Frame holds the downsampled image for PV preview.

github-actions[bot] commented 6 months ago

Package	Line Rate	Health
src.Cache	72%	➖
src.DataStream	44%	➖
src.FileList	67%	➖
src.Frame	36%	❌
src.HttpServer	42%	➖
src.ImageData	28%	❌
src.ImageFitter	83%	✔
src.ImageGenerators	43%	➖
src.ImageStats	75%	✔
src.Logger	37%	❌
src.Main	52%	➖
src.Region	69%	➖
src.Session	4%	❌
src.Table	52%	➖
src.ThreadingManager	67%	➖
src.Timer	85%	✔
src.Util	40%	➖
Summary	46% (8614 / 18797)	➖

confluence commented 6 months ago

@kswang1029 in addition to testing the bug fix, please check the performance impact for animation (of both HDF5 files and other file types). I expect this to slightly decrease performance for HDF5 files but also slightly increase performance across the board, and I'd like to find out what the overall impact is (and to make sure that something unexpected isn't happening).

confluence commented 6 months ago

@kswang1029 @pford in that case I'm happy to merge this.

kswang1029 commented 6 months ago

@kswang1029 @pford in that case I'm happy to merge this.

the effect (if any) may be more prominent using "large" cube but I do not have the resources to test with, unfortunately.

CARTAvis / carta-backend

Fix hdf5 tile data after animation stops #1371