Open confluence opened 3 weeks ago
Package | Line Rate | Health |
---|---|---|
src.Cache | 72% | ➖ |
src.DataStream | 44% | ➖ |
src.FileList | 67% | ➖ |
src.Frame | 36% | ❌ |
src.HttpServer | 42% | ➖ |
src.ImageData | 28% | ❌ |
src.ImageFitter | 83% | ✔ |
src.ImageGenerators | 44% | ➖ |
src.ImageStats | 75% | ✔ |
src.Logger | 37% | ❌ |
src.Main | 52% | ➖ |
src.Region | 69% | ➖ |
src.Session | 4% | ❌ |
src.Table | 52% | ➖ |
src.ThreadingManager | 67% | ➖ |
src.Timer | 85% | ✔ |
src.Util | 40% | ➖ |
Summary | 46% (8632 / 18823) | ➖ |
I ran a test with a directory containing 20000 FITS image files, and a directory containing 20000 CASA image directories. The test computer is a local build with a fast SSD. I purged the disk cache before each test case.
The test results (in seconds) are summarized in the following table:
We see improvements of ~2x for the CASA images with the "filter by extension" and the "all files" modes. Is this improvement expected, and is it consistent with what you tested and observed?
The psrecord plots are attached below as a record. The first plateau is casa image and the second is fits image.
old - content
new - content
old - extension
new - extension
old - all
new - all
It appears to me that once we access the content of a file, the CPU usage no longer reaches 100% in my test case.
This is an attempt to improve the performance of generating a file list for directories with lots of subdirectories. The main performance bottlenecks in the current code are:
There are three options for displaying file lists: filtering by content, filtering by file extension, and no filtering. Filtering by content additionally slows down file list generation because it reads the signature of each ordinary file to determine its type. Guessing the type from the file extension speeds up the processing of ordinary files, but not of subdirectories, which are always checked with casacore's `ImageOpener`. Turning off filtering entirely has no additional performance impact on the image file list (because detection of directory-based images is still required), but speeds up the region file list (because if no filtering is required, all directories can be shown without additional checks).

This PR is an attempt to add a less expensive heuristic for directory-based images, to be used when the option to filter by content is not selected. It performs almost the same checks as `ImageOpener`, but only for directories, by looking for files inside the directory with `fs::exists`, and without distinguishing between different CASA image subtypes.

The existing code assumes that there are directory image formats which we do not support, and handles them differently. However, it's clear from the `ImageOpener` code that the GIPSY format is a pair of files, not a directory (so `ImageOpener` would never return that type for a directory), and the `CAIPS` and `NEWSTAR` types are obsolete and never returned by `ImageOpener` at all. So I have removed this option from the code, and have not implemented it in the alternative code.

The result: the alternative code appears to be slightly faster, but I don't know if the gain is large enough to justify adding it as an alternative to the casacore code. If this implementation is sufficient for our purposes (e.g. we don't need to read the `table.info` file because we don't need to distinguish between CASA sub-types here), then perhaps we should replace the casacore check with this (for a modest speed improvement in all cases).

Other optimizations we discussed:
I think we're planning to use the last option as our long-term solution, and I would suggest applying that strategy to all files: instead of loading file information up-front when generating the file list, we could initially return just a bare list of files and directories, and then return information for lists of files and directories as the frontend requests them.