CARTAvis / carta-backend

Source code repository for the backend component of CARTA, a new visualization tool designed for the ALMA, the VLA and the SKA pathfinders.
https://cartavis.github.io/
GNU General Public License v3.0

Optimisation of directory type checking in file list #1381

Open confluence opened 3 weeks ago

confluence commented 3 weeks ago

This is an attempt to improve the performance of generating a file list for directories with lots of subdirectories. The main performance bottlenecks in the current code are:

  1. Checking whether a directory is a known image format, and should be treated as a file,
  2. Counting the children of a subdirectory.
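For reference, the second bottleneck amounts to a full directory scan per subdirectory. A minimal sketch of such a count, assuming `std::filesystem` is used (the helper name `CountChildren` is hypothetical and not taken from the backend code):

```cpp
#include <cstddef>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: counts the direct children of `dir`.
// Returns 0 if the directory does not exist or cannot be read.
std::size_t CountChildren(const fs::path& dir) {
    std::error_code ec;
    fs::directory_iterator it(dir, fs::directory_options::skip_permission_denied, ec);
    if (ec) {
        return 0;
    }
    const fs::directory_iterator end;
    std::size_t count = 0;
    while (it != end) {
        ++count;
        it.increment(ec); // non-throwing advance
        if (ec) {
            break;
        }
    }
    return count;
}
```

The cost grows with the number of entries in each subdirectory, so a listing with many large subdirectories pays for one scan per row.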

There are three options for displaying file lists: filtering by content, filtering by file extension, and no filtering. Filtering by content additionally slows down the file list generation because it reads the signature of each ordinary file to determine the type. Guessing the type from the file extension speeds up the processing of ordinary files, but not subdirectories, which are always checked with casacore's ImageOpener. Turning off filtering entirely has no additional performance impact on the image file list (because detection of directory-based images is still required), but speeds up the region file list (because if no filtering is required, all directories can be shown without additional checks).
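To make the combinations above easier to follow, here is a small sketch of which checks each mode requires, as I understand the behaviour described in this paragraph. The enum and function names are hypothetical and do not correspond to identifiers in the backend:

```cpp
// Hypothetical names, for illustration only.
enum class FilterMode { Content, Extension, None };
enum class ListType { Image, Region };

// Ordinary files only have their signature read when filtering by content.
constexpr bool NeedsSignatureRead(FilterMode mode) {
    return mode == FilterMode::Content;
}

// Directories must be checked for directory-based images in every case
// except the unfiltered region file list, where all directories are shown.
constexpr bool NeedsDirectoryImageCheck(FilterMode mode, ListType list) {
    return !(list == ListType::Region && mode == FilterMode::None);
}
```

This is why turning filtering off helps the region file list but not the image file list: for images, the directory check is still required in all three modes.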

This PR is an attempt to add a less expensive heuristic for directory-based images, to be used when the option to filter by content is not selected. It performs almost the same checks as ImageOpener, but only for directories, by looking for files inside the directory with fs::exists, and without distinguishing between different CASA image subtypes.
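A rough sketch of what such a heuristic could look like. This is not the code in the PR: the function and enum names are hypothetical, and the marker file names (`table.dat` for CASA images, `header` plus `image` for MIRIAD) are my assumption about what ImageOpener looks for, so they should be checked against the casacore source before relying on them.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical result type; CASA subtypes are deliberately not distinguished.
enum class DirImageType { Unknown, Casa, Miriad };

// Guesses whether `dir` is a directory-based image by testing for marker
// files only. Nothing is opened or read, so table.info is never parsed.
DirImageType GuessDirectoryImageType(const fs::path& dir) {
    std::error_code ec;
    if (!fs::is_directory(dir, ec)) {
        return DirImageType::Unknown;
    }
    // Assumed marker for a CASA table directory.
    if (fs::exists(dir / "table.dat", ec)) {
        return DirImageType::Casa;
    }
    // Assumed markers for a MIRIAD image directory.
    if (fs::exists(dir / "header", ec) && fs::exists(dir / "image", ec)) {
        return DirImageType::Miriad;
    }
    return DirImageType::Unknown;
}
```

Because no table metadata is read, a sketch like this would also report non-image CASA tables as `Casa`, which is the trade-off of skipping the subtype check.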

The existing code assumes that there are directory image formats which we do not support, and handles them differently. However, it's clear from the ImageOpener code that the GIPSY format is a pair of files, not a directory (so ImageOpener would never return that type for a directory), and the CAIPS and NEWSTAR types are obsolete and never returned by ImageOpener (at all). So I have removed this option from the code, and not implemented it in the alternative code.

The result: the alternative code appears to be slightly faster, but I don't know if the difference is large enough to justify adding it as an alternative to the casacore code. If this implementation is sufficient for our purposes (e.g. we don't need to read the table.info file because we don't need to distinguish between CASA sub-types here), then perhaps we should replace the casacore check with this (for a modest speed improvement in all cases).

Other optimizations we discussed:

I think we're planning to use the last option as our long-term solution, and I would suggest applying that strategy to all files: instead of loading file information up-front when generating the file list, we could initially return just a bare list of files and directories, and then return information for lists of files and directories as the frontend requests them.
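A minimal sketch of the two-stage idea, assuming a cheap first pass that returns only names and a directory flag, and a second call that fills in details for the entries the frontend actually asks about. All type and function names here are hypothetical, and file size stands in for whatever per-entry information is expensive to obtain:

```cpp
#include <cstdint>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical types for illustration.
struct BareEntry {
    std::string name;
    bool is_directory;
};

struct EntryInfo {
    std::string name;
    bool is_directory;
    std::uintmax_t size; // 0 for directories or on error
};

// Stage 1: a bare listing, with no per-entry inspection beyond the type flag.
std::vector<BareEntry> ListBare(const fs::path& dir) {
    std::vector<BareEntry> out;
    std::error_code ec;
    for (fs::directory_iterator it(dir, ec), end; !ec && it != end; it.increment(ec)) {
        std::error_code status_ec;
        out.push_back({it->path().filename().string(), it->is_directory(status_ec)});
    }
    return out;
}

// Stage 2: details only for the names that were requested.
std::vector<EntryInfo> Describe(const fs::path& dir, const std::vector<std::string>& names) {
    std::vector<EntryInfo> out;
    for (const auto& name : names) {
        std::error_code ec;
        const fs::path p = dir / name;
        const bool is_dir = fs::is_directory(p, ec);
        std::uintmax_t size = 0;
        if (!is_dir) {
            size = fs::file_size(p, ec);
            if (ec) {
                size = 0;
            }
        }
        out.push_back({name, is_dir, size});
    }
    return out;
}
```

With this split, the frontend could request details only for the rows currently visible, so the up-front cost no longer scales with per-entry inspection of the whole directory.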

github-actions[bot] commented 3 weeks ago

Code Coverage

| Package | Line Rate |
| --- | --- |
| src.Cache | 72% |
| src.DataStream | 44% |
| src.FileList | 67% |
| src.Frame | 36% |
| src.HttpServer | 42% |
| src.ImageData | 28% |
| src.ImageFitter | 83% |
| src.ImageGenerators | 44% |
| src.ImageStats | 75% |
| src.Logger | 37% |
| src.Main | 52% |
| src.Region | 69% |
| src.Session | 4% |
| src.Table | 52% |
| src.ThreadingManager | 67% |
| src.Timer | 85% |
| src.Util | 40% |
| Summary | 46% (8632 / 18823) |

kswang1029 commented 4 days ago

I ran a test with a directory containing 20000 FITS image files and a directory containing 20000 CASA image directories. The test computer runs a local build on a fast SSD. I purged the disk cache before each test case.

The test results (in seconds) are summarized in the following table: [screenshot of results table, 2024-07-01]

We see improvements of ~2x for the CASA images in the "filter by extension" and "all files" modes. Is this the improvement you expected, and is it consistent with what you tested and observed?

The psrecord plots are attached below as a record. In each plot, the first plateau is the CASA image test and the second is the FITS image test.

old - content: [psrecord plot]

new - content: [psrecord plot]

old - extension: [psrecord plot]

new - extension: [psrecord plot]

old - all: [psrecord plot]

new - all: [psrecord plot]

It appears to me that once we access the content of a file, the CPU usage does not reach 100% in my test case.