ARPA-SIMC / arkimet

A set of tools to organize, archive and distribute data files.

Allow local queries without write access to archive #258

Open dcesari opened 3 years ago

dcesari commented 3 years ago

It is sometimes useful, especially in a shared HPC environment where it is difficult to keep a daemon running, to be able to run a local query on the filesystem without having write permission on the archive directories. It is acceptable to assume that no messages are removed from the archived files, i.e. that changes to the archives only add messages to files and indices.

spanezz commented 3 years ago

My question is: if I have no way to synchronize through the file system, how can I prevent a repack of the dataset, which can potentially rewrite any part of an existing data segment, from turning a running query into garbage?

I think the answer to that question depends on organizational structures and processes. For example, if queries are tied to specific times, one could add a dataset configuration that defines query times and maintenance times on a daily schedule, e.g. saying that one cannot query between 00:00 and 04:00, and one cannot do repacks between 04:00 and 24:00.
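To make the schedule idea concrete, here is a minimal sketch of such a check. The names `MAINTENANCE_START`, `MAINTENANCE_END`, `query_allowed` and `repack_allowed` are hypothetical and are not existing arkimet configuration keys or functions; the window is the 00:00 to 04:00 example from above.

```python
from datetime import time

# Hypothetical daily schedule, mirroring the example above:
# maintenance (repacks) from 00:00 to 04:00, queries for the rest of the day.
MAINTENANCE_START = time(0, 0)
MAINTENANCE_END = time(4, 0)


def query_allowed(now: time) -> bool:
    """Return True if a read-only query may start at this time of day."""
    in_maintenance = MAINTENANCE_START <= now < MAINTENANCE_END
    return not in_maintenance


def repack_allowed(now: time) -> bool:
    """Return True if a repack may start at this time of day."""
    return not query_allowed(now)
```

Note that this only checks the start time: a query started at 03:59 is refused, but one started at 23:59 could still be running when maintenance begins, so a real schedule would probably need a safety margin or a maximum query duration.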

Or there could be the assumption that when a dataset is down for maintenance, it gets unmounted/unexported from the readonly part of the filesystem where the queries happen? Like, taken offline for maintenance?

It's ok to assume that messages are not removed. How about messages overwriting old ones (like datasets with rewrite=yes), where a rewrite is a deletion+import?

Note also that a repack would reorder data in a dataset without deleting anything. For example, if data is not imported in strict reftime order, a repack reorders it so that a query, which returns data sorted by reftime, can read the segment sequentially as much as possible rather than jumping back and forth. I don't know how significant the impact of that optimization is, and I guess it would depend on what kind of data is in a dataset. I'd expect it to be worse for BUFR and VM2, and not so bad for big GRIBs and HDF5 files. It's ok not to do that if the performance change is understood not to be a big deal.
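As a toy illustration of the "jumping back and forth" effect, the sketch below counts how many times a reader has to move backwards in a segment when it returns messages sorted by reftime. It is a simplified model, not arkimet code: reftimes are plain integers, each list position stands for one message in file order, and `count_backward_seeks` is a hypothetical helper.

```python
def count_backward_seeks(reftimes_in_file_order):
    """Count backward jumps needed to read a segment in reftime order.

    Each list position stands for a message's place in the file; the
    value is its reftime (a plain integer in this toy model).
    """
    # File positions, listed in the order a reftime-sorted query visits them.
    read_order = sorted(
        range(len(reftimes_in_file_order)),
        key=lambda pos: reftimes_in_file_order[pos],
    )
    # A backward seek happens whenever the next position is before the current one.
    return sum(1 for prev, cur in zip(read_order, read_order[1:]) if cur < prev)


imported_out_of_order = [3, 1, 2, 5, 4]
after_repack = sorted(imported_out_of_order)
```

With these inputs, the out-of-order segment needs two backward seeks and the repacked one needs none. How much that matters in practice depends on message size and on the storage, which this model does not capture.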

I feel like there are many options and none is universally good, so I'd like to work through some scenarios in detail in order to identify specific sets of tradeoffs.

dcesari commented 9 months ago

Actually, a low-profile implementation that performs the query on a best-effort basis (possibly returning an error code if it is possible to detect that some of the relevant metadata changed in the middle of the query) would be enough.
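One possible shape for such a best-effort read, sketched under strong assumptions: it uses the file's inode, size and mtime from `os.stat` as a stand-in for "the relevant metadata", which is not necessarily how arkimet would detect changes. `Fingerprint`, `SegmentChanged` and `read_best_effort` are hypothetical names, not arkimet APIs.

```python
import os
from typing import NamedTuple


class Fingerprint(NamedTuple):
    inode: int
    size: int
    mtime_ns: int


class SegmentChanged(Exception):
    """Raised when a segment looks modified while it was being read."""


def fingerprint(path: str) -> Fingerprint:
    st = os.stat(path)
    return Fingerprint(st.st_ino, st.st_size, st.st_mtime_ns)


def read_best_effort(path: str, offset: int, size: int) -> bytes:
    """Read `size` bytes at `offset` without taking any lock."""
    before = fingerprint(path)
    with open(path, "rb") as fd:
        fd.seek(offset)
        data = fd.read(size)
    after = fingerprint(path)
    # Conservative check: any change counts as suspicious, harmless appends included.
    if after != before:
        raise SegmentChanged(path)
    # A short read means the index entry no longer matches the file.
    if len(data) != size:
        raise SegmentChanged(path)
    return data
```

This is only a heuristic: a rewrite that happens before the first `stat` call, or one that restores the same size and timestamp, goes unnoticed. A more lenient variant could accept a file that only grew, in line with the append-only assumption.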

Of course this behavior should be enabled by an option that acts as a disclaimer, warning users that they may receive rubbish.

If there is a chance to implement such a behavior without a big effort we could go on, otherwise just close as WONTDO.