ControlSystemStudio / cs-studio

Control System Studio is an Eclipse-based collection of tools to monitor and operate large-scale control systems, such as those in the accelerator community.
https://controlsystemstudio.org/
Eclipse Public License 1.0

Archive retention policy and tools #434

Open utzeln opened 10 years ago

utzeln commented 10 years ago

As discussed with Kay, the concept of retention of archived data is in place at the database level – i.e. we can configure and store the retention information for each PV:

PS: for now only the archive reporting tool allows entering the retention information via the web browser. It could be useful to add a specific syntax for that in the BEAUTY XML configuration file.

Even though retention can be defined, no tool actually uses it.

It is proposed to develop a tool that looks for all PVs with "yearly" retention for instance and then deletes the values that are too old.

Or looks for all PVs with "reduce" type of retention, then somehow reduces the volume of data by simply keeping every 10th value for instance after 1 day.
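As a minimal sketch of the first variant (PostgreSQL syntax; the channel/retent tables and their retent_id linkage are assumptions based on the schema described later in this thread, not an existing tool):

```sql
-- Hypothetical sketch: delete samples older than one year for every PV
-- whose retention policy is "yearly".  Assumes BEAUTY-style tables
-- channel (channel_id, retent_id) and retent (retent_id, descr).
DELETE FROM sample
 WHERE smpl_time < NOW() - INTERVAL '1 year'
   AND channel_id IN (SELECT c.channel_id
                        FROM channel c
                        JOIN retent r ON r.retent_id = c.retent_id
                       WHERE r.descr = 'yearly');
```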

kasemir commented 10 years ago

It is a good idea to allow import and export of the retention config. via the XML file.

It is proposed to develop a tool that looks for all PVs with "yearly" retention for instance and then deletes the values that are too old.

One detail with this: a plain "delete from samples where smpl_time < ..." is likely very slow. Sites that are interested in dropping data by week, month, or year are much better off preparing partitioned sample tables, so the yearly deletion turns into an instantaneous "drop partition" command.

Or looks for all PVs with "reduce" type of retention, then somehow reduces the volume of data by simply keeping every 10th value for instance after 1 day

Similarly, the best implementation for this would be based on partitions, copying/extracting the data into a reduced new partition, then replacing the original partition with the new one, and dropping the old one.
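For illustration, the partition-based deletion could look like this in PostgreSQL 10+ declarative-partitioning syntax (a sketch with illustrative names; the actual setup discussed here used Oracle partitions, and a later comment uses the older inheritance approach instead):

```sql
-- Range-partition the sample table by year (illustrative names/columns).
CREATE TABLE sample_part (
    channel_id  INT       NOT NULL,
    smpl_time   TIMESTAMP NOT NULL,
    float_val   DOUBLE PRECISION
) PARTITION BY RANGE (smpl_time);

CREATE TABLE sample_part_2012 PARTITION OF sample_part
    FOR VALUES FROM ('2012-01-01') TO ('2013-01-01');

-- Dropping a whole year is then a near-instantaneous catalog operation
-- rather than a long-running row-by-row DELETE:
DROP TABLE sample_part_2012;
```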

laurent-PHILIPPE commented 8 years ago

It would be very interesting to have a generic program to remove or "reduce" old data based on the retent table information. I use the CSS archive over MySQL and currently have 8 GB of data.

What is the status of this project ?

kasemir commented 8 years ago

On Mar 18, 2016, at 9:49 AM, laurent-PHILIPPE notifications@github.com wrote:

It would be very interesting to have a generic program to remove or "reduce" old data based on the retent table information. I use the CSS archive over MySQL and currently have 8 GB of data.

What is the status of this project ?

Hi:

No progress.

In principle, it'd be an easy student project to write a tool that looks for all PVs where retention is configured as "delete after 1 year", and then issues the

DELETE FROM sample WHERE smpl_time < ..;

But that SQL statement is likely to run for a very long time, and it won’t free up much disk space unless you also perform some ‘pack’ operation that’s specific to your RDB.
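One common mitigation for the long-running statement (a sketch, not part of any existing CS-Studio tool; PostgreSQL syntax, assuming the standard sample table) is to delete in bounded batches so no single statement holds locks for hours:

```sql
-- Delete at most 10000 old rows per statement; an external script would
-- repeat this until 0 rows are affected.  The ctid subquery trick is
-- PostgreSQL-specific.
DELETE FROM sample
 WHERE ctid IN (SELECT ctid
                  FROM sample
                 WHERE smpl_time < NOW() - INTERVAL '1 year'
                 LIMIT 10000);
```

Note that, as said above, this still doesn't return disk space to the OS without a subsequent 'pack'/vacuum step specific to the RDB.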

Our Oracle setup uses partitions, so it’s very easy to simply remove a whole partition, but that would be by date, for all channels.

Going forward, at SNS we’re actually interested in the Archive Appliance. In CSS, to end users, it would be the same, but underneath the RDB is replaced with the Archive Appliance.

It’s looking good, but it’s not a simple drop-in replacement.

Our operational setup is comparatively simple: about 90000 PVs, generating about 1500 updates per second. The archive appliance is supposed to handle a million channels and even more updates per second. But for now we have not managed to simply reproduce our existing setup with the archive appliance.

In principle, the archive appliance can load the existing *.xml config files of the channel archiver or the RDB setup. But our config files use the "HH:MM:SS" notation for the period, while the archive appliance can only handle "1.23" seconds. Plus, if you load many channels, it basically pauses as soon as there are 1000 disconnected channels. So we can't simply bulk-load our existing setup; we have to run the config files through a little patching and connection checking first.

Our RDB archive engines use basically no CPU, since it’s only 1500 samples/second overall. All the work is on the Oracle server side. The archive is the biggest task in terms of CPU and disk space for our Oracle setup, but it’s handling it. Retrieval isn’t as fast as we’d like, though, which would be the main reason for upgrading. For the archive appliance, since the ‘engine’ taking the data and the ‘storage’ which writes them is all on one computer, we do see CPU and memory usage way higher than with just the old archive engine.

For one PV "X", instead of fetching the meta data (limits, units, ..) once and then subscribing to just "X", I believe the appliance subscribes to "X", "X.HIHI", "X.LOLO" etc., depending on the channel name and the RTYP field. That's great because a "policy" allows you to, for example, archive the RBV for all motor records. But it also causes the channel count to explode just because you want to catch a change in HIHI.

The appliance does support several levels of storage: Last hour, last day, last year, and as data is moved across these storage levels, it’s basically possible to reduce the data. While for now only “don’t copy” might be implemented, this is a good hook where more sophisticated reduction could be added - finally coming back to your question!

So the archive appliance looks great, I hope it’ll work out, but for us we’re not quite there.

Thanks, Kay

utzeln commented 8 years ago

We are currently setting up a retention policy in order to reduce and move "old samples", and we intend to split our current sample table of 1 447 873 488 rows into separate tablespaces and schemas. First we will set up multiple write sample tables – CRYO, MAGNET… Then each archive engine will be configured to write to its dedicated table – for instance org.csstudio.archive.writer.rdb/write_sample_table=sample_CRYO.

And new tables will be created:

create table sample_CRYO () inherits (sample);
create table sample_RETENT () inherits (sample);

This will result in having the sample table to query all samples – from CRYO, MAGNET and from tables with retention policy, the sample_CRYO table to query samples from CRYO only and the sample_RETENT for old samples only. With this approach, there is no need to change the reader application already connected to the sample table which provides then the global view of all samples.

From there, yearly tables will be created in a different schema:

create table sample_2012 (
    check (record_date >= '2012-01-01 00:00:00'::timestamp
       and record_date <  '2013-01-01 00:00:00'::timestamp)
) inherits (sample_RETENT) tablespace retent_tablespace;
...
create table sample_2016 (
    check (record_date >= '2016-01-01 00:00:00'::timestamp
       and record_date <  '2017-01-01 00:00:00'::timestamp)
) inherits (sample_RETENT) tablespace retent_tablespace;

This way Postgres (in our case) knows that the samples in these tables belong only to a designated time period. If a query covers 2016, for example, PostgreSQL will not search the 2012, 2013, 2014 and 2015 tables. In addition, with the use of tablespaces, old data can be separated physically as well as logically by schema.
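This pruning can be checked directly (a sketch; assumes the yearly inheritance children with their CHECK constraints on record_date):

```sql
-- With constraint_exclusion set to 'partition' (the default, which also
-- covers inheritance children), the planner uses the CHECK constraints
-- to skip tables whose range cannot match the query:
SET constraint_exclusion = partition;

EXPLAIN
SELECT channel_id, record_date
  FROM sample_RETENT
 WHERE record_date >= '2016-03-01'
   AND record_date <  '2016-04-01';
-- The plan should scan only sample_2016, not sample_2012..sample_2015.
```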

Finally, a table for the samples of the last month, sample_month, will be created; its samples will be moved to sample_year at the end of the period with some data reduction. At the end of the year, samples from sample_year will be moved to sample_2016 and reduced.
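The end-of-year move with reduction could be sketched as a single transaction (hypothetical; keeps every 10th value per channel as suggested earlier in this thread, and assumes sample columns channel_id, record_date, float_val):

```sql
-- Sketch: move last year's samples from sample_year into sample_2016,
-- keeping only every 10th value per channel (illustrative reduction).
BEGIN;

INSERT INTO sample_2016 (channel_id, record_date, float_val)
SELECT channel_id, record_date, float_val
  FROM (SELECT channel_id, record_date, float_val,
               ROW_NUMBER() OVER (PARTITION BY channel_id
                                  ORDER BY record_date) AS rn
          FROM sample_year
         WHERE record_date < '2017-01-01') numbered
 WHERE rn % 10 = 1;

DELETE FROM sample_year WHERE record_date < '2017-01-01';

COMMIT;
```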

For us, the model looks like this with our system in production – UTIL for utilities and BUIL for building: (see the attached diagram)

An important step is to create functions/procedures/triggers to move samples from:

During the move, a retention policy will be applied on a per-PV basis – via the retent_id and retent_val columns of the channel table. The retention policy that needs to be set up in the retent table is:

Then, the retention policy has to be defined on PVs such as:

After 1 year, the applied retention is increased from the user-defined one:

We are currently writing and testing these three retention policies, and we plan to extend the archive configuration XML syntax to support the retention policy.

laurent-PHILIPPE commented 8 years ago

Hi Utzeln, Kay

First, thanks for your answers and sorry for my delayed reply.

Do you know which laboratories are using the ArchiveEngine and plan to use it for the long term?

Kay, when do you think you will migrate to the Archive Appliance?

Do you think the Archive Appliance is a mature product?

Utzeln, your proposition is very close to the GANIL view. I would appreciate it a lot if you could keep me informed of your progress. We currently use MySQL as our database server, so I cannot reuse it directly (MySQL doesn't implement the "inherits" link, only "simple" partitioning). However, we plan to migrate our server next year, so I can consider a PostgreSQL migration (we already use PostgreSQL for some software packages).

Regarding select performance, I know there are NoSQL databases specialized in time-series data; they would be a good avenue to explore.

Best regards.