ARPA-SIMC / arkimet

A set of tools to organize, archive and distribute data files.

Unable to delete data before a given day #286

Open mnuccioarpae opened 2 years ago

mnuccioarpae commented 2 years ago

I need to delete the data before 2011-12-21 for station 6257. So I ran the following commands:

$ DS=/arkivio/arkimet/dataset/locali
$ arki-query 'area:VM2,6257;reftime:<2011-12-21' $DS > 6257-da-eliminare.arkimet
$ arki-check --fix --remove 6257-da-eliminare.arkimet "$DS"

As a check, I ran the following commands:

$ arki-query --summary --dump 'area:VM2,6257' $DS > 6257-summary.txt
$ grep Reftime: 6257-summary.txt | cut -d\  -f4 | sort -u
2011-12-21T00:00:00Z
2011-12-21T10:00:00Z
2011-12-21T11:00:00Z
2011-12-23T00:00:00Z
2014-12-31T23:30:00Z

But, if I restrict the query to an interval of reftimes starting before 2011-12-21, I still see the old data. For example, limiting to variable 78, I get:

$ arki-query --summary --dump 'area:VM2,6257;product:VM2,78' $DS
SummaryItem:
  Product: VM2(78, bcode=B12101, l1=2000, lt1=103, p1=0, p2=3600, tr=0, unit=C)
  Area: VM2(6257,lat=4436161, lon=1192193, rep=locali)
SummaryStats:
  Count: 86525
  Size: 3156064
  Reftime: 2011-12-21T11:00:00Z to 2022-02-02T08:00:00Z

$ arki-query --summary --dump "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" $DS
SummaryItem:
  Product: VM2(78, bcode=B12101, l1=2000, lt1=103, p1=0, p2=3600, tr=0, unit=C)
  Area: VM2(6257,lat=4436161, lon=1192193, rep=locali)
SummaryStats:
  Count: 237
  Size: 8562
  Reftime: 2011-12-15T00:00:00Z to 2011-12-25T23:00:00Z

$ arki-query --data "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" $DS | head
201112150000,6257,78,6.5,,,000000000
201112150100,6257,78,6.2,,,000000000
201112150200,6257,78,5.9,,,000000000
201112150300,6257,78,5.8,,,000000000
201112150400,6257,78,5.7,,,000000000
201112150500,6257,78,5.7,,,000000000
201112150600,6257,78,5.6,,,000000000
201112150700,6257,78,5.5,,,000000000
201112150800,6257,78,5.5,,,000000000
201112150900,6257,78,5.5,,,000000000
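Since the first comma-separated field of each VM2 line is the reference time in YYYYMMDDhhmm form, surviving pre-cutoff records can also be counted directly from the data. A minimal sketch (`check_cutoff` is a hypothetical helper, not part of arkimet):

```shell
# Count VM2 records on stdin whose reference time (field 1, in
# YYYYMMDDhhmm form) falls before the given cutoff.
check_cutoff() {
    awk -F, -v cutoff="$1" '$1 < cutoff { n++ } END { print n+0 }'
}

# Assumed usage: arki-query --data '...' $DS | check_cutoff 201112210000
# A dataset where the deletion succeeded should print 0.
```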

What am I doing wrong?

Thanks

mnuccioarpae commented 2 years ago

rm $DS/.summaries/* fixed the problem of wrong summary data.

However the data is still there even after having repeated the deletion.

spanezz commented 2 years ago

I tried to reproduce the problem and it works as expected:

$ cd /arkivio/arkimet/dataset/locali
$ tar acf ~/issue286.tar.gz config 2011/12-*
$ cd ~
$ mkdir test
$ cd test
$ tar axf ~/issue286.tar.gz
$ arki-query --data 'area:VM2,6257;reftime:<2011-12-21' . | wc -l
25157
$ arki-query 'area:VM2,6257;reftime:<2011-12-21' . > /tmp/cancellare
(the cancellare file is 5.8M)
$ arki-check --fix --remove /tmp/cancellare  .
(the timestamps of the .index files up to 12-20.vm2.index have changed)
$ arki-query --data 'area:VM2,6257;reftime:<2011-12-21' . | wc -l
0

But once I try copypasting your query, I get data:

$ arki-query --data "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" . | wc -l
109

On closer inspection, the data has been removed up to the 21st of December, but the later queries extend to the 25th of December, so they correctly show data between the 21st and the 25th.

It seems that arkimet is working as expected, but it's really hard to spot the difference between 21 and 25 among all those numbers (it took me quite a while to see it, too).

mnuccioarpae commented 2 years ago

@spanezz if you look at the results, you can see that the reftime of the first record is 201112150000, which is 15 Dec 2011.

I suspect the problem is not reproducible on a small dataset. I have removed decades of data. The arki-check command took a long time to finish.

Maybe I can try removing the data in batches of smaller subsets, for example, one variable at a time.
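A batch deletion along those lines could be sketched as follows, under the assumption that the dataset layout matches the commands above. The product ids in the loop and the `delete_product_batch` helper are placeholders, not part of arkimet:

```shell
# Sketch: delete pre-2011-12-21 data for station 6257 one product at a
# time, keeping each extracted metadata file as a backup.
delete_product_batch() {
    ds=$1; prod=$2
    out="6257-${prod}-da-eliminare.arkimet"
    # Extract the matching metadata first (this also serves as the backup)
    arki-query "area:VM2,6257;product:VM2,${prod};reftime:<2011-12-21" "$ds" > "$out" || return 1
    # Mark the extracted records as deleted
    arki-check --fix --remove "$out" "$ds"
}

DS=/arkivio/arkimet/dataset/locali
if command -v arki-check >/dev/null 2>&1; then
    for prod in 78 139 158; do   # placeholder product list
        delete_product_batch "$DS" "$prod" || { echo "failed on product $prod" >&2; break; }
    done
fi
```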

spanezz commented 2 years ago

It may be time to optimize deletion by directly passing a query to arki-check --remove, indeed.

Since we both worked on the same dataset, it looks like the data had not yet been deleted when I took my copy of it. I now suspect something went wrong when you ran the deletion, and that it worked when I ran it on a subset of the dataset.

I'm also considering making arki-check --remove print statistics on the number of elements deleted.

mnuccioarpae commented 2 years ago

Unfortunately, I did not notice the error immediately, because I only checked the summary with arki-query --summary, not the data, and the summary was wrong.

The arki-check command did not print any error message, and I forgot to check the returned status with "exit $?". So I cannot be sure that it completed without errors. However, the summaries were updated to reflect the requested deletion. Is this done only at the end of the transaction?

Maybe we can check that the expected result is consistent with the final result. For example, we can check that the number of records before and after is correct.
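A consistency check along these lines could be sketched as follows. This is a sketch, not arkimet functionality: `count_records` and `verify_removal` are hypothetical helpers, and it assumes the metadata file passed to arki-check was produced by the same query:

```shell
# Count the records matching a query, reading the data itself rather
# than the (possibly stale) summaries.
count_records() {
    # $1 = query, $2 = dataset path
    arki-query --data "$1" "$2" | wc -l
}

# Hypothetical check: run the deletion, then verify that no record
# matching the deletion query survives, and report how many went away.
verify_removal() {
    query=$1; ds=$2; metadata_file=$3
    before=$(count_records "$query" "$ds")
    arki-check --fix --remove "$metadata_file" "$ds" || return 1
    after=$(count_records "$query" "$ds")
    [ "$after" -eq 0 ] && echo "removed $((before - after)) records"
}
```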

mnuccioarpae commented 2 years ago

It may be time to optimize deletion by directly passing a query to arki-check --remove, indeed.

An advantage of the current system is that it forces you to backup the deleted data.

spanezz commented 2 years ago

It may be time to optimize deletion by directly passing a query to arki-check --remove, indeed.

An advantage of the current system is that it forces you to backup the deleted data.

That is a very good point; I never considered it that way. It's a tricky backup, since the results of the query do not contain the data themselves. But until the dataset is repacked, the results of the query should contain valid references to the deleted data still present in the dataset.

mnuccioarpae commented 2 years ago

It's a tricky backup, since the results of the query do not contain the data

I did not know this! I believed that it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet"

spanezz commented 2 years ago

The arki-check command did not print any error message, and I forgot to check the returned status with "exit $?". So I cannot be sure that it completed without errors. However, the summaries were updated to reflect the requested deletion. Is this done only at the end of the transaction?

In theory the summaries are deleted while the data is deleted, and regenerated at the end. The actual deletion is performed in the index files, which are the main files you should expect to see modified by the deletion.
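One way to see which segments a deletion touched, then, is to compare the .index timestamps against a marker file created before running arki-check. A minimal sketch (the marker path and dataset path are assumptions):

```shell
# Record a reference timestamp before the deletion...
marker=$(mktemp)

# ...run `arki-check --fix --remove ...` here...

# ...then list the .index files whose mtime is newer than the marker,
# i.e. the index files the deletion modified.
DS=${DS:-/arkivio/arkimet/dataset/locali}
if [ -d "$DS" ]; then
    find "$DS" -name '*.index' -newer "$marker"
fi
rm -f "$marker"
```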

edigiacomo commented 2 years ago

I did not know this! I believed that it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet"

Sorry @spanezz but maybe I don't understand which query you're referring to:

# Save the result of the query
[arkimet@arkioss8 ~]$ arki-query 'reftime:=today 00:00' /arkivio/arkimet/dataset/boa > /tmp/buttami.arkimet
# Copy the result in my laptop
[edg 🫒  ~]$ rsync -avz arkimet@arkioss:/tmp/buttami.arkimet .
# Check that the original path is not reachable
[edg 🫒  ~]$ arki-scan --dump buttami.arkimet | head -n 1
Source: BLOB(vm2,/arkivio/arkimet/dataset/boa/2022/02-09.vm2:0+37)
[edg 🫒  ~]$ ls /arkivio/arkimet/dataset/boa
ls: cannot access '/arkivio/arkimet/dataset/boa': No such file or directory
# Extract the data
[edg 🫒  ~]$ arki-scan --data buttami.arkimet | head
202202090000,12626,139,49,,,000000000
202202090000,12626,158,7.7,,,000000000
202202090000,12626,164,0,,,000000000
202202090000,12626,166,0.4,,,000000000
202202090000,12626,629,0.13,,,000000000
202202090000,12626,631,3.6,,,000000000
202202090000,12626,632,3.3,,,000000000
202202090000,12626,683,-0.09,,,000000000
202202090000,12626,684,0.25,,,000000000
202202090000,12628,139,41,,,000000000
spanezz commented 2 years ago

I did not know this! I believed that it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet"

Sorry @spanezz but maybe I don't understand which query you're referring to:

[…]

Ah interesting, then this works for VM2 data only, because the metadata contain enough information to reconstruct the data. For other formats, I would expect this not to work.

edigiacomo commented 2 years ago

Nice! Is it possible that this feature is format-agnostic and enabled by the smallfiles option (https://github.com/ARPA-SIMC/arkimet/blob/v1.40-1/doc/datasets.rst)?

spanezz commented 2 years ago

No, smallfiles is only supported for VM2, since a VM2 datum can be reconstructed from its arkimet metadata plus a small string. For all other formats it does not make any sense, since to preserve the data one has to copy all of it after the metadata, and that's what --inline does.

It is not, however, possible to delete data using the output of arki-query --inline, because with --inline the reference to the data in the dataset is replaced with the data itself.
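For non-VM2 formats, then, a self-contained backup before a deletion would presumably need --inline, kept separate from the plain metadata file passed to arki-check. A sketch (`backup_inline` and the file names are hypothetical):

```shell
# Hypothetical helper: save a query result with the data embedded, so
# the backup survives a later repack of the dataset.
backup_inline() {
    query=$1; ds=$2; outfile=$3
    arki-query --inline "$query" "$ds" > "$outfile"
}

# The file written by backup_inline can be read back with arki-scan,
# but (as noted above) cannot be fed to arki-check --remove.
```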

spanezz commented 2 years ago

I've reworked deletion for iseg datasets (which are now the only datasets that support deletion) to group data to delete by segment, and do one transaction per segment. The result should be much faster.

I've also added, with --verbose, feedback for each segment:

$ time arki-check --fix --remove 6257-da-eliminare.arkimet  test1 --verbose
INFO test1: 2011/12-19.vm2: 1297 data marked as deleted
INFO test1: 2011/12-18.vm2: 1297 data marked as deleted
INFO test1: 2011/12-17.vm2: 1297 data marked as deleted
INFO test1: 2011/12-14.vm2: 1296 data marked as deleted
INFO test1: 2011/12-20.vm2: 545 data marked as deleted
INFO test1: 2011/12-01.vm2: 1293 data marked as deleted
INFO test1: 2011/12-15.vm2: 1297 data marked as deleted
INFO test1: 2011/12-13.vm2: 1297 data marked as deleted
INFO test1: 2011/12-02.vm2: 1295 data marked as deleted
INFO test1: 2011/12-03.vm2: 1294 data marked as deleted
INFO test1: 2011/12-04.vm2: 1295 data marked as deleted
INFO test1: 2011/12-16.vm2: 1296 data marked as deleted
INFO test1: 2011/12-05.vm2: 1297 data marked as deleted
INFO test1: 2011/12-07.vm2: 1297 data marked as deleted
INFO test1: 2011/12-11.vm2: 1289 data marked as deleted
INFO test1: 2011/12-06.vm2: 1297 data marked as deleted
INFO test1: 2011/12-08.vm2: 1294 data marked as deleted
INFO test1: 2011/12-09.vm2: 1297 data marked as deleted
INFO test1: 2011/12-10.vm2: 1293 data marked as deleted
INFO test1: 2011/12-12.vm2: 1294 data marked as deleted
INFO test1: 25157 data marked as deleted

real    0m4.361s
user    0m1.884s
sys 0m1.120s
spanezz commented 2 years ago

Redoing the deletion with current master should be much faster and produce far less boring output.

Hopefully the data will also stay deleted, but I still have no idea why it didn't get deleted when you first tried :(

edigiacomo commented 2 years ago

@mnuccioarpae arkimet 1.41-1 is available in the Copr repository.