mnuccioarpae opened this issue 2 years ago (status: Open)
Running rm $DS/.summaries/* fixed the problem of wrong summary data.
However, the data is still there even after repeating the deletion.
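For what it's worth, if the summary cache ever grows large enough that the glob hits the shell's argument-length limit, the same cleanup can be done with find. A minimal sketch on a throwaway directory standing in for the real dataset (the .summary file names here are made up for illustration):

```shell
# Stand-in dataset layout; the real one would be under $DS
DS=./demo-dataset
mkdir -p "$DS/.summaries"
# Fabricated cache files, just to have something to delete
touch "$DS/.summaries/2011-12.summary" "$DS/.summaries/2012-01.summary"

# Delete every cached summary file, however many there are
find "$DS/.summaries" -type f -delete

# The cache directory is now empty
ls -A "$DS/.summaries" | wc -l
```

Unlike the glob, find never expands the file list onto a command line, so it works regardless of how many cached summaries have accumulated.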
I tried to reproduce the problem and it works as expected:
$ cd /arkivio/arkimet/dataset/locali
$ tar acf ~/issue286.tar.gz config 2011/12-*
$ cd ~
$ mkdir test
$ cd test
$ tar axf ~/issue286.tar.gz
$ arki-query --data 'area:VM2,6257;reftime:<2011-12-21' . | wc -l
25157
$ arki-query 'area:VM2,6257;reftime:<2011-12-21' . > /tmp/cancellare
(the file cancellare is 5.8M)
$ arki-check --fix --remove /tmp/cancellare .
(the timestamps of the .index files up to 12-20.vm2.index have changed)
$ arki-query --data 'area:VM2,6257;reftime:<2011-12-21' . | wc -l
0
But when I try copy-pasting your query, I get data:
$ arki-query --data "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" . | wc -l
109
On closer inspection, the data has been removed up to the 21st of December, but the later queries extend to the 25th of December, so they correctly show data between the 21st and the 25th.
It seems that arkimet is working as expected, but it's really hard to spot the difference between 21 and 25 among all those numbers (it took me quite a while to see it, too).
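A quick way to spot which days still have matching data, instead of eyeballing the raw lines, is to summarize the reftime column of the VM2 output. A sketch with fabricated sample lines standing in for the real query output:

```shell
# Count matching records per day from VM2 CSV output.
# The printf lines are made-up samples; in practice you would pipe
# `arki-query --data '...' .` into the same cut/sort/uniq pipeline.
printf '%s\n' \
    '201112210000,6257,78,1.2,,,000000000' \
    '201112220000,6257,78,1.3,,,000000000' \
    '201112220000,6257,78,1.4,,,000000000' |
cut -d, -f1 |   # keep only the reftime column
cut -c1-8  |    # truncate YYYYMMDDhhmm to YYYYMMDD
sort | uniq -c  # records per day
```

This immediately shows which days the remaining records fall on, making a 21-vs-25 mixup hard to miss.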
@spanezz if you look at the results, you can see that the reftime of the first record is 201112150000, which is 15 Dec 2011.
I suspect the problem is not reproducible on a small dataset. I have removed decades of data. The arki-check command took a long time to finish.
Maybe I can try removing the data in batches of smaller subsets, for example, one variable at a time.
It may be time to optimize deletion by directly passing a query to arki-check --remove, indeed.
Since we both worked on the same dataset, it looks like the data had not yet been deleted when I took a copy of it. I now suspect something went wrong in the deletion when you ran it, and it worked when I ran it on a subset of the dataset.
I'm also considering making arki-check --remove print statistics on the number of elements deleted.
Unfortunately, I did not notice the error immediately, because I checked only the summary with arki-query --summary, not the data, and the summary was wrong.
The arki-check command did not print any error message, but I forgot to check the exit status with "echo $?", so I cannot be sure that it completed without errors. However, the summaries were updated to reflect the requested deletion. Is this done only at the end of the transaction?
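For the record, the exit status has to be read immediately after the command (or the command tested directly in an if); a minimal sketch, with `false` standing in for a failing arki-check run:

```shell
# Stand-in for: arki-check --fix --remove /tmp/cancellare .
false
status=$?    # must be captured right away, before any other command

if [ "$status" -ne 0 ]; then
    echo "arki-check failed with status $status" >&2
else
    echo "arki-check completed successfully"
fi
```

Note that `exit $?` would terminate the shell rather than report the status; `echo $?` (or capturing it as above) is what checks it.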
Maybe we can check that the expected result is consistent with the final result; for example, that the difference between the number of records before and after the run matches the number we asked to remove.
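That consistency check could be scripted around the counts that `arki-query --data ... | wc -l` reports before and after the run. A hypothetical helper (the function name is made up; the 25157 figure is the count from the reproduction above):

```shell
# Hypothetical helper: verify that a deletion removed exactly the
# expected number of records. `before` and `after` would come from
# `arki-query --data "$QUERY" "$DS" | wc -l` around the arki-check run.
check_deletion() {
    before=$1; after=$2; expected=$3
    if [ $((before - after)) -eq "$expected" ]; then
        echo "OK: removed $expected records"
    else
        echo "MISMATCH: expected to remove $expected, actually removed $((before - after))" >&2
        return 1
    fi
}

check_deletion 25157 0 25157   # the counts seen in the reproduction
```

A non-zero return from the helper would have flagged the silent failure described above.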
It may be time to optimize deletion by directly passing a query to arki-check --remove, indeed.
An advantage of the current system is that it forces you to backup the deleted data.
That is a very good point; I never considered it that way. It's a tricky backup, since the results of the query do not contain the data. But until the dataset is repacked, the results of the query should contain valid references to the deleted data still in the dataset.
It's a tricky backup, since the results of the query do not contain the data
I did not know this! I believed it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet"
The arki-check command did not print any error message, but I forgot to check the exit status with "echo $?", so I cannot be sure that it completed without errors. However, the summaries were updated to reflect the requested deletion. Is this done only at the end of the transaction?
In theory, the summaries are deleted while the data is deleted, and regenerated at the end. The actual deletion is performed in the index files, which are the main files you should expect to see modified by the deletion.
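Since the index files are the ones being rewritten, one way to see exactly which segments a run touched is to compare their mtimes against a reference file created just before the arki-check invocation. A sketch on a fabricated layout (directory and file names are made up to mirror the transcript; the reference file is back-dated here only so the demo is deterministic, in real use a plain `touch` right before arki-check suffices):

```shell
mkdir -p demo/2011
# Fabricated index files, both dated before the run
touch -d '2011-12-01 00:00' demo/2011/12-01.vm2.index
touch -d '2011-12-01 00:00' demo/2011/12-20.vm2.index

# Reference point taken before "arki-check" runs
touch -d '2012-01-01 00:00' ref_before_check

# Stand-in for arki-check rewriting one index during the deletion
touch demo/2011/12-20.vm2.index

# Lists only the index files modified since the reference point
find demo -name '*.index' -newer ref_before_check
```

This prints just the rewritten segment index, matching the "timestamps changed up to 12-20.vm2.index" observation earlier in the thread.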
I did not know this! I believed it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet"
Sorry @spanezz but maybe I don't understand which query you're referring to:
# Save the result of the query
[arkimet@arkioss8 ~]$ arki-query 'reftime:=today 00:00' /arkivio/arkimet/dataset/boa > /tmp/buttami.arkimet
# Copy the result in my laptop
[edg 🫒 ~]$ rsync -avz arkimet@arkioss:/tmp/buttami.arkimet .
# Check that the original path is not reachable
[edg 🫒 ~]$ arki-scan --dump buttami.arkimet | head -n 1
Source: BLOB(vm2,/arkivio/arkimet/dataset/boa/2022/02-09.vm2:0+37)
[edg 🫒 ~]$ ls /arkivio/arkimet/dataset/boa
ls: cannot access '/arkivio/arkimet/dataset/boa': No such file or directory
# Extract the data
[edg 🫒 ~]$ arki-scan --data buttami.arkimet | head
202202090000,12626,139,49,,,000000000
202202090000,12626,158,7.7,,,000000000
202202090000,12626,164,0,,,000000000
202202090000,12626,166,0.4,,,000000000
202202090000,12626,629,0.13,,,000000000
202202090000,12626,631,3.6,,,000000000
202202090000,12626,632,3.3,,,000000000
202202090000,12626,683,-0.09,,,000000000
202202090000,12626,684,0.25,,,000000000
202202090000,12628,139,41,,,000000000
Ah, interesting: then it works for VM2 data only, because the metadata contain enough information to reconstruct the data. For other formats, I would expect this not to work.
Nice! Is it possible that this feature is format-agnostic and it's enabled by the smallfiles option (https://github.com/ARPA-SIMC/arkimet/blob/v1.40-1/doc/datasets.rst)?
No, smallfiles is only supported for VM2, since VM2 data can be reconstructed from its arkimet metadata plus a small string. For all other formats it does not make any sense, since to preserve the data one has to copy all of it after the metadata, and that's what --inline does.
It is not, however, possible to delete data from the output of arki-query --inline, because with --inline the reference to the data in the dataset is replaced with the data itself.
I've reworked deletion for iseg datasets (which are now the only datasets that support deletion) to group data to delete by segment, and do one transaction per segment. The result should be much faster.
I've also added, with --verbose, feedback for each segment:
$ time arki-check --fix --remove 6257-da-eliminare.arkimet test1 --verbose
INFO test1: 2011/12-19.vm2: 1297 data marked as deleted
INFO test1: 2011/12-18.vm2: 1297 data marked as deleted
INFO test1: 2011/12-17.vm2: 1297 data marked as deleted
INFO test1: 2011/12-14.vm2: 1296 data marked as deleted
INFO test1: 2011/12-20.vm2: 545 data marked as deleted
INFO test1: 2011/12-01.vm2: 1293 data marked as deleted
INFO test1: 2011/12-15.vm2: 1297 data marked as deleted
INFO test1: 2011/12-13.vm2: 1297 data marked as deleted
INFO test1: 2011/12-02.vm2: 1295 data marked as deleted
INFO test1: 2011/12-03.vm2: 1294 data marked as deleted
INFO test1: 2011/12-04.vm2: 1295 data marked as deleted
INFO test1: 2011/12-16.vm2: 1296 data marked as deleted
INFO test1: 2011/12-05.vm2: 1297 data marked as deleted
INFO test1: 2011/12-07.vm2: 1297 data marked as deleted
INFO test1: 2011/12-11.vm2: 1289 data marked as deleted
INFO test1: 2011/12-06.vm2: 1297 data marked as deleted
INFO test1: 2011/12-08.vm2: 1294 data marked as deleted
INFO test1: 2011/12-09.vm2: 1297 data marked as deleted
INFO test1: 2011/12-10.vm2: 1293 data marked as deleted
INFO test1: 2011/12-12.vm2: 1294 data marked as deleted
INFO test1: 25157 data marked as deleted
real 0m4.361s
user 0m1.884s
sys 0m1.120s
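As a sanity check on that output, the per-segment counts can be summed and compared with the final total line. A sketch feeding two of the log lines above through awk (in practice one would pipe the whole arki-check output through the same filter; matching on the `.vm2:` field skips the final total line, which has one field fewer):

```shell
# Sum the per-segment "N data marked as deleted" counts
printf '%s\n' \
    'INFO test1: 2011/12-19.vm2: 1297 data marked as deleted' \
    'INFO test1: 2011/12-20.vm2: 545 data marked as deleted' |
awk '$3 ~ /\.vm2:/ { total += $4 } END { print total }'
```

If the sum disagrees with the total reported on the last line, something went wrong mid-run.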
Redoing the deletion with current master should be much faster and have a far less boring output.
Hopefully data should also stay deleted, but I still have no idea how come it didn't get deleted when you first tried :(
@mnuccioarpae arkimet 1.41-1 is available in the Copr repository.
I need to delete the data before 2011-12-21 for station 6257. So I ran the following commands:
As a check, I ran the following command:
But if I restrict the query to an interval of reftimes starting before 2011-12-21, I see the old data. For example, limiting to variable 78, I get:
What am I doing wrong?
Thanks