Open brancomat opened 1 year ago
The first thing that comes to mind, now that we have dismissed the old ondisk2 format, is that it is technically possible to arki-check only selected parts of a dataset. I could add an option to arki-check that takes a subpath to check inside the dataset, which would speed up investigating issues seen in a specific location.
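To illustrate the idea (this is a hypothetical sketch, not arki-check's actual implementation or an existing option; the helper name, the `.bufr` extension, and the flat file layout are all assumptions), restricting a check to a subpath is essentially limiting the directory walk to one subtree:

```python
from pathlib import Path

def segments_to_check(dataset_root, subpath=None):
    """Yield segment file paths, optionally limited to a subpath of the dataset.

    Hypothetical helper: assumes segments are plain ``*.bufr`` files laid
    out under the dataset root.
    """
    root = Path(dataset_root)
    # Walk only the requested subtree, so unrelated segments are skipped.
    base = root / subpath if subpath else root
    return sorted(p for p in base.rglob("*.bufr") if p.is_file())
```

With a layout like `2023/01/a.bufr`, passing `subpath="2023/01"` would check only that month's segments instead of the whole 95 GB archive.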
In the `buffer does not start with 'BUFR'` case, we're deep into manual-intervention territory: the index says that a BUFR message starts at some location, but at that location the expected `BUFR` string is not found. The index cannot be trusted, and a query on that segment as it is will return garbage.
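The check being described boils down to something like the following (a minimal sketch, not arkimet's real index API; the function name and the idea of passing offsets explicitly are assumptions for illustration): every BUFR message starts with the 4-byte magic string `BUFR`, so data at an indexed offset that doesn't start with those bytes means the index and the segment disagree.

```python
def offsets_look_sane(segment_path, offsets):
    """Return the offsets whose data does NOT start with b'BUFR'.

    Hypothetical helper: ``offsets`` stands in for the message start
    positions an index would record for this segment.
    """
    bad = []
    with open(segment_path, "rb") as f:
        for off in offsets:
            f.seek(off)
            # Every BUFR message begins with this 4-byte magic string.
            if f.read(4) != b"BUFR":
                bad.append(off)
    return bad
```

An empty result means the indexed offsets at least point at plausible message starts; any offset in the result is exactly the "buffer does not start with 'BUFR'" situation above.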
I guess arki-check can still do better than explode in this case. I could, for example, make it scream loudly and decide that the index needs to be rebuilt. Would that work?
> I guess arki-check can still do better than explode in this case. I could, for example, make it scream loudly and decide that the index needs to be rebuilt. Would that work?
Yes, but consider that this was an edge case: the BUFR data was manipulated while keeping the old indexes. This kind of issue needs human attention in any case, so maybe a lighter approach would be to simply point out which file has the problem (it wasn't clear in the output), suggesting a possible discrepancy between the file and its index (the "BUFR validation failed" message led me to think the data itself was an invalid BUFR, which wasn't the case).
I have a moderately large archive (95 GB, type `iseg`, format `bufr`) with some issues:

The question is: what is the best way to investigate the error? It's not clear whether the error is related to the last logged file (the `--debug` flag didn't add any significant output). I tried an `arki-query --yaml --summary-short` on that file and the output seems OK, but I don't know whether BUFR validation is triggered by a simple `arki-query`. Now I've started an `arki-check --state` of the dataset, but it's a bit time-consuming and I'm not sure if it's the right choice.
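While waiting for a full check, one quick sanity test that doesn't depend on the index at all is to scan the raw segment for every occurrence of the `BUFR` magic string and see where messages plausibly start (a hedged sketch, not an arkimet tool; the function name is an assumption, and a match can in principle also occur inside message payloads, so this only gives candidate starts to compare against what the index claims):

```python
def scan_bufr_starts(segment_path):
    """Return byte offsets of every b'BUFR' occurrence in the file.

    Hypothetical helper: candidate message starts, to be compared
    manually against the offsets recorded in the segment's index.
    """
    with open(segment_path, "rb") as f:
        data = f.read()
    starts = []
    pos = data.find(b"BUFR")
    while pos != -1:
        starts.append(pos)
        # Skip past this magic string before searching again.
        pos = data.find(b"BUFR", pos + 4)
    return starts
```

If the offsets found this way diverge from those in the index, that confirms the "file manipulated while keeping the old indexes" scenario rather than corrupt BUFR data.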