Open brancomat opened 1 year ago
The first thing that comes to mind, now that we have dismissed the old ondisk2 format, is that it is technically possible to arki-check only selected parts of a dataset. I could add an option to arki-check that takes a subpath to check inside the dataset, which would speed up investigating issues seen in a specific location.
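To illustrate the idea (this is a hypothetical sketch, not arki-check's actual implementation or an existing option; the helper name, the `.bufr` extension, and the flat file layout are all assumptions), restricting a check to a subpath is essentially limiting the directory walk to one subtree:

```python
from pathlib import Path

def segments_to_check(dataset_root, subpath=None):
    """Yield segment file paths, optionally limited to a subpath of the dataset.

    Hypothetical helper: assumes segments are plain ``*.bufr`` files laid
    out under the dataset root.
    """
    root = Path(dataset_root)
    # Walk only the requested subtree, so unrelated segments are skipped.
    base = root / subpath if subpath else root
    return sorted(p for p in base.rglob("*.bufr") if p.is_file())
```

With a layout like `2023/01/a.bufr`, passing `subpath="2023/01"` would check only that month's segments instead of the whole 95 GB archive.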
In the `buffer does not start with 'BUFR'` case, we're deep into manual-intervention territory: the index says that a BUFR message starts at some location, but at that location the expected `BUFR` string is not found. The index cannot be trusted, and a query on that segment as it is will return garbage.
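The check being described boils down to something like the following (a minimal sketch, not arkimet's real index API; the function name and the idea of passing offsets explicitly are assumptions for illustration): every BUFR message starts with the 4-byte magic string `BUFR`, so data at an indexed offset that doesn't start with those bytes means the index and the segment disagree.

```python
def offsets_look_sane(segment_path, offsets):
    """Return the offsets whose data does NOT start with b'BUFR'.

    Hypothetical helper: ``offsets`` stands in for the message start
    positions an index would record for this segment.
    """
    bad = []
    with open(segment_path, "rb") as f:
        for off in offsets:
            f.seek(off)
            # Every BUFR message begins with this 4-byte magic string.
            if f.read(4) != b"BUFR":
                bad.append(off)
    return bad
```

An empty result means the indexed offsets at least point at plausible message starts; any offset in the result is exactly the "buffer does not start with 'BUFR'" situation above.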
I guess arki-check can still do better than explode in this case. I could, for example, make it scream loudly and decide that the index needs to be rebuilt. Would that work?
> I guess arki-check can still do better than explode in this case. I could, for example, make it scream loudly and decide that the index needs to be rebuilt. Would that work?
Yes, but consider that this was an edge case: the BUFR data was manipulated while keeping the old indexes. This kind of issue needs human attention in any case, so maybe a lighter approach would be to simply point out which file has the problem (it wasn't clear in the output), suggesting a possible discrepancy between the file and its index (the "BUFR validation failed" message led me to think the data itself was an invalid BUFR, which wasn't the case).
I have a moderately large archive (95 GB, type `iseg`, format `bufr`) with some issues:

The question is: what is the best way to investigate the error? It's not clear whether the error is related to the last logged file (the `--debug` flag didn't add any significant output). I tried an `arki-query --yaml --summary-short` on that file and the output seems OK, but I don't know whether BUFR validation is triggered by a simple `arki-query`. Now I've started an `arki-check --state` of the dataset, but it's a bit time-consuming and I'm not sure if it's the right choice.
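While waiting for a full check, one quick sanity test that doesn't depend on the index at all is to scan the raw segment for every occurrence of the `BUFR` magic string and see where messages plausibly start (a hedged sketch, not an arkimet tool; the function name is an assumption, and a match can in principle also occur inside message payloads, so this only gives candidate starts to compare against what the index claims):

```python
def scan_bufr_starts(segment_path):
    """Return byte offsets of every b'BUFR' occurrence in the file.

    Hypothetical helper: candidate message starts, to be compared
    manually against the offsets recorded in the segment's index.
    """
    with open(segment_path, "rb") as f:
        data = f.read()
    starts = []
    pos = data.find(b"BUFR")
    while pos != -1:
        starts.append(pos)
        # Skip past this magic string before searching again.
        pos = data.find(b"BUFR", pos + 4)
    return starts
```

If the offsets found this way diverge from those in the index, that confirms the "file manipulated while keeping the old indexes" scenario rather than corrupt BUFR data.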