Tolerant mode - Githubissues

ales-t / rjp

Rapid JSON-lines processor

Apache License 2.0


Tolerant mode #7

Open ales-t opened 2 years ago

ales-t commented 2 years ago

Currently, rjp will stop when encountering any error, such as:

- Select/rename not finding the required fields in an instance.
- serde failing to parse an input line.
- Join not finding the keys to join on in an instance.
- Merge when the stream lengths are mismatched.
- ...

Oftentimes, JSON lines files are noisy and contain lines with problems. It would be helpful to have some way to request "tolerant" behavior. For example, rename_field would not change an instance if the input fields are not found. Alternatively, problematic instances may be skipped in the output stream.
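To make the two tolerant behaviors concrete, here is a minimal sketch of a rename step. Everything in it is an assumption for illustration: `Instance` is a plain string map standing in for a parsed JSON object, and `rename_strict`, `rename_keep` and `rename_skip` are hypothetical names, not rjp's actual functions.

```rust
use std::collections::BTreeMap;

/// Stand-in for one parsed JSON-lines instance (assumption: rjp's real type differs).
type Instance = BTreeMap<String, String>;

/// Strict behavior: fail when the field `from` is missing.
fn rename_strict(mut inst: Instance, from: &str, to: &str) -> Result<Instance, String> {
    match inst.remove(from) {
        Some(value) => {
            inst.insert(to.to_string(), value);
            Ok(inst)
        }
        None => Err(format!("field '{}' not found", from)),
    }
}

/// Tolerant variant 1: leave the instance unchanged when `from` is missing.
fn rename_keep(mut inst: Instance, from: &str, to: &str) -> Instance {
    if let Some(value) = inst.remove(from) {
        inst.insert(to.to_string(), value);
    }
    inst
}

/// Tolerant variant 2: drop the instance from the output stream (`None`).
fn rename_skip(inst: Instance, from: &str, to: &str) -> Option<Instance> {
    rename_strict(inst, from, to).ok()
}
```

With the "keep" variant a noisy instance passes through untouched; with the "skip" variant it disappears from the output, so downstream processors never see it.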

It's currently not clear to me how the various processors should behave in this tolerant setting, or whether there should be several modes to choose from (for instance --skip-bad-instances, --stop-on-bad-instance, --keep-bad-instances?).

zouharvi commented 2 years ago

Sounds like every processor should have its own handling of errors (the default currently being panic). They should be modified by these flags but the question is whether we should also have flags that are specific to certain processors. I'd somewhat prefer if the number of optional command line arguments was kept as low as possible to make the CLI easier to use.

Also maybe they should not be flags but rather enums like --bad-instance {stop,skip,keep}? The three options are mutually exclusive, and we would want to error on a combination such as rjp --skip-bad-instances --stop-on-bad-instance anyway.
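A rough sketch of the enum idea, using only the standard library. The names `BadInstancePolicy` and `parse_policy` are made up for this example, the hand-rolled argument scan stands in for whatever argument parser rjp actually uses, and defaulting to `Stop` is an assumption (the default is still an open question in this thread).

```rust
use std::str::FromStr;

/// Hypothetical policy for bad instances; the variants are mutually exclusive by construction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BadInstancePolicy {
    Stop,
    Skip,
    Keep,
}

impl FromStr for BadInstancePolicy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "stop" => Ok(Self::Stop),
            "skip" => Ok(Self::Skip),
            "keep" => Ok(Self::Keep),
            other => Err(format!(
                "invalid value '{}' for --bad-instance (expected stop, skip or keep)",
                other
            )),
        }
    }
}

/// Scan the arguments for `--bad-instance VALUE`.
/// Assumption: the policy defaults to `Stop` when the option is absent.
fn parse_policy(args: &[&str]) -> Result<BadInstancePolicy, String> {
    let mut policy = BadInstancePolicy::Stop;
    let mut it = args.iter();
    while let Some(arg) = it.next() {
        if *arg == "--bad-instance" {
            let value = it
                .next()
                .ok_or_else(|| "--bad-instance needs a value".to_string())?;
            policy = value.parse()?;
        }
    }
    Ok(policy)
}
```

Because a single option carries one value, the conflicting combination of two separate flags cannot be expressed at all, so no extra validation is needed for it.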

What is going to be the default? --stop-on-bad-instance? It's intuitive, but then we should expect half-processed outputs, which does not seem good.

The bad instance count should definitely go into the final summary on stderr, which already exists.
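The interaction between the policy, partial output, and the summary count can be sketched roughly as follows. This is not rjp's code: `Policy`, `run`, and the summary wording are hypothetical, and the output and summary sinks are generic writers so that stdout and stderr can be swapped for in-memory buffers.

```rust
use std::io::Write;

/// Hypothetical global policy for bad instances.
#[derive(Clone, Copy)]
enum Policy {
    Stop,
    Skip,
    Keep,
}

/// Apply `process` to every line and write the results to `out`.
/// Returns the number of bad instances, or the first error under `Policy::Stop`.
/// Under `Stop`, lines already written to `out` stay there (half-processed output).
fn run<W: Write, E: Write>(
    lines: &[&str],
    process: impl Fn(&str) -> Result<String, String>,
    policy: Policy,
    out: &mut W,
    summary: &mut E,
) -> Result<usize, String> {
    let mut bad = 0usize;
    for line in lines {
        match process(line) {
            Ok(result) => writeln!(out, "{}", result).map_err(|e| e.to_string())?,
            Err(reason) => {
                bad += 1;
                match policy {
                    Policy::Stop => return Err(reason),
                    Policy::Skip => {}
                    // Keep: pass the original line through unchanged.
                    Policy::Keep => writeln!(out, "{}", line).map_err(|e| e.to_string())?,
                }
            }
        }
    }
    // Assumed summary format; the real summary line may look different.
    writeln!(summary, "bad instances: {}", bad).map_err(|e| e.to_string())?;
    Ok(bad)
}
```

In a real binary, `out` would be `std::io::stdout()` and `summary` would be `std::io::stderr()`. The sketch also shows the concern about `Stop` as a default: by the time the error is returned, earlier results have already been emitted.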

ales-t commented 2 years ago

> Sounds like every processor should have its own handling of errors (the default currently being panic).

If you find that rjp panics, please create an issue. I'm under the impression that all errors are now transformed into RjpError and propagated into main in a clean way.

> Sounds like every processor should have its own handling of errors (the default currently being panic). They should be modified by these flags but the question is whether we should also have flags that are specific to certain processors.

I agree -- each processor should have its own way of interpreting the flag, but for simplicity, the flag itself should be global. If you need to handle errors differently in different parts of the processing pipeline, you can always just call rjp twice and connect the two invocations with a Unix pipe.
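One possible shape for "global flag, per-processor interpretation" is a trait method that every processor implements. The sketch below is purely illustrative: `Policy`, `Outcome`, `Processor`, `Rename` and `Parse` are invented names, and the choice that a parse step treats `Keep` like `Skip` is an assumption, not a decision made in this thread.

```rust
/// Hypothetical global policy, set once on the command line.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Policy {
    Stop,
    Skip,
    Keep,
}

/// What a processor decides to do with one bad instance.
#[derive(Debug, PartialEq)]
enum Outcome {
    Abort,
    Drop,
    Emit(String),
}

trait Processor {
    /// Interpret the global policy for a bad instance given as raw text.
    fn on_bad_instance(&self, policy: Policy, raw: &str) -> Outcome;
}

struct Rename;
struct Parse;

impl Processor for Rename {
    fn on_bad_instance(&self, policy: Policy, raw: &str) -> Outcome {
        match policy {
            Policy::Stop => Outcome::Abort,
            Policy::Skip => Outcome::Drop,
            // A rename can pass the instance through unchanged.
            Policy::Keep => Outcome::Emit(raw.to_string()),
        }
    }
}

impl Processor for Parse {
    fn on_bad_instance(&self, policy: Policy, _raw: &str) -> Outcome {
        match policy {
            Policy::Stop => Outcome::Abort,
            // Assumption: an unparseable line cannot be kept as an instance,
            // so this processor treats Keep the same as Skip.
            Policy::Skip | Policy::Keep => Outcome::Drop,
        }
    }
}
```

The CLI surface stays at one option, while each processor decides locally what "keep" can sensibly mean for its own failure modes.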

> Also maybe they should not be flags but rather enums like --bad-instance {stop,skip,keep}?

I like that solution a lot.