feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Remote registry client does not map application errors #4392

Closed dmartinol closed 1 week ago

dmartinol commented 1 month ago

Expected Behavior

A feast application has a registry of type remote connected to a registry server. The feast client invokes an API on the feast registry that throws an error, e.g. store.get_feature_service("foo"), which raises FeatureServiceNotFoundException from the registry server because foo is not a registered feature service.

The expectation is that the method invoked by the client raises the FeatureServiceNotFoundException and the application can detect it with:

try:
    fs = store.get_feature_service("foo")
except FeatureServiceNotFoundException:
    ...

Current Behavior

The registry server always raises an _InactiveRpcError instead.

This bug causes unexpected errors when the client invokes services whose implementation needs such exceptions to execute fallback logic, like _get_feature_views_to_materialize.

Steps to reproduce

See Expected Behavior section.

Possible Solution

The registry server can wrap the original exception in an error status message that the remote registry client can catch and translate back into the original exception.
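One way to picture this proposal is as a round trip between a serialized error detail and the original exception. A minimal, dependency-free sketch (the FeatureServiceNotFoundException stand-in and the helper names are assumptions for illustration, not Feast APIs):

```python
import json

class FeatureServiceNotFoundException(Exception):
    """Stand-in for the real Feast exception (lives in feast.errors)."""

def wrap_exception(e: Exception) -> str:
    # Server side: serialize only the type name and message, nothing
    # Python-specific like a traceback.
    return json.dumps({"type": type(e).__name__, "message": str(e)})

def unwrap_exception(detail: str) -> Exception:
    # Client side: map the type name back to a known exception class,
    # defaulting to a generic Exception for unknown types.
    known = {"FeatureServiceNotFoundException": FeatureServiceNotFoundException}
    m = json.loads(detail)
    return known.get(m["type"], Exception)(m["message"])

detail = wrap_exception(FeatureServiceNotFoundException("foo"))
rebuilt = unwrap_exception(detail)
```

The point of the round trip is that the client re-raises the same exception type the server originally caught, so application-level except clauses keep working across the remote boundary.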

tokoko commented 1 month ago

@dmartinol (This might be more relevant for online store/offline store, but...) Do you think the goal here should be to forward all exceptions or only the ones defined by feast? If we forward all, remote implementations will in a way become indistinguishable from concrete implementations, but on the other hand I think we might have a hard time raising the exact same external exceptions on the client side when some of those libraries might not even be present in the client environment.

I think we should aim to "re-raise" feast exceptions, but suppress all others behind some new exception type.

dmartinol commented 1 month ago

@tokoko Yes, the problem is broader than described in the issue. Ideally, we should remap all exceptions across all servers to ensure the client code remains agnostic of the underlying implementation. Not an easy task, also considering the number of exceptions that have been modelled, but worth investing some time in. IMO, if we find a general way to "serialize" and "deserialize" them, it'd be great, at least for feast exceptions and some core errors like ValueError and, of course, PermissionError.

dmartinol commented 2 weeks ago

@tokoko what about tackling this one, maybe starting from the registry server? Any idea/pattern for how to transfer and rebuild the exceptions? (I wanted to write "serialize and deserialize", but it could be misleading)

tokoko commented 2 weeks ago

@dmartinol I haven't looked into it too much, but I agree, "serialize and deserialize" is definitely the wrong way to go about it. I'm thinking we should use richer grpc error messages (don't know exactly what that means, but it sounds suitable) to transfer two string values (exception type and exception message). I think we should start by limiting transferred exceptions exclusively to feast exception types and reassess later if there are any other core exceptions that also need to be treated in a similar fashion.

One reason for not trying to pass any more details (trace + other exceptions) is that at least in case of a feature server I think we should assume that a possible future server implementation might not be python-based, so binding error information to python too much can be a headache.

dmartinol commented 2 weeks ago

Agree on the error model; it has to be minimalistic and compatible with all the communication protocols used by the Feast servers. Are you already working on this issue, or do you want any of us to contribute?

tokoko commented 2 weeks ago

No, not working on it. Feel free to take it up, let me unassign myself.

dmartinol commented 2 weeks ago

Looked at the grpc options, and I think we have an int value with a predefined status code that we cannot customize, plus a detail label. Because of these limitations, we can try to send JSON in the detail label with all the data required to rebuild the original error. WDYT?

...
except Exception as e:
    m = {
        "type": type(e).__name__,
        "message": str(e),
    }
    context.abort(
        grpc.StatusCode.INTERNAL,
        json.dumps(m),
    )
...

To optimize it, we can limit the except clause to only the desired errors (Feast errors + selected builtins) and also try to map the code to a meaningful value, at least when it makes sense (e.g. PermissionError can be grpc.StatusCode.PERMISSION_DENIED, and so on). Applying a similar "serialization" could also help the other comm protocols (REST and Arrow).
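On the client side, the remote registry client would need the inverse step: parse the JSON payload out of the RPC error's detail and re-raise the original exception type. A hedged sketch, simulating the RPC error object instead of importing grpc (all names here are assumptions; real code would catch grpc.RpcError and call e.details()):

```python
import json

class FakeRpcError(Exception):
    """Simulates grpc.RpcError for illustration only."""
    def __init__(self, details: str):
        super().__init__(details)
        self._details = details

    def details(self) -> str:
        return self._details

class FeatureServiceNotFoundException(Exception):
    """Stand-in for the real Feast exception."""

# Assumed mapping from serialized type names to client-side classes.
EXCEPTION_TYPES = {"FeatureServiceNotFoundException": FeatureServiceNotFoundException}

def reraise_original(rpc_error) -> None:
    # Parse the JSON payload from the error detail; if it is not a wrapped
    # Feast error, propagate the RPC error unchanged.
    try:
        m = json.loads(rpc_error.details())
    except (ValueError, TypeError):
        raise rpc_error
    exc_class = EXCEPTION_TYPES.get(m.get("type"))
    if exc_class is None:
        raise rpc_error
    raise exc_class(m.get("message"))

try:
    reraise_original(
        FakeRpcError(json.dumps(
            {"type": "FeatureServiceNotFoundException", "message": "foo"}
        ))
    )
except FeatureServiceNotFoundException as e:
    caught = e
```

With this translation in the remote registry client, application code can go back to catching FeatureServiceNotFoundException directly, as described in the Expected Behavior section.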

dmartinol commented 2 weeks ago

Started draft PR https://github.com/feast-dev/feast/pull/4458 to verify we are in sync. @tmihalac @lokeshrangineni once we validate the design, we can proceed in parallel for the REST and Arrow servers.

tokoko commented 2 weeks ago

@dmartinol yeah, json sounds good to me, certainly better than a single concatenated string.

btw, we should probably create a FeastException type and rewrite all existing exceptions to extend it. That way we would be able to simply do except FeastException as e in the try block.

dmartinol commented 2 weeks ago

I was thinking the same, and it could also have to_grpc_status_code and to_http_status_code methods to override instead of a global function. BUT, some errors are not in the feast package. Should we add a Feast version of builtins.PermissionError?
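A minimal sketch of what such a base class could look like, with overridable status-code hooks (the FeastError name follows the convention settled later in this thread; numeric literals stand in for grpc.StatusCode members to keep the sketch dependency-free):

```python
class FeastError(Exception):
    """Hypothetical common base for all Feast exceptions."""

    def to_grpc_status_code(self) -> int:
        # Default: INTERNAL (13 in the gRPC status code spec).
        return 13

    def to_http_status_code(self) -> int:
        # Default: HTTP 500 Internal Server Error.
        return 500

class FeatureServiceNotFoundException(FeastError):
    """Example subclass overriding the defaults with specific codes."""

    def to_grpc_status_code(self) -> int:
        # NOT_FOUND (5 in the gRPC status code spec).
        return 5

    def to_http_status_code(self) -> int:
        return 404
```

Putting the mapping on the exception classes keeps the servers free of a central if/elif chain: each server just calls e.to_grpc_status_code() or e.to_http_status_code() on whatever FeastError it catches.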

tokoko commented 2 weeks ago

yeah, I think so. We should switch to FeastPermissionException (or Error... whatever convention existing exceptions use) that will extend builtins.PermissionError.
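A sketch of that idea: with multiple inheritance, existing except PermissionError handlers keep working, while servers can still catch the whole Feast hierarchy with a single except FeastError (class names assumed, mirroring the suggestion above):

```python
class FeastError(Exception):
    """Minimal stand-in for the proposed common Feast base exception."""

class FeastPermissionError(FeastError, PermissionError):
    """Feast version of builtins.PermissionError.

    Extending both classes means the exception participates in the Feast
    hierarchy *and* remains a PermissionError for pre-existing handlers.
    """

try:
    raise FeastPermissionError("forbidden")
except PermissionError as e:  # old-style handler still matches
    caught = e
```

The MRO here is valid because FeastError derives directly from Exception, so Python can linearize FeastPermissionError -> FeastError -> PermissionError -> OSError -> Exception.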

dmartinol commented 2 weeks ago

or Error... whatever convention existing exceptions use)

Some numbers: 8 Feast classes are named like *Exception and 22 like *Error. Considering that *Error seems to be the standard Python naming pattern, I would go with FeastError and, of course, FeastPermissionError (see also PEP 8).

tokoko commented 2 weeks ago

Agreed

dmartinol commented 1 week ago

@tokoko Can we keep this open to apply the same error mapping to the remaining servers, or do you prefer dedicated GH issues instead?

tokoko commented 1 week ago

Dedicated issues would be better imho; they are almost completely different problems to solve from an implementation perspective.

dmartinol commented 1 week ago

@tmihalac @lokeshrangineni could you pls raise server-specific issues to track the error mapping problem?