Closed abhishekrb19 closed 2 years ago
My initial thought is that self-referencing definitions just won't work here. If the protobuf definition allows for arbitrary depth, but Spark schemas don't, I don't see how we could reasonably infer a schema from the definition alone.
@crflynn, yes, that was my initial thought as well. Would it be reasonable for pbspark to terminate the recursive schema inference at a configurable depth (defaulting to something reasonable, say 2 or 3)? Alternatively, it could provide an option to skip the recursive protobuf bits altogether; this may seem like a hack, but it would work for my use case at least, because the protobuf data is guaranteed not to be recursive.
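For what it's worth, the depth-capped idea can be sketched in plain Python. This is not pbspark's API: `infer_schema` is a hypothetical name, and nested dicts stand in for message types (strings for scalar Spark types) so the sketch runs without pyspark installed.

```python
# Hypothetical sketch of depth-capped schema inference. Nested dicts stand
# in for message types; strings stand in for scalar Spark types.
def infer_schema(fields, max_depth=3, _depth=0):
    schema = {}
    for name, ftype in fields.items():
        if isinstance(ftype, dict):  # a nested (possibly recursive) message
            if _depth + 1 >= max_depth:
                continue  # cap reached: drop the recursive branch
            schema[name] = infer_schema(ftype, max_depth, _depth + 1)
        else:
            schema[name] = ftype  # scalar field
    return schema

# A self-referential "message": a node holding a value and a child node.
node = {"value": "int64"}
node["child"] = node

print(infer_schema(node, max_depth=3))
# → {'value': 'int64', 'child': {'value': 'int64', 'child': {'value': 'int64'}}}
```

The recursion bottoms out at the cap instead of following the self-reference forever, which is the behavior the configurable-depth option would need.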
It's been a while but I think a custom serializer and deserializer for that message type would do the trick. You'd have to specify a parsing function and the return type schema for that message in particular (basically your desired depth). There is an example in the readme.
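As a hedged illustration of that idea (not pbspark's actual registration API; see the readme for the real call), a custom serializer can collapse the recursive message into a flat value with a known return type, for example a JSON string pruned at a fixed depth:

```python
import json

def serialize_pruned(msg, max_depth=2, _depth=0):
    """Collapse a (possibly self-referential) message into a JSON string,
    dropping nested sub-messages beyond max_depth so serde terminates.
    Illustrative only; messages are modeled as plain dicts here."""
    out = {}
    for key, value in msg.items():
        if isinstance(value, dict):
            if _depth + 1 < max_depth:
                out[key] = json.loads(serialize_pruned(value, max_depth, _depth + 1))
        else:
            out[key] = value
    return json.dumps(out)

msg = {"value": 1}
msg["child"] = msg  # recursive reference
print(serialize_pruned(msg, max_depth=3))
# → {"value": 1, "child": {"value": 1, "child": {"value": 1}}}
```

With this approach the return type registered for the message would just be a string column, sidestepping schema inference for the recursive part entirely.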
You can also refer to the timestamp functions that handle datetime serde here: https://github.com/crflynn/pbspark/blob/f2da85dda30d585d00869ab68fae5e8c460981dd/pbspark/_timestamp.py#L20
I took another look at this problem and added an example and test. You can take a look at them here: https://github.com/crflynn/pbspark/blob/fd75e9d46706bc9b48e2bc0e058ad0e32d95604f/tests/fixtures.py#L14 https://github.com/crflynn/pbspark/blob/fd75e9d46706bc9b48e2bc0e058ad0e32d95604f/tests/test_proto.py#L221
Were you able to resolve this?
I am closing and labeling this wontfix for now.
Hello,

Thanks for developing the `pbspark` library. This seems quite useful for converting protobuf on the wire to dataframes. I have some recursive proto definitions of the form (a simplified example):

A minimal code example that shows the issue:
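A hypothetical recursive definition of the kind described might look like this (illustrative only; `TreeNode` and its fields are not the reporter's actual names):

```protobuf
syntax = "proto3";

// Illustrative self-referencing message.
message TreeNode {
  int64 value = 1;
  TreeNode child = 2;  // recursive reference back to the same message type
}
```

The accompanying repro would presumably feed serialized messages of this type to pbspark's `from_protobuf` on a dataframe column.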
The above code barfs with a `RecursionError` at `from_protobuf` while trying to infer the Spark schema from the recursive protobuf schema:

Wondering how to get around this, @crflynn? Thanks!
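The failure mode can be reproduced without Spark at all: any unbounded walk of a self-referential definition hits Python's recursion limit. A sketch with hypothetical names (`naive_infer` is not pbspark code, just a stand-in for depth-unbounded inference):

```python
import sys

# Stand-in for naive schema inference: recurse into every nested message
# with no depth cap. On a self-referential definition this never terminates,
# so Python raises RecursionError at its recursion limit.
def naive_infer(fields):
    return {name: naive_infer(ftype) if isinstance(ftype, dict) else ftype
            for name, ftype in fields.items()}

node = {"value": "int64"}
node["child"] = node  # self-reference, like a recursive proto message

limit = sys.getrecursionlimit()
sys.setrecursionlimit(100)  # small limit so the failure is quick
try:
    naive_infer(node)
    raised = False
except RecursionError:
    raised = True
finally:
    sys.setrecursionlimit(limit)

print("RecursionError raised:", raised)
# → RecursionError raised: True
```

This is why either a depth cap or a custom serde for the recursive message type is needed to make inference terminate.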