Closed: ttelfer closed this issue 2 years ago
Can you post your simple_pb2.py file?
vscode ➜ /workspace $ /usr/local/bin/protoc --version
libprotoc 3.21.1
vscode ➜ /workspace $ cd pb
vscode ➜ /workspace/pb $ /usr/local/bin/protoc -I . --python_out=. --pyi_out=. --proto_path=. ./*.proto
simple_pb2.py
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: simple.proto
"""Generated protocol buffer code."""
from google.protobuf.internal import builder as _builder
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
# @@protoc_insertion_point(imports)
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x0csimple.proto\x12\x02pb\"@\n\rSimpleMessage\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\x10\n\x08quantity\x18\x02 \x01(\x03\x12\x0f\n\x07measure\x18\x03 \x01(\x02\x62\x06proto3')
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'simple_pb2', globals())
if _descriptor._USE_C_DESCRIPTORS == False:
  DESCRIPTOR._options = None
  _SIMPLEMESSAGE._serialized_start=20
  _SIMPLEMESSAGE._serialized_end=84
# @@protoc_insertion_point(module_scope)
I'm not able to reproduce this so far. I have a local environment using
protoc 21.1
java adoptopenjdk-8.0.275+1
python 3.10.5
and using the code and requirements.txt you've submitted here.
When I run it myself, I don't hit that pickle error; however, I do get another error that looks like this:
22/06/06 17:38:18 ERROR TaskSetManager: Task 2 in stage 5.0 failed 1 times; aborting job
Traceback (most recent call last):
File "/Users/flynn/projects/pbsparktest/code.py", line 40, in <module>
df_again.show()
File "/Users/flynn/projects/pbsparktest/.venv/lib/python3.10/site-packages/pyspark/sql/dataframe.py", line 494, in show
print(self._jdf.showString(n, 20, vertical))
File "/Users/flynn/projects/pbsparktest/.venv/lib/python3.10/site-packages/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/Users/flynn/projects/pbsparktest/.venv/lib/python3.10/site-packages/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/Users/flynn/projects/pbsparktest/.venv/lib/python3.10/site-packages/pbspark/_proto.py", line -1, in decoder
TypeError: expected bytes, bytearray found
That one is easy to fix. Once fixed, I'm able to run the code you've posted fine. I also tried removing the CloudPickleSerializer from the SparkContext, and it still worked.
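For reference, here is a minimal sketch of the kind of coercion that resolves that TypeError. The helper name is hypothetical and pbspark's actual decoder differs; this just shows the bytes/bytearray mismatch in isolation:

```python
def coerce_to_bytes(payload):
    """Coerce a bytearray (as Spark may hand the worker) to bytes,
    since newer protobuf versions reject bytearray in ParseFromString."""
    return bytes(payload) if isinstance(payload, bytearray) else payload

# bytearray input is converted; bytes input passes through untouched
assert coerce_to_bytes(bytearray(b"\x08\x01")) == b"\x08\x01"
assert coerce_to_bytes(b"\x08\x01") == b"\x08\x01"
```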
Would it be possible to set up a repo with an example project and some commands that would reproduce easily?
Here you go:
I think I figured it out. The problem is how you are generating your proto files and then referencing them in Python.
Note that in this project, when we generate using protoc in the Makefile we run something like this
poetry run protoc -I . --python_out=. --mypy_out=. --proto_path=. ./example/*.proto
We run this from the root directory and the resulting pb2 file creates messages with a reference to the fully qualified module here: https://github.com/crflynn/pbspark/blob/387eb74fc578145077117a976234e758510d5f5c/example/example_pb2.py#L23
When you run your protoc, you are running it from within the pb folder, which results in the fully qualified module being incorrect here: https://github.com/ttelfer/spark_proto/blob/02580c88e16245337ba2fc246e2bc67b5c7b3612/pb/simple_pb2.py#L19
That differs from how it's referenced in your business logic here: https://github.com/ttelfer/spark_proto/blob/02580c88e16245337ba2fc246e2bc67b5c7b3612/test_proto.py#L6
Note the simple_pb2 vs pb.simple_pb2, whereas in the pbspark repo both are example.example_pb2.
This reference is important when it comes to pickling. Since the module reference is wrong, Spark can't ship the pyspark UDF down to the workers via pickling, because the pickler can't reference the message class the same way you would import it.
If you change that module reference to pb.simple_pb2, or rather regenerate your pb2 files by invoking protoc from the root directory, I think it should work. You will probably run into the bytearray bug I found above, though, due to the newer version of protobuf. If you upgrade to pbspark 0.5.1 it should work.
@crflynn Running protoc as you suggested fixed the problem. Thank you for taking a look.
Can I buy you a coffee?
No problem, and thanks for helping out with posting your project code.
I appreciate the offer; you should buy one for a friend of yours instead or consider making a donation to an organization like the EFF.
I've been having some problems getting pbspark to work.
I get the following error:
Using the following code:
simple.proto
requirements.txt
protoc version 21.1