crflynn / pbspark

protobuf pyspark conversion
MIT License

Binary column is parsed as string column #10

Closed: rolanddb closed 2 years ago

rolanddb commented 2 years ago

Hi, thank you for this useful library.

I have tried to convert the following protobuf:

message ImageRecord {
    uint64 event_id = 1;
    uint64 event_timestamp = 2;
    uint64 capture_timestamp = 3;
    bytes image = 5;
}

However, the bytes are converted to string, as shown with printSchema():

 |    |-- imageRecord: struct (nullable = true)
 |    |    |-- eventId: long (nullable = true)
 |    |    |-- eventTimestamp: long (nullable = true)
 |    |    |-- captureTimestamp: long (nullable = true)
 |    |    |-- image: string (nullable = true)

I managed to recover the bytes by applying an unbase64 transformation to the column, but I feel that this is a bug in the library.

crflynn commented 2 years ago

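I think this is because protobuf's MessageToDict converts bytes fields to base64-encoded strings, but perhaps the library should map them to Spark's BinaryType instead (ByteType is Spark's 8-bit integer type, so it would not fit here).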
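
For readers who hit the same behavior, here is a minimal sketch of the round trip described above, using only the Python standard library. It assumes the bytes field is base64-encoded as in the proto3 JSON mapping; the helper names and the sample bytes are hypothetical and are not part of pbspark or protobuf. Decoding the string this way is roughly what the unbase64 workaround does per row in Spark.

```python
import base64


def encode_bytes_field(value: bytes) -> str:
    """Mimic the JSON-style mapping: raw bytes become a base64 ASCII string."""
    return base64.b64encode(value).decode("ascii")


def decode_bytes_field(value: str) -> bytes:
    """Undo the encoding and get the original bytes back."""
    return base64.b64decode(value)


# Made-up sample payload standing in for the `image` field of ImageRecord.
raw_image = b"\x89PNG\r\n\x1a\n"

# Hypothetical stand-in for the dict produced from the message: the bytes
# field shows up as a str, which is why the Spark schema reports `string`.
as_dict = {"image": encode_bytes_field(raw_image)}

# Decoding restores the binary payload.
restored = decode_bytes_field(as_dict["image"])
```

The encoding is lossless, so the workaround is safe, but it costs an extra decode pass and leaves the schema reporting a string until the column is converted.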