Closed valterartur closed 11 months ago
OK, I understand now that this encoder is used for a jsonb array; I would appreciate any code example I can use to store plain jsonb.
I think I figured it out; at least this example finally worked for me. I'm going to give it a shot with df.mapInArrow.
import pyarrow as pa
import psycopg
import pgpq.encoders
import pgpq.schema
from pgpq import ArrowToPostgresBinaryEncoder

# Sample data: a single string column holding JSON text
batch = pa.RecordBatch.from_arrays(
    [
        pa.array(["{}", '{"foo":"bar"}', "{}"], type=pa.string()),
    ],
    schema=pa.schema([pa.field("json_list", pa.string())]),
)

# Encode the string column as jsonb rather than text
encoders = {
    "json_list": pgpq.encoders.StringEncoderBuilder.new_with_output(
        batch.schema.field("json_list"), pgpq.schema.Jsonb()
    ),
}
encoder = ArrowToPostgresBinaryEncoder.new_with_encoders(batch.schema, encoders)

# Build the binary COPY payload
buffer = bytearray()
buffer.extend(encoder.write_header())
buffer.extend(encoder.write_batch(batch))
buffer.extend(encoder.finish())

dsn = "DSN"
schema = 'schema'
t_name = 'data'
ddl = f"""
CREATE TABLE IF NOT EXISTS {schema}.{t_name}
(
    json_list jsonb
)
"""
with psycopg.connect(dsn) as conn:
    with conn.cursor() as cursor:
        cursor.execute(ddl)
        conn.commit()
        with cursor.copy(f"COPY {schema}.{t_name} FROM STDIN WITH (FORMAT BINARY)") as copy:
            copy.write(buffer)
        conn.commit()
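For reference, the payload that the encoder builds above follows PostgreSQL's documented binary COPY format (an 11-byte signature, 32-bit flags and header-extension fields, one length-prefixed tuple per row, and a -1 trailer; a jsonb datum is a 1-byte version tag 0x01 followed by the JSON text). A hand-rolled, stdlib-only sketch of a one-column jsonb payload, useful for sanity-checking what ends up on the wire (this is an illustration, not pgpq's code):

```python
import struct

def jsonb_copy_payload(json_texts):
    """Build a one-column binary COPY payload for a jsonb column by hand.

    Layout: 11-byte signature, int32 flags, int32 header-extension length,
    then per row an int16 field count and an int32 field length followed by
    the field bytes, and finally an int16 -1 end-of-data trailer.
    """
    buf = bytearray()
    buf += b"PGCOPY\n\xff\r\n\x00"               # fixed signature
    buf += struct.pack(">ii", 0, 0)              # flags, header extension length
    for text in json_texts:
        datum = b"\x01" + text.encode()          # jsonb: version byte + JSON text
        buf += struct.pack(">hi", 1, len(datum)) # 1 field, its byte length
        buf += datum
    buf += struct.pack(">h", -1)                 # end-of-data trailer
    return bytes(buf)

payload = jsonb_copy_payload(["{}", '{"foo":"bar"}'])
assert payload.startswith(b"PGCOPY\n\xff\r\n\x00")
```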
Thanks again for such an awesome tool! If you have any comments on this code, please let me know.
That looks good to me! Glad you were able to figure it out. Please feel free to suggest changes to docs, add an example, etc. Contributions or donations are welcome :)
Firstly, I want to extend my gratitude to the maintainers and contributors of the pgpq library. It's an invaluable tool, and I truly appreciate the work put into it.
I have been working with PySpark's mapInArrow functionality, processing data and intending to write it to a PostgreSQL table with a column of type jsonb. While the processing using Arrow functions works flawlessly, I am encountering issues during the write operation to the jsonb column. To better understand this, I isolated the problem and tried some basic operations with pgpq and PyArrow.
However, even in the simplified scenario, I am facing challenges writing to a jsonb column, though the same process works fine for a jsonb[] column.
Reproducible code
The above example fails for a jsonb column but succeeds for a jsonb[] column.
The exception is
Is this a known limitation or oversight? Or could I be misusing the serialization process? Any guidance would be immensely helpful.
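As a side note when debugging this kind of failure: PostgreSQL will reject invalid JSON during a binary COPY into a jsonb column, so validating the input strings client-side first helps distinguish encoder problems from data problems. A stdlib-only sketch (independent of pgpq; the helper name is hypothetical):

```python
import json

def check_json_column(values):
    """Return the values unchanged, raising early with the row index
    if any entry is not valid JSON (and would be rejected as jsonb)."""
    for i, text in enumerate(values):
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"row {i}: invalid JSON: {exc}") from exc
    return values

check_json_column(["{}", '{"foo":"bar"}', "{}"])  # passes silently
```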