Open K0nkere opened 1 year ago
Read parquet from s3 in batch mode
import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object s LIMIT 5",
    path="s3://filepath",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
Read full parquet from file to pandas

from pyarrow.parquet import ParquetFile

pf = ParquetFile('../ny-taxi-data/yellow_tripdata_2021-01.parquet')
data = pf.read().to_pandas()
data.head()
Read parquet to pandas if there are errors with the standard read
import pandas as pd
import pyarrow as pa
from pyarrow.parquet import ParquetFile

# year, color, and month are placeholders for the dataset being loaded
file = ParquetFile(f"{year}/{color}_tripdata_{year}-{month:02d}.parquet")
d = file.read().to_pandas(safe=False)  # safe=False relaxes strict type checks during conversion
# or
d = pa.parquet.read_table(f"{year}/{color}_tripdata_{year}-{month:02d}.parquet").to_pandas(safe=False)
Read parquet to pandas in batch mode