How to: Parquet files - Githubissues

K0nkere / DL_Dice-detection-project

DnD dice detection with CNN and transfer learning / Project for ML Bookcamp

How to: Parquet files #8

Open K0nkere opened 1 year ago

K0nkere commented 1 year ago

Read parquet to pandas in batch mode

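import pyarrow as pa
from pyarrow.parquet import ParquetFile

pf = ParquetFile('../ny-taxi-data/yellow_tripdata_2021-01.parquet')

# iter_batches yields pyarrow RecordBatch objects of up to batch_size rows.
# A for loop stops cleanly at the end of the file, whereas calling next()
# inside `while True` ends with an unhandled StopIteration.
for batch in pf.iter_batches(batch_size=100000):
    df = pa.Table.from_batches([batch]).to_pandas()
    ...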

K0nkere commented 1 year ago

Read parquet from S3 with an S3 Select query (here limited to 5 rows)

import awswrangler as wr

df = wr.s3.select_query(
        sql="SELECT * FROM s3object s limit 5",
        path="s3://filepath",
        input_serialization="Parquet",
        input_serialization_params={},
        use_threads=True,
)

K0nkere commented 1 year ago

Read full parquet from file to pandas

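from pyarrow.parquet import ParquetFile

pf = ParquetFile('../ny-taxi-data/yellow_tripdata_2021-01.parquet')
# read() loads the whole file into memory as a pyarrow Table
data = pf.read().to_pandas()
data.head()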

K0nkere commented 1 year ago

Read parquet to pandas if there are errors with the standard read

import pyarrow.parquet as pq

# year, color and month must already be defined
path = f"{year}/{color}_tripdata_{year}-{month:02d}.parquet"

# safe=False lets to_pandas perform unsafe casts instead of raising an error
d = pq.ParquetFile(path).read().to_pandas(safe=False)

# or

d = pq.read_table(path).to_pandas(safe=False)