Closed · b00033811 closed this issue 3 years ago
Hi @b00033811 !
Thanks for reaching out. I would like to understand more about your use case.
> Using pyarrow, reading the partition "day=18" takes 8 seconds; but with awswrangler it takes up to 27 seconds.
Are you passing `s3://.../year=2020/month=10/day=18/`, or are you passing `s3://.../` and filtering the partitions with the `partition_filter` argument?

And what about the entire dataset? Is the 40 MB the aggregated size of all files under `year=2020/month=10/day=18`, or is it the size of each file?
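For clarity, the two call styles in question can be sketched like this. The bucket and prefix names are placeholders, and the actual read calls are commented out since they require live S3 access:

```python
# Two ways to read one day of a Hive-partitioned dataset with awswrangler.
# "s3://bucket/table/" is a placeholder path, not the reporter's real one.

# Style 1: point directly at the partition prefix.
# df = wr.s3.read_parquet("s3://bucket/table/year=2020/month=10/day=18/", dataset=True)

# Style 2: point at the dataset root and prune with a partition_filter callable,
# which receives a dict of Hive partition values (as strings) per file.
def day_filter(partitions):
    """Keep only files under the day=18 partition."""
    return (
        partitions["year"] == "2020"
        and partitions["month"] == "10"
        and partitions["day"] == "18"
    )

# df = wr.s3.read_parquet("s3://bucket/table/", dataset=True, partition_filter=day_filter)
```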
> Is reading data as parquet (snappy compression) and doing simple ETL then upserting into redshift a bad approach?
It's a good approach; we use it all the time.
> ps: I'd rather use awswrangler because it parses the dtypes of the dataframe more accurately, unlike pyarrow.
Wrangler uses PyArrow under the hood only for low-level (file-level) operations; at the dataset level we made several different choices that result in different behaviors. Each choice is a trade-off, and it seems to me that you are probably in a situation where we opted for better data types over performance. But I don't have much context yet; let's try to figure out a more precise answer.
Closing due to inactivity.
Hi @igorborgest!
I would like to reopen this issue, as we have also seen quite unsatisfactory performance using the `read_parquet` function. Our setup and data are described below.
We're passing `s3://.../` and using the `partition_filter`. The function call looks like this:

```python
wr.s3.read_parquet(
    path,
    dataset=True,
    partition_filter=filter(),
)
```
I've run a couple of tests to verify whether there would be any speed improvement if I passed a list of prefixes for the function to combine instead of using the `partition_filter`, but the gain was marginal. Enabling `use_threads=True` gave no improvement. Overall it takes around 13 minutes to collect all files... this is just too long. Downloading them with `aws sync` takes a few seconds.
Our main use case for operating on streams is in AWS Batch. We have some data loaders that use the data wrangler when we train our ML model in AWS Batch. We realized after some time that the main contributor to the extended training time is the part where the data is collected from AWS using the data wrangler (primarily `wr.s3.read_parquet`). Please also note that we're not talking about big data here. Most of our use cases are like the one described above.
At the moment we're wondering whether this can be optimized, or if we should move away from the streaming approach and simply download the data onto the container for model training. Could you give some advice? What's your take on that?
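The "download then read locally" alternative mentioned above could be sketched roughly like this. The helper names, paths, and the reliance on the AWS CLI and pandas are assumptions for illustration, not a recommended implementation:

```python
import pathlib
import subprocess

def build_sync_cmd(s3_prefix: str, local_dir: str) -> list:
    """Build the AWS CLI command that mirrors an S3 prefix to local disk."""
    return ["aws", "s3", "sync", s3_prefix, local_dir]

def sync_and_read(s3_prefix: str, local_dir: str):
    """Download the prefix with `aws s3 sync`, then read the parquet files locally."""
    subprocess.run(build_sync_cmd(s3_prefix, local_dir), check=True)
    import pandas as pd  # local reads; no per-file S3 round-trips
    files = sorted(pathlib.Path(local_dir).rglob("*.parquet"))
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```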
Hi @konradsemsch ! Could you open a new issue, please?
This poor level of performance was never observed in our tests... We have to investigate, it is probably some corner case...
I've been facing extremely poor performance reading a parquet dataset stored in S3.
Partitioning style: `year=2020/month=10/day=18/hour=00` · Compression: snappy · Dataset size: 40 MB
The data is coming from a Kafka Connect S3 sink; there are no _metadata files. With debugging enabled I can see awswrangler accessing each file sequentially with significant latency. Using pyarrow, reading the partition "day=18" takes 8 seconds, but with awswrangler it takes up to 27 seconds. I've tried running the same script inside the VPC on an EC2 instance, but I get the same results. Basically, what I'm trying to do is a batch job on the day partition and an upsert into Redshift. Is reading data as parquet (snappy compression) and doing simple ETL then upserting into Redshift a bad approach?
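The 8 s vs. 27 s comparison above can be measured with a small timing harness like the one below; the helper and the commented-out pyarrow call are illustrative assumptions, not the reporter's actual benchmark:

```python
import time

def timed(fn):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Example usage against S3 (requires pyarrow with S3 support; path is a placeholder):
# import pyarrow.parquet as pq
# table, secs = timed(lambda: pq.read_table("s3://bucket/table/year=2020/month=10/day=18/"))
```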
My team and I have been struggling with this issue for a while, any help or recommendation is greatly appreciated!
ps: I'd rather use awswrangler because it parses the dtypes of the dataframe more accurately, unlike pyarrow.