aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

Poor performance when reading parquet dataset from s3 #429

Closed b00033811 closed 3 years ago

b00033811 commented 4 years ago

I've been facing extremely poor performance reading a parquet dataset stored in s3.

Partitioning style: year=2020/month=10/day=18/hour=00
Compression: snappy
Dataset size: 40 MB

The data is coming from a Kafka Connect S3 sink, and there are no _metadata files. With debugging enabled I can see awswrangler accessing each file sequentially with significant latency. Using pyarrow, reading the partition "day=18" takes 8 seconds, but with awswrangler it takes up to 27 seconds. I've tried running the same script inside the VPC on an EC2 instance, but I get the same results. Basically, what I'm trying to do is a batch job over the day partition followed by an upsert into Redshift. Is reading the data as Parquet (snappy compression), doing simple ETL, and then upserting into Redshift a bad approach?
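To make the intended flow concrete, here is a rough sketch of what we are doing (bucket, paths, table, connection name, key column, and IAM role are placeholders; it assumes an awswrangler release that ships wr.s3.read_parquet and wr.redshift.copy with an upsert mode):

```python
import awswrangler as wr

# Read one day partition of the Kafka Connect S3 sink output
# (placeholder bucket/prefix).
df = wr.s3.read_parquet(
    path="s3://my-bucket/topic/year=2020/month=10/day=18/",
    dataset=True,       # treat the prefix as a partitioned dataset
    use_threads=True,
)

# ... simple ETL on df ...

# Upsert into Redshift: the dataframe is staged on S3 and COPYed in.
# Connection name, schema, table, primary key, and IAM role are placeholders.
con = wr.redshift.connect("my-glue-redshift-connection")
wr.redshift.copy(
    df=df,
    path="s3://my-bucket/staging/",
    con=con,
    schema="public",
    table="events",
    mode="upsert",
    primary_keys=["event_id"],
    iam_role="arn:aws:iam::123456789012:role/redshift-copy-role",
)
con.close()
```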

My team and I have been struggling with this issue for a while, any help or recommendation is greatly appreciated!

PS: I'd rather use awswrangler because it parses the dtypes of the dataframe more accurately than pyarrow.

igorborgest commented 3 years ago

Hi @b00033811 !

Thanks for reaching out. I would like to understand more about your use case.

    Using pyarrow, reading the partition "day=18" takes 8 seconds, but with awswrangler it takes up to 27 seconds.

  1. Are you passing the path as s3://.../year=2020/month=10/day=18/, or are you passing s3://.../ and filtering the partitions with the partition_filter argument? (See the sketch after these questions for both approaches.)
  2. How many files do you have under year=2020/month=10/day=18? And under the entire dataset?
  3. Is the 40 MB the aggregate size of all files under year=2020/month=10/day=18, or is it the size of each file?
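For reference, a minimal sketch of the two ways of reading that day (bucket and prefixes are placeholders; the partition_filter callable receives each partition's key/value pairs as strings):

```python
import awswrangler as wr

# 1. Path pointing directly at the partition prefix.
df_direct = wr.s3.read_parquet(
    path="s3://my-bucket/topic/year=2020/month=10/day=18/",
    dataset=True,
)

# 2. Path pointing at the dataset root, filtering partitions with a callable
#    that receives the partition values as strings.
df_filtered = wr.s3.read_parquet(
    path="s3://my-bucket/topic/",
    dataset=True,
    partition_filter=lambda x: (
        x["year"] == "2020" and x["month"] == "10" and x["day"] == "18"
    ),
)
```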

    Is reading the data as Parquet (snappy compression), doing simple ETL, and then upserting into Redshift a bad approach?

It's a good approach; we use it all the time.

    PS: I'd rather use awswrangler because it parses the dtypes of the dataframe more accurately than pyarrow.

Wrangler uses PyArrow under the hood only for low-level operations (at the file level); at the dataset level we made several different choices that result in different behaviors. Each choice is a trade-off, and it seems to me that you are probably in a situation where we opted for better data types over raw performance. But I don't have much context yet, so let's try to work out a more precise answer.
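To help narrow it down, a quick way to time the two code paths on the same partition (the path is a placeholder; this assumes a pyarrow build whose pyarrow.parquet.read_table accepts an s3:// URI for a partitioned directory):

```python
import time

import awswrangler as wr
import pyarrow.parquet as pq

path = "s3://my-bucket/topic/year=2020/month=10/day=18/"  # placeholder

start = time.perf_counter()
df_wr = wr.s3.read_parquet(path=path, dataset=True, use_threads=True)
print(f"awswrangler: {time.perf_counter() - start:.1f}s")
print(df_wr.dtypes)

start = time.perf_counter()
df_pa = pq.read_table(path).to_pandas()
print(f"pyarrow:     {time.perf_counter() - start:.1f}s")
print(df_pa.dtypes)
```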

igorborgest commented 3 years ago

Closing due to inactivity.

konradsemsch commented 3 years ago

Hi @igorborgest!

I would like to reopen this issue, as we have also seen quite unsatisfactory performance using the read_parquet function. Here is our setup and data:

wr.s3.read_parquet(
    path,
    dataset=True,
    partition_filter=filter(),
)

I've run a couple of tests to check whether there would be any speed improvement if I passed a list of prefixes to the function instead of using partition_filter, but the gain was marginal. Enabling use_threads=True gave no improvement. Overall it takes around 13 minutes to collect all the files, which is just too long; downloading them with aws s3 sync takes a few seconds.
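For completeness, the prefix-list variant I tried looked roughly like this (prefixes are placeholders): reading each partition prefix separately and concatenating the results instead of filtering from the dataset root.

```python
import awswrangler as wr
import pandas as pd

# Placeholder partition prefixes to combine.
prefixes = [
    "s3://my-bucket/dataset/year=2020/month=10/day=17/",
    "s3://my-bucket/dataset/year=2020/month=10/day=18/",
]

df = pd.concat(
    (wr.s3.read_parquet(path=p, dataset=True, use_threads=True) for p in prefixes),
    ignore_index=True,
)
```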

Our main use case for operating on streams is in AWS Batch. We have data loaders that use the data wrangler when we train our ML model in AWS Batch. We realized after some time that the main contributor to the extended training time is the part where the data is collected from AWS with the data wrangler (primarily wr.s3.read_parquet). Please also note that we're not talking about big data here; most of our use cases look like the one described above.

At the moment we're wondering whether this can be optimized, or whether we should move away from the streaming approach and simply download the data onto the container for model training. Could you give some advice? What's your take on that?
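The download-first alternative we have in mind is roughly this (prefix and local directory are placeholders; it relies on the AWS CLI being available in the container):

```python
import subprocess

import pandas as pd

s3_prefix = "s3://my-bucket/dataset/year=2020/month=10/day=18/"  # placeholder
local_dir = "/tmp/day18"                                         # placeholder

# Pull the partition down with the AWS CLI (the step that takes a few seconds),
# then read the local files with pandas/pyarrow.
subprocess.run(["aws", "s3", "sync", s3_prefix, local_dir], check=True)
df = pd.read_parquet(local_dir)
```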

igorborgest commented 3 years ago

Hi @konradsemsch! Could you open a new issue, please?

We have never observed this level of performance in our tests... We have to investigate; it is probably some corner case.

konradsemsch commented 3 years ago

https://github.com/awslabs/aws-data-wrangler/issues/644