ddf-project / DDF

Distributed DataFrame: Productivity = Power x Simplicity For Scientists & Engineers, on any Data Engine
http://ddf.io
Apache License 2.0

[PE-2058] Improve load from S3 speed by using s3a instead of s3n #361

Closed lebinh closed 8 years ago

lebinh commented 8 years ago

Description and related tickets, documents

Replace the s3n file system with s3a, which is faster and is stable enough as of Hadoop 2.7 (https://wiki.apache.org/hadoop/AmazonS3).
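For readers who want a concrete picture of what switching connectors involves on the caller's side, here is a minimal sketch. It is not the code from this PR: the helper names `to_s3a_uri` and `to_s3a_conf` are hypothetical. It assumes the credential property names used by Hadoop 2.7's S3 connectors (`fs.s3n.awsAccessKeyId` / `fs.s3n.awsSecretAccessKey` for s3n, `fs.s3a.access.key` / `fs.s3a.secret.key` for s3a); check them against the Hadoop version actually deployed.

```python
# Hypothetical helpers, for illustration only; not taken from the DDF code base.

# Assumed credential property names for the two connectors (Hadoop 2.7).
S3N_TO_S3A_KEYS = {
    "fs.s3n.awsAccessKeyId": "fs.s3a.access.key",
    "fs.s3n.awsSecretAccessKey": "fs.s3a.secret.key",
}


def to_s3a_uri(uri):
    """Rewrite an s3n:// URI to s3a://; return any other URI unchanged."""
    prefix = "s3n://"
    if uri.lower().startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


def to_s3a_conf(conf):
    """Return a copy of a Hadoop-style property dict with s3n credential keys renamed."""
    return {S3N_TO_S3A_KEYS.get(key, key): value for key, value in conf.items()}


if __name__ == "__main__":
    print(to_s3a_uri("s3n://my-bucket/data/part-0000.csv"))
    print(to_s3a_conf({"fs.s3n.awsAccessKeyId": "EXAMPLE_KEY"}))
```

Bucket layout and object keys stay the same under this sketch; only the URI scheme and the configuration property names change.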

Performance comparison

Create DDF from S3 with PyClient image

Create dataset from S3 in BigApps image

Related:

Reviewers:

N/A, as s3a should be a drop-in replacement for s3n.

PR Progress

Make sure all checkboxes below are checked before merging

PangZhi commented 8 years ago

@lebinh will this break other parts, like Redshift and S3?

lebinh commented 8 years ago

So far, no. All tests passed in the PyClient test run (2 cases failed because of a missing data file on HDFS for that instance). The unit tests for DDF passed on my machine but failed on Jenkins; I'm not sure why yet.

hai-adatao commented 8 years ago

Hm... this looks familiar. I think @Huandao0812 may know something about this; he encountered it once.