[PE-2058] Improve load from S3 speed by using s3a instead of s3n

lebinh commented 8 years ago

Description and related tickets, documents

Replace the s3n file system with s3a, as it is faster and stable enough as of Hadoop 2.7 (https://wiki.apache.org/hadoop/AmazonS3).
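For reviewers less familiar with the two connectors, here is a minimal sketch of what the switch means on the configuration side. The property names are the standard Hadoop ones for s3n and s3a credentials; the helper `to_s3a_conf` and the sample values are hypothetical and are not code from this PR.

```python
# Hypothetical helper for illustration only; not part of this PR.
# Maps the s3n credential properties to their s3a counterparts.
S3N_TO_S3A_KEYS = {
    "fs.s3n.awsAccessKeyId": "fs.s3a.access.key",
    "fs.s3n.awsSecretAccessKey": "fs.s3a.secret.key",
}


def to_s3a_conf(conf):
    """Return a copy of conf with s3n credential keys renamed to s3a keys.

    Keys that are not s3n credential properties are copied unchanged.
    """
    return {S3N_TO_S3A_KEYS.get(key, key): value for key, value in conf.items()}


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    old_conf = {
        "fs.s3n.awsAccessKeyId": "EXAMPLE_KEY_ID",
        "fs.s3n.awsSecretAccessKey": "EXAMPLE_SECRET",
    }
    print(to_s3a_conf(old_conf))
```

The credential property names differ between the two connectors, so existing s3n settings would presumably need to be carried over like this wherever they are defined.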

Performance comparison

Create DDF from S3 with PyClient

Create dataset from S3 in BigApps

https://adatao.atlassian.net/browse/PE-2058

Reviewers:

Main reviewer: @PangZhi @Huandao0812
Observers: @hai-adatao, @SeineRiver

Breaking changes & backward compatibility issues

N/A, as s3a should be a drop-in replacement for s3n.
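If callers still pass `s3n://` paths, one way to keep them working is to rewrite the scheme before loading. This is only a sketch of that idea; the function name `to_s3a_uri` and the example bucket and path are hypothetical and do not come from this PR.

```python
# Hypothetical helper for illustration only; not part of this PR.
def to_s3a_uri(uri):
    """Rewrite an s3n:// URI to s3a://, leaving any other URI untouched."""
    prefix = "s3n://"
    if uri.startswith(prefix):
        # Keep the bucket and key exactly as given; only the scheme changes.
        return "s3a://" + uri[len(prefix):]
    return uri


if __name__ == "__main__":
    print(to_s3a_uri("s3n://example-bucket/data/part-00000.csv"))
    print(to_s3a_uri("hdfs:///data/part-00000.csv"))
```

Only the scheme prefix is changed, so paths on HDFS or other file systems pass through as they are.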

PR Progress

Make sure all checkboxes below are checked before merging

[x] Merge check has no conflicts. PR checks passed.
[ ] Main reviewer approved
[ ] Optional reviewer approved

PangZhi commented 8 years ago

@lebinh will this break other parts, like Redshift and S3?

lebinh commented 8 years ago

So far, no. All PyClient tests passed except 2 cases, which failed because of a missing data file on HDFS for that instance. The unit tests for DDF passed on my machine but somehow failed on Jenkins; I'm not sure why yet.

hai-adatao commented 8 years ago

Hm... this looks familiar. I think @Huandao0812 may know something about this; he encountered it once.
