GoogleCloudDataproc / hadoop-connectors

Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform.
Apache License 2.0

Performance degradation when upgrading 3-2.2.8 #891

Open selimelawwa opened 1 year ago

selimelawwa commented 1 year ago

When upgrading from hadoop3-1.9.17 to hadoop3-2.2.8 (using the shaded jar of the new version), I saw a performance degradation that almost doubled the run time of my tests.

I also created this Stack Overflow question.

I have a performance test case that I run against my file system implementation, which uses org.apache.hadoop.fs.FileSystem. The test runs several operations [create, read, write, rename, checkIfExists, mkDir] on 100 files with multiple threads.
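For readers who want to reproduce this kind of measurement, here is a minimal sketch of a per-operation timing harness. It is not my actual test code: the class name OperationTimer, the FileOperation interface and the pool size of 8 are made up for illustration, and the operation in main is only a placeholder where a real org.apache.hadoop.fs.FileSystem call would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OperationTimer {

    /** One operation against one file path (hypothetical interface); may throw, e.g. IOException. */
    @FunctionalInterface
    public interface FileOperation {
        void run(String path) throws Exception;
    }

    /** Runs op once per path on a fixed thread pool and returns the mean latency in milliseconds. */
    public static double averageMillis(List<String> paths, int threads, FileOperation op)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<Long> task = () -> {
                    long start = System.nanoTime();
                    op.run(path);
                    return System.nanoTime() - start;
                };
                futures.add(pool.submit(task));
            }
            long totalNanos = 0;
            for (Future<Long> f : futures) {
                totalNanos += f.get(); // a failed operation surfaces here as ExecutionException
            }
            return paths.isEmpty() ? 0.0 : totalNanos / 1_000_000.0 / paths.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            paths.add("gs://some-bucket/object" + i);
        }
        // Placeholder operation. With a real FileSystem instance this could be
        // something like: path -> fs.exists(new Path(path))
        double avg = averageMillis(paths, 8, path -> Thread.sleep(1));
        System.out.printf("EXISTS (placeholder) avg: %.2f ms%n", avg);
    }
}
```

Timing each call individually and averaging afterwards keeps thread-pool queueing time out of the per-operation numbers, which matters when comparing two connector versions.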

I ran the same tests several times on both versions of the Hadoop connector, and the new one [2.2.8] shows a slower overall execution time (about 2-2.2x the old connector's time).

Below is a comparison between the average execution time for each operation while using each connector version:

operation    hadoop3-1.9.17    hadoop3-2.2.8    slowdown vs. old
READ         4542.71           10171.26         ~2.2x
RENAME       1347.75           4483.27          ~3.3x
EXISTS       47.23             1538.74          ~32.6x
CREATE       570.1             1539.81          ~2.7x
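As a sanity check on the comparison, the slowdown factor of each operation can be recomputed directly from the two averages. The snippet below is only an illustrative sketch; the class name SlowdownRatios and its helpers are hypothetical.

```java
import java.util.Locale;

public class SlowdownRatios {

    /** Slowdown factor of the new average relative to the old one. */
    public static double ratio(double oldAvg, double newAvg) {
        return newAvg / oldAvg;
    }

    /** Formats one row as "<operation> <factor>x" with one decimal place. */
    public static String format(String operation, double oldAvg, double newAvg) {
        return String.format(Locale.ROOT, "%-7s %.1fx", operation, ratio(oldAvg, newAvg));
    }

    public static void main(String[] args) {
        // Averages copied from the table above (1.9.17 first, 2.2.8 second).
        System.out.println(format("READ", 4542.71, 10171.26));
        System.out.println(format("RENAME", 1347.75, 4483.27));
        System.out.println(format("EXISTS", 47.23, 1538.74));
        System.out.println(format("CREATE", 570.1, 1539.81));
    }
}
```

With the averages from the table this gives roughly 2.2x for READ, 3.3x for RENAME, 32.6x for EXISTS and 2.7x for CREATE, so EXISTS is by far the most affected operation.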

I have checked this GitHub issue and tried to follow its recommendations for tuning performance through the configs/params, but saw no improvement.

Are there any guidelines on parameter configuration to improve the times of the operations above?

Or might this performance issue be due to some incompatibility among the jars on my classpath? Even though I am using the shaded jar, can other jars interfere?
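One way to investigate the classpath question is to print which jar each relevant class is actually loaded from at runtime. The following is a rough sketch using only JDK APIs; the class name JarLocator is made up, and the two class names checked in main are the ones I would assume are relevant here (the connector's GoogleHadoopFileSystem and Hadoop's FileSystem).

```java
import java.security.CodeSource;

public class JarLocator {

    /** Returns the location a class was loaded from, or a short note if it is unavailable. */
    public static String locate(String className) {
        try {
            // Load without initializing, so static initializers do not run.
            Class<?> cls = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader (JDK classes) have no code source.
            return source == null
                    ? "(no code source, e.g. JDK class)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
            "org.apache.hadoop.fs.FileSystem",
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the connector class resolves to a jar other than the shaded hadoop3-2.2.8 one, an older copy is shadowing it on the classpath.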

Here is a list of jars I have in my class path:

selimelawwa commented 1 year ago

My File class, which has methods like write, read, etc.:

class File {
    private String path;
    private FileSystem fs;

}

Here is how my write method is implemented:

    @Override
    public OutputStream write(boolean overwriteIfExists) throws IOException {
        // FileSystem.create expects an org.apache.hadoop.fs.Path, so wrap the String path.
        return fs.create(new Path(path), overwriteIfExists);
    }

And my read method:

    @Override
    public InputStream read() throws IOException {
        // FileSystem.open expects an org.apache.hadoop.fs.Path, so wrap the String path.
        return fs.open(new Path(path));
    }

My test case simply creates many threads, each with its own instance of a File object pointing to a different path (the path of a unique GCS bucket object, e.g. gs://some-bucket/objectX), and then performs an operation on it, for example a read.