aliyun / aliyun-emapreduce-datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.
http://www.aliyun.com/product/emapreduce
Artistic License 2.0

Timeout error occurs when using Hadoop FileSystem to rename a big file with the JindoFS SDK #465

Closed chinhuko closed 4 years ago

chinhuko commented 4 years ago

Describe the bug

I want to get better performance when operating on OSS files on Aliyun OSS. Based on the documentation, I made some configurations following the tips on GitHub.

I used the following code to test the rename operation:

  // Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
  // org.apache.hadoop.fs.Path, java.net.URI, java.io.IOException, java.util.Map
  protected Map<String, Object> run(Map<String, Object> args) throws IOException {
    Configuration configuration = new Configuration();
    // Use the JindoFS SDK implementations for the oss:// scheme
    configuration.set("fs.AbstractFileSystem.oss.impl", "com.aliyun.emr.fs.oss.OSS");
    configuration.set("fs.oss.impl", "com.aliyun.emr.fs.oss.JindoOssFileSystem");
    // OSS credentials and endpoint
    configuration.set("fs.jfs.cache.oss-accessKeyId", accessKeyId);
    configuration.set("fs.jfs.cache.oss-accessKeySecret", accessKeySecret);
    configuration.set("fs.jfs.cache.oss-endpoint", ossEndpoint);
    try {
      FileSystem fileSystem = FileSystem.get(URI.create("oss://***/"), configuration);
      log.info("start move file");
      // rename returns false (rather than throwing) when it does not take effect
      boolean renamed = fileSystem.rename(
          new Path("/ethan.ge/sourceDir/bigfile"), new Path("/ethan.ge/targetDir/bigfile"));
      log.info("moved file, success = " + renamed);
    } catch (Exception e) {
      throw new RuntimeException(String.format("failed to connect to HDFS: %s", e.getMessage()), e);
    }
    return null;
  }
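
For reference, here is a minimal sketch (a hypothetical helper of my own, not part of the code above) of how the per-operation timings mentioned below could be measured directly, assuming the same FileSystem instance, Path arguments, and log field:

  // Hypothetical helper: times a single rename and logs the result and elapsed milliseconds.
  private long timedRename(FileSystem fs, Path src, Path dst) throws IOException {
    long start = System.currentTimeMillis();
    boolean ok = fs.rename(src, dst); // false means the rename did not take effect
    long elapsedMs = System.currentTimeMillis() - start;
    log.info("rename " + src + " -> " + dst + " success=" + ok + " took " + elapsedMs + " ms");
    return elapsedMs;
  }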

When I test renaming small files (5 KB, 1 MB, 5 MB, 10 MB), it works fine, and the console output shows operation times ranging from 200 ms to 500 ms. When I test the big file shown below, something goes wrong.

[screenshot: the big file used in the test]

I ran it successfully about five or six times, but after that the code always threw a timeout error. The success log and error log are listed below:

To Reproduce

Steps to reproduce the behavior:

  1. Configure pom.xml in IDEA as below:
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.9.2</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>bigboot</groupId>
      <artifactId>jindofs</artifactId>
      <version>0.0.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
      </exclusions>
      <scope>system</scope>
      <systemPath>${project.basedir}/src/main/resources/libs/jindofs-sdk-2.7.1.jar</systemPath>
    </dependency>
  2. Run the code shown above.
  3. See error.

Expected behavior

I want to find out why the first several tests succeeded but the later ones failed. Based on the test results, moving the 5 GB file took about 4.86 s, so it should not take more than 30000 ms and time out.

The strange thing is that it succeeded the first several times. I want to know why the problem occurs and how to solve it.

Thank you very much if you can help to figure out the problem.


chinhuko commented 4 years ago

Hope to get your help. @uncleGen

uncleGen commented 4 years ago

cc @drankye

wsu13 commented 4 years ago

@chinhuko Could you try setting client.oss.timeout.millisecond to a bigger value in bigboot.cfg, and setting fs.jfs.cache.copy.simple.max.byte to -1 in core-site.xml?
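
For illustration only (a sketch, not documented defaults; the 60000 value and the exact bigboot.cfg layout are assumptions on my part), the suggested change could look like a key=value entry such as client.oss.timeout.millisecond = 60000 in bigboot.cfg, plus the following property in core-site.xml:

  <!-- core-site.xml: set fs.jfs.cache.copy.simple.max.byte to -1, as suggested above -->
  <property>
    <name>fs.jfs.cache.copy.simple.max.byte</name>
    <value>-1</value>
  </property>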

chinhuko commented 4 years ago
wsu13 commented 4 years ago

@chinhuko It looks like a network issue. Please retry later.

uncleGen commented 4 years ago

We're closing this issue because it hasn't been updated in a while. If you still have any questions, please reopen it!