aliyun / aliyun-emapreduce-datasources

Extended datasource support for Spark/Hadoop on Aliyun E-MapReduce.
http://www.aliyun.com/product/emapreduce
Artistic License 2.0

Timeout error occurs when using Hadoop FileSystem to rename a big file with the JindoFS SDK #465

Closed chinhuko closed 4 years ago

chinhuko commented 4 years ago

Describe the bug

I want to get better performance when operating on OSS files on Aliyun OSS. Based on the documentation, I made some configurations following the tips on GitHub.

I used the following code to test the rename operation:

  // Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
  // org.apache.hadoop.fs.Path, java.net.URI, java.io.IOException, java.util.Map
  protected Map<String, Object> run(Map<String, Object> args) throws IOException {
    Configuration configuration = new Configuration();
    // Use the JindoFS SDK implementations for the oss:// scheme
    configuration.set("fs.AbstractFileSystem.oss.impl", "com.aliyun.emr.fs.oss.OSS");
    configuration.set("fs.oss.impl", "com.aliyun.emr.fs.oss.JindoOssFileSystem");
    // OSS credentials and endpoint
    configuration.set("fs.jfs.cache.oss-accessKeyId", accessKeyId);
    configuration.set("fs.jfs.cache.oss-accessKeySecret", accessKeySecret);
    configuration.set("fs.jfs.cache.oss-endpoint", ossEndpoint);
    try {
      FileSystem fileSystem = FileSystem.get(URI.create("oss://***/"), configuration);
      log.info("start move file");
      // rename returns false (rather than throwing) when it does not take effect
      boolean renamed = fileSystem.rename(
          new Path("/ethan.ge/sourceDir/bigfile"), new Path("/ethan.ge/targetDir/bigfile"));
      log.info("moved file, success = " + renamed);
    } catch (Exception e) {
      throw new RuntimeException(String.format("failed to connect to HDFS: %s", e.getMessage()), e);
    }
    return null;
  }
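
For reference, here is a minimal sketch (a hypothetical helper of my own, not part of the code above) of how the per-operation timings mentioned below could be measured directly, assuming the same FileSystem instance, Path arguments, and log field:

  // Hypothetical helper: times a single rename and logs the result and elapsed milliseconds.
  private long timedRename(FileSystem fs, Path src, Path dst) throws IOException {
    long start = System.currentTimeMillis();
    boolean ok = fs.rename(src, dst); // false means the rename did not take effect
    long elapsedMs = System.currentTimeMillis() - start;
    log.info("rename " + src + " -> " + dst + " success=" + ok + " took " + elapsedMs + " ms");
    return elapsedMs;
  }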

When I test renaming small files (5 KB, 1 MB, 5 MB, 10 MB), it works fine, and the console output shows operation times ranging from 200 ms to 500 ms. When I test the big file shown below, something goes wrong.

[screenshot: the big file used in the test]

I ran it successfully about five or six times, but after that the code always threw a timeout error. The success log and error log are listed below:

To Reproduce

Steps to reproduce the behavior:

  1. Configure pom.xml in IDEA as below:
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.9.2</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>bigboot</groupId>
      <artifactId>jindofs</artifactId>
      <version>0.0.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
        </exclusion>
      </exclusions>
      <scope>system</scope>
      <systemPath>${project.basedir}/src/main/resources/libs/jindofs-sdk-2.7.1.jar</systemPath>
    </dependency>
  2. Run the code shown above.
  3. See error.

Expected behavior

I want to find out why the first several tests succeeded but the later ones failed. Based on the test results, moving the 5 GB file took about 4.86 s, so it should not take more than 30000 ms and time out.

The strange thing is that it succeeded the first several times. I want to know why the problem occurs and how to solve it.

Thank you very much if you can help to figure out the problem.


chinhuko commented 4 years ago

Hope to get your help. @uncleGen

uncleGen commented 4 years ago

cc @drankye

wsu13 commented 4 years ago

@chinhuko Could you try setting client.oss.timeout.millisecond to a bigger value in bigboot.cfg, and setting fs.jfs.cache.copy.simple.max.byte to -1 in core-site.xml?
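
For illustration only (a sketch, not documented defaults; the 60000 value and the exact bigboot.cfg layout are assumptions on my part), the suggested change could look like a key=value entry such as client.oss.timeout.millisecond = 60000 in bigboot.cfg, plus the following property in core-site.xml:

  <!-- core-site.xml: set fs.jfs.cache.copy.simple.max.byte to -1, as suggested above -->
  <property>
    <name>fs.jfs.cache.copy.simple.max.byte</name>
    <value>-1</value>
  </property>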

chinhuko commented 4 years ago
wsu13 commented 4 years ago

@chinhuko It looks like a network issue. Please retry later.

uncleGen commented 4 years ago

We're closing this issue because it hasn't been updated in a while. If you still have any questions, please reopen it!