Alluxio / alluxio

Bugs found during the use of Alluxio Java SDK #18540

Open Never-D opened 4 months ago

Never-D commented 4 months ago

Alluxio Version: The server version is 2.9.3 (Java SDK referenced via: implementation("org.alluxio:alluxio-shaded-client:2.9.3"))

Describe the bug: My service wraps an HTTP file download interface around the SDK, with nginx as a reverse proxy in front of it. When testing this HTTP interface, we found that some downloads fail when the concurrency is between 100 and 1000 (the downloaded file is approximately 160 MB). The specific error message is: org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe.

My code is as follows:

public void testAlluxioDownload(HttpServletResponse response, String path) {
    AlluxioProperties alluxioProperties = new AlluxioProperties();
    alluxioProperties.set(PropertyKey.MASTER_EMBEDDED_JOURNAL_ADDRESSES, alluxioConfig.getMasterAddress());
    InstancedConfiguration conf = new InstancedConfiguration(alluxioProperties);
    // A new Alluxio FileSystem client is created for every request.
    FileSystem fileSystem = FileSystem.Factory.create(conf);
    URIStatus status;
    AlluxioURI alluxioURI = new AlluxioURI(path);
    try {
        if (!fileSystem.exists(alluxioURI)) {
            fileSystem.loadMetadata(alluxioURI);

            if (!fileSystem.exists(alluxioURI)) {
                HttpTool.setFailedResponseMessage(response, HttpStatus.NOT_FOUND.value(), "File does not exist");
                return;
            }
        }
        status = fileSystem.getStatus(alluxioURI);
    } catch (Throwable e) {
        throw new RuntimeException(e);
    }
    try (FileInStream fileInputStream = fileSystem.openFile(alluxioURI);
         ServletOutputStream outputStream = response.getOutputStream()) {
        response.setContentType(MediaType.APPLICATION_OCTET_STREAM_VALUE);
        long fileSize = status.getLength();
        if (fileSize != -1) {
            response.setHeader("Content-Length", String.valueOf(fileSize));
        }
        response.setHeader(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename="
                + new String(alluxioURI.getName().getBytes(StandardCharsets.UTF_8)));
        // Stream the file from Alluxio to the HTTP response in 1 KB chunks.
        IOUtils.copy(fileInputStream, outputStream, 1024);
        response.setStatus(HttpStatus.OK.value());
    } catch (OpenDirectoryException | FileDoesNotExistException | FileIncompleteException e) {
        throw new RuntimeException(e);
    } catch (Throwable e) {
        throw new RuntimeException(e);
    }
}

To Reproduce: Start a Spring Boot service with the code and SDK above, then run wget concurrently against the interface to reproduce the scenario described.

Expected behavior: I hope you can provide a solution or a fix plan.

Urgency: This bug prevents our Alluxio deployment from serving highly concurrent file downloads, which seriously affects usage.

Are you planning to fix it: I am currently unsure whether the problem is caused by missing client or server configuration or by a bug in the code itself, so I do not have a fix plan yet.

Additional context: If you are unable to fix it in a timely manner, you can also send me the fix or a workaround and I will try to apply it myself.

YichuanSun commented 4 months ago

Could you share your Alluxio configuration? Then I can check whether it is just a configuration problem.

Never-D commented 4 months ago

@YichuanSun Master and worker node specs: 4 cores, 32 GB RAM

master configuration:

alluxio.master.hostname=${localip}
alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200  
alluxio.master.mount.table.root.ufs=cos://lt-cubesats-alluxio-prod/alluxio/
fs.cos.access.key=${cos_cubesats_alluxio_accessKeyId}
fs.cos.app.id=1259571579
fs.cos.connection.max=4096
fs.cos.connection.timeout=50sec
fs.cos.region=ap-nanjing
fs.cos.secret.key=${cos_cubesats_alluxio_secretKey}
fs.cos.socket.timeout=50sec

# Enable automatic cache loading and set the cache directory: data is cached there immediately after upload and after discovery in the object store
alluxio.master.data.async.cache.enabled=true
alluxio.master.data.async.cache.file.path=/shein-os/cos-alluxio/data

alluxio.user.file.replication.durable=2
alluxio.master.worker.timeout=180sec
# Metadata refresh interval
alluxio.user.file.metadata.sync.interval=30min
# Metadata management
alluxio.master.metastore.dir=/data01/metastore
alluxio.master.journal.folder=/data01/journal
alluxio.security.authorization.permission.enabled=false
# User impersonation
alluxio.master.security.impersonation.hadoop.users=*
alluxio.master.security.impersonation.hadoop.groups=*
alluxio.master.security.impersonation.client.users=*
alluxio.master.security.impersonation.client.groups=*
alluxio.master.security.impersonation.yarn.users=*
alluxio.master.security.impersonation.yarn.groups=*

# Disable local caching of data stored remotely in Alluxio
#alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE

# Work around zero-byte empty files being unexpectedly created in the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false  
# FUSE monitoring configuration
alluxio.fuse.web.enabled=true
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.CapacityBasedDeterministicHashPolicy
alluxio.user.client.cache.enabled=true
alluxio.user.client.cache.store.type=LOCAL
alluxio.user.client.cache.dirs=/home/hadoop
alluxio.user.client.cache.size=10GB
alluxio.user.client.cache.page.size=4MB

alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.client.cache.timeout.duration=30min
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB

#alluxio.user.block.size.bytes.default=16MB

# Increase the number of worker web handler threads
alluxio.web.threads=4000
alluxio.network.connection.health.check.timeout.ms=180sec

alluxio.web.threaddump.log.enabled=true

alluxio.master.rpc.executor.max.pool.size=4000
alluxio.master.rpc.executor.core.pool.size=4000

alluxio.user.network.data.timeout.ms=30min
alluxio.user.streaming.data.timeout=30min

worker configuration:

alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200   

# Cache configuration: enable two storage tiers
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
alluxio.worker.tieredstore.level0.dirs.quota=1GB
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/data01/alluxio-namespace
alluxio.worker.tieredstore.level1.dirs.mediumtype=HDD
alluxio.worker.tieredstore.level1.dirs.quota=700GB

# Disable local caching of data stored remotely in Alluxio
alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.70
alluxio.worker.tieredstore.level1.watermark.high.ratio=0.70

alluxio.security.authorization.permission.enabled=false
alluxio.network.ip.address.used=true

# Work around zero-byte empty files being unexpectedly created in the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false

alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.close.timeout=30s

alluxio.consul.enabled=true
alluxio.consul.url=http://xxxx
alluxio.consul.service.name=prod-alluxio-server-ci-east-worker
alluxio.service.env.type=prod
alluxio.consul.service.tag=type=type=worker,disk=ssd,model=m6,cmdb-app-name=ci-alluxio,cmdb-name=ci-alluxio-cneast-prod-main

# Increase the number of worker web handler threads
alluxio.web.threads=4000

alluxio.network.connection.health.check.timeout.ms=180sec

alluxio.web.threaddump.log.enabled=true

alluxio.worker.management.load.detection.cool.down.time=60sec
alluxio.worker.free.space.timeout=180sec
alluxio.worker.master.periodical.rpc.timeout=30min
alluxio.worker.memory.size=21GB
alluxio.worker.network.block.reader.threads.max=4000
alluxio.worker.network.keepalive.time=30min
alluxio.worker.network.keepalive.timeout=30min
alluxio.worker.network.permit.keepalive.time=30min
alluxio.worker.network.netty.worker.threads=8
alluxio.worker.block.master.client.pool.size=30
alluxio.worker.rpc.executor.core.pool.size=4000
alluxio.worker.rpc.executor.max.pool.size=4000

The client only configures the master addresses. In addition, we found that when exposing an HTTP download interface on the worker nodes, no problems occur if only one worker serves the downloads; however, if multiple worker nodes serve downloads at the same time, some of the files fail to download, similar to what we see with the SDK.

The code for the download interface is as follows:

  @GET
  @Path(PATH_PARAM)
  @ApiOperation(value = "Download the given file at the path", response = java.io.InputStream.class)
  @Produces(MediaType.APPLICATION_OCTET_STREAM)
  public Response downloadFile(@PathParam("path") final String path) throws IOException, AlluxioException {
    AlluxioURI uri = new AlluxioURI("/" + path);
    FileInStream is;
    URIStatus status;
    try {
      if (!mFsClient.exists(uri)) {
        mFsClient.loadMetadata(uri);

        if (!mFsClient.exists(uri)) {
          return Response.noContent().build();
        }
      }

//      is = mFsClient.openFile(uri);
      status = mFsClient.getStatus(uri);
    } catch (IOException | AlluxioException e) {
      return Response.status(500).entity(e.getMessage()).build();
    }

    StreamingOutput fileStream = output -> {
      try (FileInStream input = mFsClient.openFile(uri)) {
        byte[] buffer = new byte[1024];
        int length;
        while ((length = input.read(buffer)) != -1) {
          output.write(buffer, 0, length);
          output.flush();
        }
      } catch (AlluxioException e) {
        throw new RuntimeException(e);
      }
    };

    try {
      return Response.ok(fileStream)
          .header("Content-Disposition", "attachment; filename=" + uri.getName())
          .header("Content-Length", status.getLength())
          .build();
    } catch (Exception e) {
      return Response.status(500).entity(e.getMessage()).build();
    }
  }

jasondrogba commented 4 months ago

Have you found errors in the master, worker and proxy logs? Can you share the logs?

if only one worker is used for downloading, no problems are found; However, if multiple worker nodes are called to download files at the same time, some of the files will fail to download, similar to the phenomenon of using SDK.

You can try increasing the alluxio.user.block.master.client.pool.size.max property.
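
As a minimal sketch, that property could be raised in the client's conf/alluxio-site.properties (or set programmatically the same way MASTER_EMBEDDED_JOURNAL_ADDRESSES is set in the snippet above); the value below is only an illustration, not a recommendation from this thread:

# Illustrative value only; tune it to your concurrency level.
alluxio.user.block.master.client.pool.size.max=512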

Never-D commented 4 months ago

@jasondrogba The error log is: org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe

Never-D commented 4 months ago

@jasondrogba Is there any solution to the concurrency issue we hit after adding a proxy-like download interface to the worker node?

jasondrogba commented 4 months ago

@jasondrogba The error log is: org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe

@Never-D I guess this error message comes from Spring Boot? The timeout configured in Tomcat or nginx may be too small. https://stackoverflow.com/questions/43825908/org-apache-catalina-connector-clientabortexception-java-io-ioexception-apr-err

Most likely, your server is taking too long to respond and the client is getting bored and closing the connection. A bit more explanation: tomcat receives a request on a connection and tries to fulfill it. Imagine this takes 3 minutes, now, if the client has a timeout of say 2 minutes, it will close the connection and when tomcat finally comes back to try to write the response, the connection is closed and it throws an org.apache.catalina.connector.ClientAbortException.

I think you can increase the timeouts of the Spring Boot server and nginx, or increase the CPU and memory of the Alluxio nodes. Please share the Alluxio logs so we can look at what causes the concurrent processing to time out; it is difficult to determine the cause from the error message alone. Have you found any errors in master.log and worker.log under alluxio/logs? I hope you can share them.
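
For reference, a minimal sketch of raising the servlet-side timeout, assuming Spring Boot 2.x with the default embedded Tomcat (the property name and the 5-minute value are assumptions, not something confirmed in this thread); the analogous nginx knobs are the proxy_read_timeout, proxy_send_timeout and send_timeout directives in the proxy location block:

# application.properties - illustrative value only
server.tomcat.connection-timeout=5m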

Never-D commented 4 months ago

@jasondrogba

2024-03-19 11:18:58,227 INFO ALLUXIO-PROXY-WEB-SERVICE-224 - Alluxio S3 API received GET request: URI=http://alluxio-test-proxy.dev.sheincorp.cn/api/v1/paths/%2Fshein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz/download-file User=null Media Type=null Query Parameters={} Path Parameters={}
2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.

Never-D commented 4 months ago

@jasondrogba When downloading the 162 MB file, the downloaded size is incorrect, but there is no error message:

2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]

jasondrogba commented 4 months ago

This worker will be skipped for future read operations

Based on this line, the error comes from AlluxioFileInStream. You can take a look at https://github.com/Alluxio/alluxio/issues/16094 and https://github.com/Alluxio/alluxio/pull/16096

export ALLUXIO_FUSE_JAVA_OPTS="-XX:MaxDirectMemorySize=128m"

You can try increasing MaxDirectMemorySize. @secfree Hi, do you have any idea about this error? You have more experience and could help with this issue.

YichuanSun commented 4 months ago

One possible reason: you have to close the FileSystem instance at the end of your code, otherwise these FileSystem objects leak resources. @Never-D

YichuanSun commented 4 months ago

One possible reason: you have to close the FileSystem instance at the end of your code, otherwise these FileSystem objects leak resources. @Never-D

Especially in such a high concurrency case.
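
A minimal sketch of that suggestion, reusing the client API already shown in the first snippet: create one FileSystem for the whole service, share it across download requests, and close it on shutdown. The class and field names here are hypothetical.

import alluxio.client.file.FileSystem;
import alluxio.conf.AlluxioProperties;
import alluxio.conf.InstancedConfiguration;
import alluxio.conf.PropertyKey;

public class AlluxioClientHolder implements AutoCloseable {
    private final FileSystem fileSystem;

    public AlluxioClientHolder(String journalAddresses) {
        AlluxioProperties props = new AlluxioProperties();
        props.set(PropertyKey.MASTER_EMBEDDED_JOURNAL_ADDRESSES, journalAddresses);
        // One shared client for all download requests instead of one per call.
        fileSystem = FileSystem.Factory.create(new InstancedConfiguration(props));
    }

    public FileSystem get() {
        return fileSystem;
    }

    @Override
    public void close() throws Exception {
        // Releases the client's connections and thread pools when the service shuts down.
        fileSystem.close();
    }
}

Registering a single holder like this as a Spring bean (and closing it in a shutdown hook) avoids creating and leaking one client per request under high concurrency.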

secfree commented 4 months ago

Hi @Never-D

2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.

Can you check the log of alluxio-worker? Generally this is caused by a shortage of direct memory on the alluxio-worker side when reading concurrently. Increasing the value of -XX:MaxDirectMemorySize for the alluxio-worker process may help.
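
A minimal sketch of that change, assuming it goes into conf/alluxio-env.sh on each worker (the 8g value is only an illustration and must fit within the host's memory):

export ALLUXIO_WORKER_JAVA_OPTS="${ALLUXIO_WORKER_JAVA_OPTS} -XX:MaxDirectMemorySize=8g"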

For the IOException: Broken pipe exception, one idea is to catch it and retry on your HTTP service side.
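
A minimal sketch of that idea applied to the testAlluxioDownload method above: ClientAbortException extends IOException, so it can be caught explicitly and downgraded to a warning instead of failing the request (LOG is an assumed SLF4J logger, not part of the original code):

try (FileInStream fileInputStream = fileSystem.openFile(alluxioURI);
     ServletOutputStream outputStream = response.getOutputStream()) {
    IOUtils.copy(fileInputStream, outputStream, 1024);
    response.setStatus(HttpStatus.OK.value());
} catch (org.apache.catalina.connector.ClientAbortException e) {
    // The client (or nginx) closed the connection mid-download; nothing more can be written,
    // so log it and let the caller decide whether to retry the request.
    LOG.warn("Client aborted download of {}", alluxioURI, e);
} catch (Throwable e) {
    throw new RuntimeException(e);
}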

Never-D commented 4 months ago

@YichuanSun The error I sent was a proxy error

Never-D commented 4 months ago

There are no error logs in the worker node

Never-D commented 4 months ago

When downloading the 162 MB file, the downloaded size is incorrect, but there is no error message:

2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]