Never-D opened this issue 4 months ago
Could you share your Alluxio configuration? Then I can check whether it is just a configuration problem.
@YichuanSun Master and worker nodes: 4 cores, 32 GB RAM.
master configuration:
alluxio.master.hostname=${localip}
alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200
alluxio.master.mount.table.root.ufs=cos://lt-cubesats-alluxio-prod/alluxio/
fs.cos.access.key=${cos_cubesats_alluxio_accessKeyId}
fs.cos.app.id=1259571579
fs.cos.connection.max=4096
fs.cos.connection.timeout=50sec
fs.cos.region=ap-nanjing
fs.cos.secret.key=${cos_cubesats_alluxio_secretKey}
fs.cos.socket.timeout=50sec
# Enable automatic async caching and set the cache directory: files are cached immediately after upload or after being discovered in the object store
alluxio.master.data.async.cache.enabled=true
alluxio.master.data.async.cache.file.path=/shein-os/cos-alluxio/data
alluxio.user.file.replication.durable=2
alluxio.master.worker.timeout=180sec
# Metadata sync interval
alluxio.user.file.metadata.sync.interval=30min
# Metadata management
alluxio.master.metastore.dir=/data01/metastore
alluxio.master.journal.folder=/data01/journal
alluxio.security.authorization.permission.enabled=false
# User impersonation
alluxio.master.security.impersonation.hadoop.users=*
alluxio.master.security.impersonation.hadoop.groups=*
alluxio.master.security.impersonation.client.users=*
alluxio.master.security.impersonation.client.groups=*
alluxio.master.security.impersonation.yarn.users=*
alluxio.master.security.impersonation.yarn.groups=*
# Disable local caching of data stored remotely in Alluxio
#alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE
# Workaround for zero-byte placeholder files unexpectedly appearing in the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false
# FUSE monitoring configuration
alluxio.fuse.web.enabled=true
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.CapacityBasedDeterministicHashPolicy
alluxio.user.client.cache.enabled=true
alluxio.user.client.cache.store.type=LOCAL
alluxio.user.client.cache.dirs=/home/hadoop
alluxio.user.client.cache.size=10GB
alluxio.user.client.cache.page.size=4MB
alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.client.cache.timeout.duration=30min
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
#alluxio.user.block.size.bytes.default=16MB
# Increase the number of web handler threads
alluxio.web.threads=4000
alluxio.network.connection.health.check.timeout.ms=180sec
alluxio.web.threaddump.log.enabled=true
alluxio.master.rpc.executor.max.pool.size=4000
alluxio.master.rpc.executor.core.pool.size=4000
alluxio.user.network.data.timeout.ms=30min
alluxio.user.streaming.data.timeout=30min
worker configuration:
alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200
# Cache configuration: enable a second storage tier
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
alluxio.worker.tieredstore.level0.dirs.quota=1GB
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/data01/alluxio-namespace
alluxio.worker.tieredstore.level1.dirs.mediumtype=HDD
alluxio.worker.tieredstore.level1.dirs.quota=700GB
# Disable local caching of data stored remotely in Alluxio
alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.70
alluxio.worker.tieredstore.level1.watermark.high.ratio=0.70
alluxio.security.authorization.permission.enabled=false
alluxio.network.ip.address.used=true
# Workaround for zero-byte placeholder files unexpectedly appearing in the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.close.timeout=30s
alluxio.consul.enabled=true
alluxio.consul.url=http://xxxx
alluxio.consul.service.name=prod-alluxio-server-ci-east-worker
alluxio.service.env.type=prod
alluxio.consul.service.tag=type=type=worker,disk=ssd,model=m6,cmdb-app-name=ci-alluxio,cmdb-name=ci-alluxio-cneast-prod-main
# Increase the number of web handler threads
alluxio.web.threads=4000
alluxio.network.connection.health.check.timeout.ms=180sec
alluxio.web.threaddump.log.enabled=true
alluxio.worker.management.load.detection.cool.down.time=60sec
alluxio.worker.free.space.timeout=180sec
alluxio.worker.master.periodical.rpc.timeout=30min
alluxio.worker.memory.size=21GB
alluxio.worker.network.block.reader.threads.max=4000
alluxio.worker.network.keepalive.time=30min
alluxio.worker.network.keepalive.timeout=30min
alluxio.worker.network.permit.keepalive.time=30min
alluxio.worker.network.netty.worker.threads=8
alluxio.worker.block.master.client.pool.size=30
alluxio.worker.rpc.executor.core.pool.size=4000
alluxio.worker.rpc.executor.max.pool.size=4000
The client only has the master addresses configured. In addition, we found that when exposing an HTTP download interface on the worker nodes, downloads through a single worker always succeed; however, when multiple worker nodes serve downloads at the same time, some of the downloads fail, which is the same symptom we see when using the SDK directly.
The code for the download interface is as follows:
@GET
@Path(PATH_PARAM)
@ApiOperation(value = "Download the given file at the path", response = java.io.InputStream.class)
@Produces(MediaType.APPLICATION_OCTET_STREAM)
public Response downloadFile(@PathParam("path") final String path) throws IOException, AlluxioException {
  AlluxioURI uri = new AlluxioURI("/" + path);
  URIStatus status;
  try {
    if (!mFsClient.exists(uri)) {
      mFsClient.loadMetadata(uri);
      if (!mFsClient.exists(uri)) {
        return Response.noContent().build();
      }
    }
    // is = mFsClient.openFile(uri);
    status = mFsClient.getStatus(uri);
  } catch (IOException | AlluxioException e) {
    return Response.status(500).entity(e.getMessage()).build();
  }
  StreamingOutput fileStream = output -> {
    try (FileInStream input = mFsClient.openFile(uri)) {
      byte[] buffer = new byte[1024];
      int length;
      while ((length = input.read(buffer)) != -1) {
        output.write(buffer, 0, length);
        output.flush();
      }
    } catch (AlluxioException e) {
      throw new RuntimeException(e);
    }
  };
  try {
    return Response.ok(fileStream)
        .header("Content-Disposition", "attachment; filename=" + uri.getName())
        .header("Content-Length", status.getLength())
        .build();
  } catch (Exception e) {
    return Response.status(500).entity(e.getMessage()).build();
  }
}
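One incidental issue in the copy loop above: a 1 KB buffer is quite small for ~160 MB files, and flushing on every chunk produces a large number of tiny writes. A minimal sketch of the same loop with a 64 KB buffer and a single flush at the end (generic Java streams, not Alluxio-specific):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch only: a generic copy loop with a larger buffer.
// Flushing once at the end avoids a syscall per 1 KB chunk.
public class StreamCopy {
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

This does not fix the concurrency failures by itself, but it reduces per-request overhead under load.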
Have you found errors in the master, worker and proxy logs? Can you share the logs?
If only one worker is used for downloading, no problems occur; however, if multiple worker nodes download files at the same time, some downloads fail, similar to what we see with the SDK.
alluxio.user.block.master.client.pool.size.max
You can try increasing this property.
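For reference, this would go in conf/alluxio-site.properties on the client side; the value below is only an illustrative guess to tune from, not a recommendation:

```properties
# The default pool size is fairly small; under heavy concurrency the
# block-master client pool can become a bottleneck. Illustrative value only.
alluxio.user.block.master.client.pool.size.max=1024
```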
@jasondrogba The error log is: java.io.IOException: Broken pipe (org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe).
@jasondrogba Is there a solution to the concurrency issue that arises when a proxy-like download interface is added to the worker nodes?
@jasondrogba The error log is: java.io.IOException: Broken pipe (org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe).
@Never-D I guess this error message comes from Spring Boot? It may be that the timeout in your Tomcat or nginx configuration is too short. https://stackoverflow.com/questions/43825908/org-apache-catalina-connector-clientabortexception-java-io-ioexception-apr-err
Most likely, your server is taking too long to respond and the client is getting bored and closing the connection. A bit more explanation: tomcat receives a request on a connection and tries to fulfill it. Imagine this takes 3 minutes, now, if the client has a timeout of say 2 minutes, it will close the connection and when tomcat finally comes back to try to write the response, the connection is closed and it throws an org.apache.catalina.connector.ClientAbortException.
I think you can increase the timeout of the Spring Boot server and nginx, or increase the CPU and memory of the Alluxio nodes. Please share the Alluxio logs and let's take a look at what causes the concurrent processing timeout; it's difficult to determine the cause from the error you shared alone. Have you found any errors in master.log and worker.log under alluxio/logs? I hope you can share the errors in the Alluxio logs.
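For example, timeouts like these could be raised (the property names assume Spring Boot's embedded Tomcat and a standard nginx proxy setup; the 300s values are illustrative only):

```properties
# application.properties (Spring Boot / embedded Tomcat):
server.tomcat.connection-timeout=300s

# nginx server/location block:
#   proxy_read_timeout 300s;
#   proxy_send_timeout 300s;
```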
@jasondrogba
2024-03-19 11:18:58,227 INFO ALLUXIO-PROXY-WEB-SERVICE-224 - Alluxio S3 API received GET request: URI=http://alluxio-test-proxy.dev.sheincorp.cn/api/v1/paths/%2Fshein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz/download-file User=null Media Type=null Query Parameters={} Path Parameters={}
2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.
@jasondrogba When downloading a 162 MB file, the downloaded size is incorrect even though there is no error message: 2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]
This worker will be skipped for future read operations
According to this line, I found that the error comes from AlluxioFileInStream; you can take a look at https://github.com/Alluxio/alluxio/issues/16094 and https://github.com/Alluxio/alluxio/pull/16096
export ALLUXIO_FUSE_JAVA_OPTS="-XX:MaxDirectMemorySize=128m"
You can try increasing MaxDirectMemorySize. @secfree Hi, do you have any idea about this error? I think you have more experience here and could help with this issue.
One possible reason: you have to close the FileSystem instance at the end of your code, otherwise these FileSystem objects leak resources, especially in such a high-concurrency case. @Never-D
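A minimal sketch of that pattern (FileSystem.Factory.create() and close() are standard Alluxio 2.x client calls, but the holder class and lifecycle hooks below are illustrative, not part of Alluxio):

```java
import alluxio.client.file.FileSystem;

// Sketch: create one shared FileSystem for the whole service and close it
// on shutdown, instead of leaking per-request instances.
public class AlluxioClientHolder {
    private static final FileSystem FS = FileSystem.Factory.create();

    public static FileSystem get() {
        return FS;
    }

    // Call from a shutdown hook / @PreDestroy so the client's resources
    // (thread pools, channels) are released exactly once.
    public static void shutdown() throws Exception {
        FS.close();
    }
}
```

With a single shared client, the per-request StreamingOutput only opens and closes FileInStream objects, which is much cheaper under high concurrency.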
Hi @Never-D
2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.
Can you check the log of the alluxio-worker? Generally this is caused by a shortage of direct memory on the alluxio-worker side when reading concurrently. Increasing the value of -XX:MaxDirectMemorySize for the alluxio-worker process may help.
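For the worker process this is typically set in conf/alluxio-env.sh; the 8g below is only an illustrative starting point to tune against your 32 GB nodes:

```properties
# conf/alluxio-env.sh
ALLUXIO_WORKER_JAVA_OPTS="-XX:MaxDirectMemorySize=8g"
```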
For the IOException: Broken pipe exception, one idea is to catch it and retry on your HTTP service side.
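A generic sketch of such a retry helper (hypothetical code, not an Alluxio API; note that for a streaming response this only helps if the failure happens before any bytes have been written to the client):

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch: retry a task a few times when an IOException such as
// "Broken pipe" occurs; any other exception propagates immediately.
public class RetryOnIo {
    public static <T> T run(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Only IOException is retried because a broken pipe is usually transient; wrapping business exceptions in retries would just mask real bugs.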
@YichuanSun The error I sent was a proxy error.
There are no error logs on the worker nodes.
When downloading a 162 MB file, the downloaded size is incorrect even though there is no error message: 2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]
Alluxio Version: the server version is 2.9.3 (Java SDK dependency: implementation("org.alluxio:alluxio-shaded-client:2.9.3"))
Describe the bug: My service wraps the SDK in an HTTP file-download interface, with nginx as a reverse proxy in front of it. When testing this HTTP interface, we found partial download failures at concurrency levels between 100 and 1000 (the downloaded package is approximately 160 MB). The specific error message is: java.io.IOException: Broken pipe (org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe).
My code is as follows:
To Reproduce: Start a Spring Boot service using the code above and the SDK, then run wget concurrently to reproduce the scenario described.
Expected behavior: A solution or a fix plan.
Urgency: This bug prevents our Alluxio deployment from serving high-concurrency file downloads, which seriously affects usage.
Are you planning to fix it: I am currently unsure whether the problem is caused by missing client or server configuration, or by a bug in the code itself, so I do not have a fix plan yet.
Additional context: If you cannot fix it promptly, you can also send me the fix or workaround, and I will try to apply it myself.