Alluxio / alluxio

Alluxio, data orchestration for analytics and machine learning in the cloud
https://www.alluxio.io
Apache License 2.0
6.84k stars 2.94k forks source link

Alluxio master getting slower overtime #14414

Open AlexChenSkyBlue opened 2 years ago

AlexChenSkyBlue commented 2 years ago

Alluxio Version: What version of Alluxio are you using? 2.6.2 Describe the bug A clear and concise description of what the bug is. max memory = 128394M total memory = 128394M free memory = 40585M 2021-11-03 23:42:39,277 WARN JvmPauseMonitor - JVM paused 71574ms No GCs detected max memory = 127613M total memory = 127613M free memory = 39892M 2021-11-03 23:45:03,025 WARN JvmPauseMonitor - JVM paused 72740ms No GCs detected max memory = 127613M total memory = 127613M free memory = 38611M 2021-11-03 23:47:23,373 WARN JvmPauseMonitor - JVM paused 75341ms No GCs detected max memory = 127613M total memory = 127613M free memory = 39096M 2021-11-03 23:49:47,031 WARN JvmPauseMonitor - JVM paused 73651ms No GCs detected max memory = 127613M total memory = 127613M free memory = 38156M 2021-11-03 23:49:47,550 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_relative_txt) returned OK in 76713 ms (>=10000 ms) 2021-11-03 23:49:47,582 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_relative_txt) returned OK in 76721 ms (>=10000 ms) 2021-11-03 23:51:56,966 WARN JvmPauseMonitor - JVM paused 61929ms No GCs detected max memory = 127613M total memory = 127613M free memory = 37226M 2021-11-03 23:51:58,545 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_relative_txt) returned OK in 64997 ms (>=10000 ms) 2021-11-03 23:51:58,602 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_relative_txt) returned OK in 65121 ms (>=10000 ms) 80 2021-11-03 23:54:17,065 WARN JvmPauseMonitor - JVM paused 76092ms 79 No GCs detected 78 max memory = 127613M total memory = 127613M free memory = 38174M 77 2021-11-03 23:57:22,692 WARN JvmPauseMonitor - JVM paused 117619ms 76 No GCs detected 75 max memory = 127613M total memory = 127613M free memory = 36713M 74 2021-11-04 00:00:58,209 WARN JvmPauseMonitor - JVM paused 149508ms 73 No GCs detected 72 max memory = 127613M total memory = 127613M free memory = 36037M 71 2021-11-04 00:01:00,638 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/lidar_point_bev) returned OK in 152506 ms (>=10000 ms) 70 2021-11-04 00:01:00,687 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/lidar_point_bev) returned OK in 152550 ms (>=10000 ms) 69 2021-11-04 00:03:13,582 WARN JvmPauseMonitor - JVM paused 75367ms 68 No GCs detected 67 max memory = 127613M total memory = 127613M free memory = 34945M 66 2021-11-04 00:05:21,757 WARN JvmPauseMonitor - JVM paused 68169ms 65 No GCs detected 64 max memory = 127613M total memory = 127613M free memory = 34490M 63 2021-11-04 00:07:42,312 WARN JvmPauseMonitor - JVM paused 79549ms 62 No GCs detected 61 max memory = 127613M total memory = 127613M free memory = 34791M 60 2021-11-04 00:11:25,529 WARN JvmPauseMonitor - JVM paused 165211ms 59 No GCs detected 58 max memory = 127613M total memory = 127613M free memory = 33878M 57 2021-11-04 00:13:29,852 WARN UnderFileSystemWithLogging - GetFileLocations(path=s3://alluxio/1618363628.204169035.bin, options=FileLocationOptions{offset=0}) returned OK in 66567 ms (>=10000 ms) 56 2021-11-04 00:13:29,858 WARN JvmPauseMonitor - JVM paused 66324ms 55 No GCs detected 54 max memory = 127613M total memory = 127613M free memory = 35059M 53 2021-11-04 00:17:12,707 WARN JvmPauseMonitor - JVM paused 161842ms 52 No GCs detected 51 max memory = 127613M total memory = 127613M free memory = 34177M 50 2021-11-04 00:19:31,281 WARN UnderFileSystemWithLogging - GetFileLocations(path=s3://alluxio/1618363807.804277897.txt, options=FileLocationOptions{offset=0}) returned OK in 80235 ms (>=10000 ms) 49 2021-11-04 00:19:31,286 WARN JvmPauseMonitor - JVM paused 79572ms 48 No GCs detected 47 max memory = 127613M total memory = 127613M free memory = 33391M 46 2021-11-04 00:23:03,666 WARN JvmPauseMonitor - JVM paused 154374ms 45 No GCs detected 44 max memory = 127613M total memory = 127613M free memory = 32345M 43 2021-11-04 00:25:15,177 WARN JvmPauseMonitor - JVM paused 79502ms 42 No GCs detected 41 max memory = 127613M total memory = 127613M free memory = 33529M 40 2021-11-04 00:28:48,759 WARN JvmPauseMonitor - JVM paused 159576ms 39 No GCs detected 38 max memory = 127613M total memory = 127613M free memory = 32483M 37 2021-11-04 00:31:05,030 WARN JvmPauseMonitor - JVM paused 82265ms 36 No GCs detected 35 max memory = 127613M total memory = 127613M free memory = 31996M 34 2021-11-04 00:33:39,462 WARN JvmPauseMonitor - JVM paused 100427ms 33 No GCs detected 32 max memory = 127613M total memory = 127613M free memory = 33012M 31 2021-11-04 00:35:42,937 WARN JvmPauseMonitor - JVM paused 67469ms 30 No GCs detected 29 max memory = 127613M total memory = 127613M free memory = 32289M 28 2021-11-04 00:35:44,083 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_relative_txt) returned OK in 71331 ms (>=10000 ms) 27 2021-11-04 00:35:44,241 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_relative_txt) returned OK in 71317 ms (>=10000 ms) 26 2021-11-04 00:37:48,030 WARN JvmPauseMonitor - JVM paused 70086ms 25 No GCs detected 24 max memory = 127613M total memory = 127613M free memory = 31398M 23 2021-11-04 00:37:51,289 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_dense) returned OK in 73860 ms (>=10000 ms) 22 2021-11-04 00:37:51,296 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/ego_tf_dense) returned OK in 73866 ms (>=10000 ms) 21 2021-11-04 00:39:49,528 WARN JvmPauseMonitor - JVM paused 68493ms 20 No GCs detected 19 max memory = 127613M total memory = 127613M free memory = 30788M 18 2021-11-04 00:39:49,709 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/road_labels_width) returned OK in 69612 ms (>=10000 ms) 17 2021-11-04 00:39:49,772 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/road_labels_width) returned OK in 69648 ms (>=10000 ms) 16 2021-11-04 00:41:49,309 WARN JvmPauseMonitor - JVM paused 67774ms 15 No GCs detected 14 max memory = 127613M total memory = 127613M free memory = 32252M 13 2021-11-04 00:41:51,028 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/future_mask) returned OK in 70082 ms (>=10000 ms) 12 2021-11-04 00:41:51,180 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/future_mask) returned OK in 70213 ms (>=10000 ms) 11 2021-11-04 00:43:46,683 WARN JvmPauseMonitor - JVM paused 64368ms 10 No GCs detected 9 max memory = 127613M total memory = 127613M free memory = 31545M 8 2021-11-04 00:47:22,952 WARN JvmPauseMonitor - JVM paused 164264ms 7 No GCs detected 6 max memory = 127613M total memory = 127613M free memory = 30935M 5 2021-11-04 00:47:23,660 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/road_intersections) returned OK in 165334 ms (>=10000 ms) 4 2021-11-04 00:47:23,718 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/road_intersections) returned OK in 165385 ms (>=10000 ms) 3 2021-11-04 00:49:38,555 WARN JvmPauseMonitor - JVM paused 84598ms 2 No GCs detected 1 max memory = 127613M total memory = 127613M free memory = 30889M To Reproduce Steps to reproduce the behavior (as minimally and precisely as possible) alluxio.master.mount.table.root.option.alluxio.underfs.s3.threads.max: 320 alluxio.master.mount.table.root.option.alluxio.underfs.s3.upload.threads.max: 120 alluxio.master.mount.table.root.option.alluxio.underfs.s3.admin.threads.max: 100 alluxio.master.mount.table.root.option.alluxio.underfs.s3.socket.timeout: 500sec alluxio.master.mount.table.root.option.alluxio.underfs.s3.request.timeout: 5min alluxio.master.mount.table.root.option.alluxio.underfs.object.store.service.threads: 120 alluxio.master.mount.table.root.option.alluxio.underfs.s3.disable.dns.buckets: true alluxio.master.mount.table.root.option.alluxio.underfs.s3.inherit.acl: false alluxio.master.mount.table.root.option.alluxio.underfs.s3.default.mode: "0777" alluxio.worker.block.annotator.class: alluxio.worker.block.annotator.LRUAnnotator alluxio.worker.tieredstore.free.ahead.bytes: 20GB alluxio.user.file.delete.unchecked: true alluxio.master.journal.log.size.bytes.max: 500MB alluxio.master.periodic.block.integrity.check.repair: true alluxio.master.ufs.path.cache.capacity: "50000000" alluxio.master.ufs.path.cache.threads: 160 alluxio.master.metadata.sync.concurrency.level: 32 alluxio.master.metadata.sync.executor.pool.size: 32 alluxio.master.metadata.sync.ufs.prefetch.pool.size: 128 alluxio.master.rpc.executor.max.pool.size: 10240 alluxio.master.rpc.executor.core.pool.size: 128 alluxio.worker.allocator.class: alluxio.worker.block.allocator.MaxFreeAllocator alluxio.worker.network.reader.buffer.size: 32MB alluxio.worker.file.buffer.size: 4000MB alluxio.worker.block.master.client.pool.size: 1024 alluxio.user.block.worker.client.pool.min: 512 alluxio.user.file.writetype.default: ASYNC_THROUGH alluxio.user.block.size.bytes.default: 128MB alluxio.user.streaming.reader.chunk.size.bytes: 128MB alluxio.user.local.reader.chunk.size.bytes: 128MB alluxio.user.metrics.collection.enabled: true alluxio.user.update.file.accesstime.disabled: "true" alluxio.user.file.passive.cache.enabled: false alluxio.user.block.avoid.eviction.policy.reserved.size.bytes: 100GB alluxio.user.block.master.client.pool.gc.threshold: 2day alluxio.user.file.master.client.threads: 1024 alluxio.worker.data.server.domain.socket.as.uuid: true alluxio.user.file.create.ttl: 72h alluxio.user.file.create.ttl.action: FREE alluxio.user.block.master.client.threads: 1024 alluxio.user.file.readtype.default: CACHE alluxio.user.metadata.cache.enabled: "true" alluxio.user.metadata.cache.expiration.time: 24h alluxio.user.metadata.cache.max.size: "10000000" alluxio.user.direct.memory.io.enabled: "true" alluxio.user.worker.list.refresh.interval: 10min alluxio.user.logging.threshold: 1000ms alluxio.master.persistence.blacklist: temp alluxio.web.ui.enabled: "true" alluxio.job.worker.threadpool.size: 164 alluxio.user.file.replication.durable: 2 alluxio.user.network.rpc.flowcontrol.window: 10MB alluxio.user.network.rpc.max.connections: 16 alluxio.user.network.rpc.keepalive.timeout: 180sec alluxio.worker.management.tier.promote.quota.percent: 80 alluxio.worker.management.tier.align.reserved.bytes: 30GB alluxio.master.lost.worker.file.detection.interval: 10sec alluxio.user.client.cache.enabled: "true" alluxio.user.client.cache.store.type: LOCAL alluxio.user.client.cache.dir: /opt/mnt/ssd1/alluxio-cache alluxio.user.client.cache.page.size: 8MB alluxio.user.client.cache.size: 2000GB alluxio.web.threads: 20 alluxio.user.ufs.block.read.location.policy: alluxio.client.block.policy.DeterministicHashPolicy alluxio.user.streaming.writer.close.timeout: 60m alluxio.user.streaming.data.timeout: 60s alluxio.user.block.master.client.pool.gc.threshold: 1h alluxio.user.block.master.client.pool.size.max: 1024 alluxio.user.file.master.client.pool.size.max: 1024 alluxio.user.block.worker.client.pool.max: 10240 alluxio.worker.jvm.monitor.enabled: "true" alluxio.master.jvm.monitor.enabled: "true" Expected behavior A clear and concise description of what you expected to happen. We did not expect free memory getting reduced. Urgency Describe the impact and urgency of the bug.

Are you planning to fix it Please indicate if you are already working on a PR.

Additional context Add any other context about the problem here.

apc999 commented 2 years ago

@AlexChenSkyBlue it looks like two issues here:

1 From the logs "2021-11-04 00:33:39,462 WARN JvmPauseMonitor - JVM paused 100427ms", the GC is taking a lot of time. Check https://docs.alluxio.io/os/user/stable/en/operation/Performance-Tuning.html?q=GC#common-performance-issues for printing out more useful GC debugging information 2 From the log message "2021-11-04 00:39:49,709 WARN UnderFileSystemWithLogging - ListStatus(path=s3://alluxio/road_labels_width) returned OK in 69612 ms (>=10000 ms)" , certain listing operation is taking more than a minute. Not sure if there is a huge amount of objects returned

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.