haswelliris commented 7 months ago

Our judge machine runs nothing but seele, yet we have run into several unexpected problems:

Judge machine environment

Dual-socket EPYC 9654 (96 cores per CPU): 192 cores / 384 threads in total, 2.24 TB of memory
Ubuntu 22.04 with Kubernetes

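1. WALL_TIME limit exceeded even though user and kernel time are both very short

Problem description

Take a simple C++ hello world as an example, with three subtasks: upload the file, compile, and run.

# Result returned by the run step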
{'status': 'FAILED',
 'report': {
  'run_at': '2024-04-12T02:39:41.860021413Z', 
  'time_elapsed_ms': 58616, 
  'type': 'run_container', 
  'status': 'WALL_TIME_LIMIT_EXCEEDED', 
  'exit_code': 0, 
  'wall_time_ms': 12045, 
  'cpu_user_time_ms': 65, 
  'cpu_kernel_time_ms': 104, 
  'memory_usage_kib': 21156}, 
'embeds': {'cis_stdout': 'hello,world2\n', 'cis_stderr': ''}}

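As you can see, the program printed hello,world2 correctly, and cpu_user plus cpu_kernel time is under 200 ms, yet wall_time is a full 12 s. On top of that, time_elapsed_ms reached 58 s, and the task was forcibly terminated for exceeding the time limit.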
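To make the gap concrete, here is a small sketch that splits the report above into CPU time, wall time not spent on the CPU, and the part of time_elapsed_ms that lies outside wall_time_ms. The field names and values are taken from the report; the helper name `breakdown` is hypothetical, and reading the remainder as time spent outside the container (queueing, container setup and teardown) is an assumption, not something the report states.

```python
# Report fields copied from the run step's result above.
report = {
    "time_elapsed_ms": 58616,
    "wall_time_ms": 12045,
    "cpu_user_time_ms": 65,
    "cpu_kernel_time_ms": 104,
}


def breakdown(report):
    """Split the elapsed time into CPU time, idle wall time, and the remainder."""
    cpu_ms = report["cpu_user_time_ms"] + report["cpu_kernel_time_ms"]
    wall_ms = report["wall_time_ms"]
    elapsed_ms = report["time_elapsed_ms"]
    return {
        "cpu_ms": cpu_ms,
        # Wall time during which the process was not running on a CPU.
        "wall_not_on_cpu_ms": wall_ms - cpu_ms,
        # Elapsed time not covered by the container's wall time (assumed overhead).
        "elapsed_outside_wall_ms": elapsed_ms - wall_ms,
        "cpu_share_of_wall": cpu_ms / wall_ms,
    }


result = breakdown(report)
print(result)
```

With these numbers the CPU accounts for well under 2% of the wall time, which is why the wall-time limit rather than the workload itself decides the verdict.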

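Reproduction conditions

Concurrency above 400 (roughly when it exceeds the number of CPU threads)
The problem occurs with or without a compile subtask (Python-style upload-then-run submissions hit it too), and also with the compile cache enabled

Discussion

At a concurrency of 200, our judge machine completes this task at 30 tasks/s, but for such a simple hello world we would expect a judging rate of 200 tasks/s or more. The cause is the excessive wall_time.
My impression is that runj itself becomes the bottleneck under high concurrency: both starting and exiting containers are slow.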
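As a rough cross-check of these throughput figures, Little's law (tasks in flight = throughput × mean time per task) gives the average time each task spends in the system. This sketch assumes all 200 concurrency slots stay occupied; the function name `mean_latency_s` is made up for illustration.

```python
def mean_latency_s(concurrency, throughput_per_s):
    """Little's law rearranged: mean time in system = tasks in flight / throughput."""
    return concurrency / throughput_per_s


observed = mean_latency_s(200, 30)   # measured: 30 tasks/s at concurrency 200
expected = mean_latency_s(200, 200)  # hoped for: 200 tasks/s at concurrency 200

print(f"observed: {observed:.2f} s per task, expected: {expected:.2f} s per task")
```

Under that assumption each task currently spends several seconds in the system, against a budget of about one second for the target rate, even though the program itself needs less than 200 ms of CPU time.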

2. runj error: cannot start an already running container

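Problem description

The same C++ hello world as in problem 1, with the steps upload file, compile, ls the compile output, run. runj fails with: Error initializing the container process: cannot start an already running container

# Result returned by the run step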
{'id': 'F3oBscsWNXF7M6EQ', 'type': 'ERROR', 
'error': 'Error executing the submission: Execution got following internal error(s):\nThe runj process failed: time="2024-04-12T03:04:23Z" level=fatal msg="Error executing the container" error="Error initializing the container process: cannot start an already running container"\n'}
{'id': 'R7W7UmXlrBmQeWRr', 'type': 'ERROR',
'error': 'Error executing the submission: Execution got following internal error(s):\nThe runj process failed: time="2024-04-12T03:04:23Z" level=fatal msg="Error executing the container" error="Error initializing the container process: cannot start an already running container"\n'}

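Reproduction conditions

Occurs with roughly 5% probability at a concurrency of 600

3. Files saved by the compile step cannot be executed in the run step

Problem description

The same C++ hello world as in problem 1, with four subtasks: upload file -> compile -> ls the compile output -> run. Both the compile subtask and the ls subtask use the action "seele/run-judge/compile@1":
Compile: sources: solution.cpp saves: solution.cpp, solution command: g++ solution.cpp -o solution
ls the compile output: sources: solution.cpp, solution saves: solution.cpp, solution command: ls -la
Note in particular that the ls subtask's sources and saves are identical.

# Result returned by the run step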
{'status': 'FAILED', 
'report': {
  'run_at': '2024-04-12T05:33:22.452141283Z', 
  'time_elapsed_ms': 2343, 
  'type': 'run_container', 
  'status': 'RUNTIME_ERROR', 
  'exit_code': 1, 'wall_time_ms': 2190, 'cpu_user_time_ms': 16, 'cpu_kernel_time_ms': 50, 'memory_usage_kib': 21408}, 
  'embeds': {
    'cis_stderr': 'exec ./solution: exec format error\n', 
    'cis_stdout': ''
  }
}

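Reproduction conditions

Occurs 100% of the time at a concurrency of 200
At a concurrency of 1000 it turns into problem 1 (timeout) and problem 2 (cannot start) instead
However, if the saves of the ls step are changed to only solution (previously solution and solution.cpp), the exec format error no longer occurs

Discussion

Problems 1 and 2 above are probably caused by runj's dependence on system resources, which keeps concurrency from scaling. This third problem, however, I really do not understand.
It also stops appearing at high concurrency and simply turns into timeouts, which is baffling.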
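"exec format error" is the message for ENOEXEC, which the kernel returns when it does not recognize the file as an executable format. One possible cause, and only an assumption here, is that the saved `solution` file is empty or truncated when the run step picks it up. A minimal diagnostic sketch is to check whether the file begins with the 4-byte ELF magic number; the helper name `looks_like_elf` and the temporary demo files are hypothetical and not part of seele.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF binary


def looks_like_elf(path):
    """Return True if the file starts with the ELF magic number."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


# Demo on two throwaway files: one with an ELF-like header, one empty.
with tempfile.TemporaryDirectory() as workdir:
    good = os.path.join(workdir, "solution")
    empty = os.path.join(workdir, "solution_empty")
    with open(good, "wb") as f:
        f.write(ELF_MAGIC + bytes(12))  # header stub, not a runnable binary
    open(empty, "wb").close()
    checks = {"good": looks_like_elf(good), "empty": looks_like_elf(empty)}

print(checks)
```

Running the same check on the `solution` file that the run step receives would show whether the binary was already damaged before execution.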

darkyzhou commented 7 months ago

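Could you provide the Submission task descriptions for the three problems?
For the first and second problems, seele can export OpenTelemetry tracing data. Could you please set up Grafana and Grafana Tempo to collect the tracing data that seele produces while running judge tasks? That would let us inspect the execution and timing of each submission (especially the TLE ones).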

haswelliris commented 7 months ago

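About tracing and metrics: there is only one collector_url setting, and after pointing it at Tempo I no longer seem to be able to query any metrics.
While collecting traces with Tempo, how can I also get metrics like the ones shown in https://github.com/darkyzhou/seele/blob/main/docs/public/grafana.png ?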

haswelliris commented 7 months ago

Examples for the three problems (submitted here using plain). The original code is:

#include<iostream>
using namespace std;
int main() {
    cout<<"hello,world2"<<endl;
    return 0;
}

First example

Three subtasks: upload the file, compile, run


steps:
  prepare:
    action: "seele/add-file@1"
    files:
      - path: "solution.cpp"
        base64: "I2luY2x1ZGU8aW9zdHJlYW0+CnVzaW5nIG5hbWVzcGFjZSBzdGQ7CmludCBtYWluKCkgewogICAgaW50IGk9MDsKICAgIHdoaWxlKGk8MTAwMDAwMDAwKSB7CiAgICAgICAgaSsrOwogICAgfQogICAgY291dDw8ImhlbGxvLHdvcmxkMiI8PGVuZGw7CiAgICByZXR1cm4gMDsKfQ"

  compile:
    action: "seele/run-judge/compile@1"
    image: "gcc:11-bullseye"
    command: "g++ solution.cpp -o solution"
    container_uid: 65534
    container_gid: 65534
    sources: ["solution.cpp",]
    saves: ["solution.cpp","solution",]
    paths: []
    fd:
      stdout: "compile_stdout.txt"
      stderr: "compile_stderr.txt"
    report:
      embeds:
        - path: "compile_stdout.txt"
          field: compile_stdout
          truncate_kib: 16384
        - path: "compile_stderr.txt"
          field: compile_stderr
          truncate_kib: 16384
    cache:
      enabled: true
      extra: ["cache1",]

  run:
    action: "seele/run-judge/run@1"
    image: "gcc:11-bullseye"
    command: "./solution"
    container_uid: 65534
    container_gid: 65534
    paths: []
    files: ["solution.cpp","solution",]
    fd:
      stdout: "cis_stdout.txt"
      stderr: "cis_stderr.txt"
    report:
      embeds:
        - path: "cis_stdout.txt"
          field: cis_stdout
          truncate_kib: 4096
        - path: "cis_stderr.txt"
          field: cis_stderr
          truncate_kib: 4096
    limits:
      time_ms: 10000
      memory_kib: 262144
      pids_count: 32
      fsize_kib: 65536
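To see exactly what source the prepare step uploads, the `base64` value can be decoded locally. The value as posted has no trailing `=` padding, which Python's `base64.b64decode` rejects, so this sketch restores the padding first; the helper name `decode_unpadded` is made up for illustration.

```python
import base64

# The base64 value from the prepare step above, copied verbatim.
payload = (
    "I2luY2x1ZGU8aW9zdHJlYW0+CnVzaW5nIG5hbWVzcGFjZSBzdGQ7CmludCBtYWluKCkgewog"
    "ICAgaW50IGk9MDsKICAgIHdoaWxlKGk8MTAwMDAwMDAwKSB7CiAgICAgICAgaSsrOwogICAg"
    "fQogICAgY291dDw8ImhlbGxvLHdvcmxkMiI8PGVuZGw7CiAgICByZXR1cm4gMDsKfQ"
)


def decode_unpadded(b64_text):
    """Decode base64 text whose trailing '=' padding has been stripped."""
    padded = b64_text + "=" * (-len(b64_text) % 4)
    return base64.b64decode(padded).decode("utf-8")


source = decode_unpadded(payload)
print(source)
```

The decoded program is not identical to the listing above: it counts `i` up to 100000000 in a loop before printing hello,world2, so each run does some CPU work of its own.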

Second and third examples

Four subtasks: upload the file, compile, run "ls -la" on the compile output, run


steps:
  prepare:
    action: "seele/add-file@1"
    files:
      - path: "solution.cpp"
        base64: "I2luY2x1ZGU8aW9zdHJlYW0+CnVzaW5nIG5hbWVzcGFjZSBzdGQ7CmludCBtYWluKCkgewogICAgaW50IGk9MDsKICAgIHdoaWxlKGk8MTAwMDAwMDAwKSB7CiAgICAgICAgaSsrOwogICAgfQogICAgY291dDw8ImhlbGxvLHdvcmxkMiI8PGVuZGw7CiAgICByZXR1cm4gMDsKfQ"

  compile:
    action: "seele/run-judge/compile@1"
    image: "gcc:11-bullseye"
    command: "g++ solution.cpp -o solution"
    container_uid: 65534
    container_gid: 65534
    sources: ["solution.cpp",]
    saves: ["solution.cpp","solution",]
    paths: []
    fd:
      stdout: "compile_stdout.txt"
      stderr: "compile_stderr.txt"
    report:
      embeds:
        - path: "compile_stdout.txt"
          field: compile_stdout
          truncate_kib: 16384
        - path: "compile_stderr.txt"
          field: compile_stderr
          truncate_kib: 16384
    cache:
      enabled: true
      extra: ["cache1",]

  compile2:
    action: "seele/run-judge/compile@1"
    image: "gcc:11-bullseye"
    command: "ls -la"
    container_uid: 65534
    container_gid: 65534
    sources: ["solution.cpp","solution",]
    saves: ["solution.cpp","solution",]
    paths: []
    fd:
      stdout: "compile_stdout.txt"
      stderr: "compile_stderr.txt"
    report:
      embeds:
        - path: "compile_stdout.txt"
          field: compile_stdout
          truncate_kib: 16384
        - path: "compile_stderr.txt"
          field: compile_stderr
          truncate_kib: 16384
    cache:
      enabled: true
      extra: ["cache2",]

  run:
    action: "seele/run-judge/run@1"
    image: "gcc:11-bullseye"
    command: "./solution"
    container_uid: 65534
    container_gid: 65534
    paths: []
    files: ["solution.cpp","solution",]
    fd:
      stdout: "cis_stdout.txt"
      stderr: "cis_stderr.txt"
    report:
      embeds:
        - path: "cis_stdout.txt"
          field: cis_stdout
          truncate_kib: 4096
        - path: "cis_stderr.txt"
          field: cis_stderr
          truncate_kib: 4096
    limits:
      time_ms: 10000
      memory_kib: 262144
      pids_count: 32
      fsize_kib: 65536

haswelliris commented 7 months ago

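Also, according to the tracing results, for the slow submissions (with the busy-loop code already removed) most of the time is spent at event: { "value": "Bound the runj container to cpu 338", "key": "message" }. I do not know whether waiting time is counted here, though. If concurrency is too high and waiting time is included, that would make sense.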

darkyzhou commented 7 months ago

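Tempo is for querying tracing data; metrics need a different solution. For example, use opentelemetry-collector to collect seele's metrics and export them to Prometheus, then query them with Grafana. Grafana's own metrics solution would also work. See: https://seele.darkyzhou.net/configurations/file#telemetry-%E9%85%8D%E7%BD%AE
The third problem may be related to the cache. Try disabling the cache in every step and run it again.
Judging from the tracing results, Bound the runj container to cpu 338 occurs at 4m19s, which means the submission only reached the front of the queue and obtained a usable CPU at that point, when runj actually started. Run container completed occurs at 4m43s, which means runj took about twenty seconds to finish running the container. It looks like the problem is that runj takes too long to run the container.

As I recall, runc, which runj is built on, can indeed run into performance problems under high concurrency, possibly related to https://github.com/opencontainers/runc/issues/3181.
For the Linux kernel, creating the namespaces a container needs (user namespaces in particular) at high concurrency carries a significant performance cost. You could instead try running 2 or 4 virtual machine instances on this machine, each running seele (for fairness, pin the CPUs used by each VM).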
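As an illustration of the pinning idea, the sketch below splits the machine's 384 logical CPUs into disjoint, equally sized sets, one per VM. The function name `partition_cpus` is hypothetical, and the contiguous ranges are an assumption: a real layout should follow the host's actual NUMA and SMT topology (for example as reported by `lscpu`), which this sketch does not inspect.

```python
def partition_cpus(total_cpus, instances):
    """Split logical CPU ids 0..total_cpus-1 into equal, disjoint contiguous sets."""
    if instances <= 0 or total_cpus % instances != 0:
        raise ValueError("total_cpus must be a positive multiple of instances")
    size = total_cpus // instances
    return [list(range(i * size, (i + 1) * size)) for i in range(instances)]


plans = partition_cpus(384, 4)
for index, cpus in enumerate(plans):
    print(f"vm{index}: cpus {cpus[0]}-{cpus[-1]} ({len(cpus)} logical CPUs)")
```

Each set can then be given to the hypervisor's CPU pinning settings so that the seele instances do not compete for the same cores.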
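I will also look into how runj behaves under high concurrency when I have time in the near future.

You can try seele 0.3.0; in that release runj was upgraded to the latest 1.1.12.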

haswelliris commented 7 months ago

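Thanks a lot for the suggestions. I am trying to have k8s bring up several Kata containers first and then run seele inside them. However, using Kata (or virtual machines) requires virtualization to be enabled on the host, and elastic scaling would require it on every other machine as well, which does not fit our current architecture well. That said, for kernel isolation I cannot think of anything better than virtual machines or Kata Containers.

To take the discussion further: if we use Kata anyway, could the underlying runj be replaced with Kata? Compared with the problems caused by setting up namespaces, Kata only has a fixed startup overhead on the order of seconds, and security is higher the rest of the time (an attacker would have to break through both the low-privilege restrictions and the virtualization boundary). Starting a judge thread would then become a k8s API call that launches a Kata container, so seele itself would need fewer privileges. The catch is that the whole k8s cluster must support Kata: in the cloud that most likely means dealing with nested virtualization, and on bare metal it raises the question of whether to enable virtualization.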

darkyzhou commented 7 months ago

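In fact, Seele can use runj to create containers on Kubernetes. You only need to translate the docker parameters mentioned in the documentation into the corresponding Pod settings and configure the CPU and memory allocation of the Pod that runs Seele.

Besides the cloud-native route, there is also a more traditional approach: use a virtualization platform to manage several virtual machines, and use Ansible to run Seele on each of them.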

darkyzhou / seele

High concurrency and performance issues #14
