containerd / rust-extensions

Rust crates to extend containerd
https://containerd.io
Apache License 2.0
184 stars, 73 forks

Change FIFO I/O to a pipe in the shim, as the Go shim does, to resolve the raw fd problem. #276

Open jokemanfire opened 5 months ago

jokemanfire commented 5 months ago

Related: I raised this question with containerd, but it looks like containerd will not change. So I will open a PR to change the FIFO to a pipe. I have finished the code; after some CI testing, I will submit the PR.

jokemanfire commented 3 months ago

I found two more problems when using the FIFO directly.

  1. After ctr run -d busybox:latest test, the container's status goes straight to stopping, but with the Go shim it does not.
  2. When the containerd service is stopped, all rshim I/O breaks, but the Go shim's I/O does not.

Here is a way to reproduce the error.

1. Build an image from a Dockerfile like this:

FROM busybox:latest

COPY test.sh /

ENTRYPOINT ["sh","/test.sh"]

test.sh is shown below:

while true; do 
    sleep 3
    echo "hello"
    result=$?
    if [ $result -ne 0 ]; then
        date >> /log.txt
        echo "echo failed. Result: $result" >> /log.txt
    fi
done

Run docker build to get the image, then import it with ctr.

2. Run a container from that image using rshim.

3. To trigger the error, stop the containerd service. You can then see the error message inside the container. A container run with the Go shim is not affected.

So I think using a pipe in the shim is really necessary. PR #278, which I have tested, resolves this problem.
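One mechanism that would be consistent with the behaviour described above (the thread does not confirm it as the actual cause) is that a writer on a FIFO gets EPIPE as soon as the last reader closes, unless some process still holds the FIFO open for reading. The Python sketch below is only an illustration under that assumption. The function name write_after_reader_exit and the file name stdout.fifo are made up for the example, the "reader" stands in for containerd and the "writer" for the container's stdout, and opening a FIFO with O_RDWR relies on Linux behaviour.

```python
import os
import tempfile


def write_after_reader_exit(keep_fifo_open: bool) -> str:
    """Write to a FIFO after its reader has closed and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stdout.fifo")  # hypothetical FIFO name
        os.mkfifo(path)

        # Stand-in for containerd: the reading side of the FIFO.
        # O_NONBLOCK lets open() return even though no writer exists yet.
        reader = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        # Stand-in for the container process writing its stdout.
        writer = os.open(path, os.O_WRONLY)
        # Optional extra descriptor that keeps a read side alive (Linux).
        holder = os.open(path, os.O_RDWR) if keep_fifo_open else None

        os.close(reader)  # models "the containerd service is stopped"
        try:
            os.write(writer, b"hello\n")
            return "ok"
        except BrokenPipeError:
            # Python ignores SIGPIPE, so the failure shows up as EPIPE.
            return "EPIPE"
        finally:
            os.close(writer)
            if holder is not None:
                os.close(holder)
```

Calling write_after_reader_exit(False) models a writer attached straight to a FIFO whose only reader has gone away, and the write fails. Calling write_after_reader_exit(True) models a setup where something else keeps the FIFO open, so the write lands in the kernel buffer instead of failing.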

Friendly ping, @fuweid @mxpv @Burning1020. Looking forward to your reply.

jokemanfire commented 1 month ago

The tokio 1.40 pipe resolves the pipe problem perfectly. Friendly ping, @fuweid @mxpv @Burning1020.

fuweid commented 1 month ago

Hi @jokemanfire, would you please file a pull request to fix this? Thanks.

jokemanfire commented 1 month ago

@fuweid Please take a look at #278.

zhaodiaoer commented 1 week ago

> The tokio 1.40 pipe resolves the pipe problem perfectly. Friendly ping, @fuweid @mxpv @Burning1020.

Hi @jokemanfire, can you give more detail about why the tokio 1.40 pipe resolves the pipe problem perfectly?

I have also run into the problem you found (when the containerd service is stopped, all rshim I/O breaks, but the Go shim's I/O does not). I also found another problem: the stdout stream of a container process started by the Rust shim is not flushed in real time. It is flushed about one page at a time, followed by a long delay, rather than line by line. I don't know whether this is related to using the FIFO directly as the process's stdout.

I am following this issue; please post any updates. Thanks!

jokemanfire commented 1 week ago

> The stdout stream of a container process started by the Rust shim is not flushed in real time. It is flushed about one page at a time, followed by a long delay, rather than line by line. I don't know whether this is related to using the FIFO directly as the process's stdout.

I haven't run into this problem. Is there a way to reproduce it? Using the FIFO directly does cause some problems, which are explained in https://fuweid.com/post/2022-embedshim-kernel-is-my-sidecar/ . Thanks @fuweid. It contains this description:

"embedshim also uses a relay to handle standard input, but it hands a named pipe opened in read-write mode directly to the container's standard output, which reduces copying of the standard output. The embedshim plugin is part of the containerD process, so once containerD restarts, the input side of the container process receives a SIGPIPE error. Personally, I think this case is acceptable. In interactive mode, the user will notice that the container engine is down. Most production scenarios use headless, non-interactive mode, where the input side of the container process is /dev/null and the state of standard output is persisted by the named pipe, so a containerD outage does not cause a SIGPIPE error on the container's output side."

I want to change the FIFO to a pipe because I think some of these problems are unacceptable in the Rust shim. I also want to change 'pipe_os' to 'tokio_pipe', because under highly concurrent I/O the current async trait implementation can leave the spawned tokio_copy tasks lingering. (I think this is caused by the raw_fd, and there is a problem with how the async trait is implemented.) As a result, the Rust shim cannot be deleted successfully. If you have steps to reproduce your problem, I would be happy to check whether it is caused by the FIFO I/O.

zhaodiaoer commented 1 week ago

> The stdout stream of a container process started by the Rust shim is not flushed in real time. It is flushed about one page at a time, followed by a long delay, rather than line by line. I don't know whether this is related to using the FIFO directly as the process's stdout.

> I haven't run into this problem. Is there a way to reproduce it?

I didn't do anything special before I ran into this problem. I have a program that logs at high frequency, and when I follow its logs with crictl logs -f xxx, I see very long delays between intermittent bursts of output. After some investigation, I found that the log file produced by containerd-cri is also written intermittently. My guess is that something abnormal comes from the new way of using the FIFO or from the Rust tokio runtime.

Simple diagram:

Go shim:   |fifo reader| <-- fifo --> |io copier| <-- pipe --> |container process|
Rust shim: |fifo reader| <-- fifo --> |container process|

The FIFO and the FIFO reader both come from containerd-cri and are the same in both cases, so I guess the problem comes from the second half.
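The Go shim layout in the diagram can be sketched as follows. This is a simplified illustration, not the code of either shim: run_with_copier and write_all are hypothetical names, the sink descriptor stands in for the FIFO that containerd reads, and Python is used only to keep the sketch short. What it shows is that the container process writes to an anonymous pipe owned by the copier, so a broken sink surfaces as an error in the copier rather than in the container process.

```python
import os
import subprocess
import sys


def write_all(fd, data):
    """Write every byte of data to fd, retrying after short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def run_with_copier(argv, sink_fd):
    """Run argv with stdout on an anonymous pipe and relay it to sink_fd.

    Returns (exit_code, bytes_relayed).
    """
    child = subprocess.Popen(argv, stdout=subprocess.PIPE)
    source_fd = child.stdout.fileno()
    relayed = 0
    sink_ok = True
    while True:
        chunk = os.read(source_fd, 4096)
        if not chunk:  # EOF: the child closed its end of the pipe
            break
        if sink_ok:
            try:
                write_all(sink_fd, chunk)
                relayed += len(chunk)
            except OSError:
                # The sink broke (for example EPIPE). Keep draining so the
                # child never sees the failure on its own stdout.
                sink_ok = False
    child.stdout.close()
    return child.wait(), relayed


if __name__ == "__main__":
    # Demo: relay a child's output into a second pipe and read it back.
    read_end, write_end = os.pipe()
    code, count = run_with_copier(
        [sys.executable, "-c", "print('hello')"], write_end
    )
    os.close(write_end)
    print(code, count, os.read(read_end, 4096))
    os.close(read_end)
```

The cost of this design is one extra copy of every byte of output, which is the trade-off the Rust shim layout in the diagram avoids by attaching the FIFO to the container process directly.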

zhaodiaoer commented 1 week ago

> The stdout stream of a container process started by the Rust shim is not flushed in real time. It is flushed about one page at a time, followed by a long delay, rather than line by line. I don't know whether this is related to using the FIFO directly as the process's stdout.

I think maybe I've found the cause. I'll try to file a PR about it later.

analytically commented 1 week ago

Seeing level=error msg="copy io failed Input/output error (os error 5)" when running this, could this be related?

jokemanfire commented 1 week ago

> copy io failed Input/output

Have you applied the patch from #278? If so, could you provide a more detailed description or some logs, so I can check whether my patch is the problem? PS: binary I/O is not implemented, so nerdctl -t -d will fail.

analytically commented 1 week ago

Not patched. I will patch and try again.

analytically commented 1 week ago

Patched, same error, so not fixed with #278

jokemanfire commented 1 week ago

> Patched, same error, so not fixed with #278

Could you provide the debug log? It may be caused by copy_console (tty), but without more information I cannot tell.

analytically commented 1 week ago

[Image]

This is what I could see already, any idea? I'll look at it more closely on Monday

jokemanfire commented 6 days ago

> [Image]
>
> This is what I could see already, any idea? I'll look at it more closely on Monday

I think this may be printed when the read or write side is closed suddenly during spawn_copy. You can check; it should occur in tokio_copy.
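For reference, "Input/output error (os error 5)" is EIO. One well-known way to get EIO from a plain read on Linux, in line with the copy_console (tty) guess earlier in the thread, is reading the master side of a pseudo-terminal after the slave side has been closed. Whether this is what happens in the reported case is not established in the thread. The sketch below only demonstrates that kernel behaviour, assumes Linux, and uses a hypothetical function name.

```python
import os
import pty


def read_master_after_slave_close() -> int:
    """Return the errno raised by reading a pty master whose slave is closed.

    Returns 0 if the read does not fail (for example on a platform that
    reports end-of-file instead of an error).
    """
    master, slave = pty.openpty()
    os.close(slave)  # the terminal side goes away suddenly
    try:
        os.read(master, 1024)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(master)
```

If the error in the log comes from a path like this, a copy loop would typically treat EIO on the console as the end of the stream rather than as a failure, but that is a design choice for the shim authors and not something the thread settles.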