ememos / GiantVM


TCP stack bypass with BPF to improve performance #19

Closed ParkHanbum closed 2 years ago

ParkHanbum commented 3 years ago

Although the TCP connection support is still incomplete, I think it is worthwhile to support TCP for general development environments and testing. When GiantVM communicates with machines connected through TCP, it seems that the whole TCP stack has to be traversed, and this is unnecessary overhead for GiantVM since the connection is not used for general-purpose networking.

Therefore, I think better performance can be expected if we bypass the TCP stack with BPF and handle the packets directly. What do you think?

ChoKyuWon commented 3 years ago

Well, BPF is essentially JIT-compiled bytecode, right?

I searched for some articles about TCP stack bypass with BPF and found this; if I understand correctly, they use BPF because they are in USERSPACE.

In my opinion, since we develop GiantVM in kernel space, BPF is not necessary. If we want a more lightweight TCP stack, we can just implement it in C, not BPF. I think a TCP stack written in C (and assembly) is much faster than one written in BPF.

I have no idea about using BPF in kernel space. I think BPF is for userspace, not the kernel.

However, developing a more lightweight TCP stack seems like a good idea, but I think we need to stabilize the current TCP stack first, and then optimize it after stabilization. The current implementation also does not use the Linux kernel's TCP stack; it has its own TCP implementation in arch/x86/kvm/ktcp.c and ktcp.h.

I think we can choose one of two solutions:

  1. Remove ktcp.c and just use the standard TCP implementation in the kernel.
    • This is good for stabilization, but not good for performance.
  2. Stabilize the current implementation.

I do not know which way is better.

solemnify commented 3 years ago

I don't know much about TCP over BPF, though. I agree with you that bypassing the TCP stack is crucial for overall performance. What do you think about RDMA over Ethernet (RoCE or iWARP)?

ParkHanbum commented 3 years ago

Actually, I don't know much about BPF either, so I can't explain the details of my proposal. Instead, I recommend searching for a technology called BPF/XDP. The important point is to have the TCP packets for DSM handled directly, bypassing the TCP stack, and I think BPF/XDP can be used for that.
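For illustration, the kind of XDP program meant here might look roughly like this (a hedged sketch, not GiantVM code; the DSM port number and the idea of classifying DSM traffic by TCP port are assumptions made purely for this example):

```c
/* Hypothetical sketch: an XDP program that recognizes DSM traffic in the
 * driver, before the kernel TCP stack runs. DSM_PORT is a made-up value. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define DSM_PORT 9999 /* hypothetical DSM port, illustration only */

SEC("xdp")
int xdp_dsm_filter(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	/* Bounds checks are mandatory: the verifier rejects the program
	 * without them. */
	struct ethhdr *eth = data;
	if ((void *)(eth + 1) > data_end ||
	    eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	struct iphdr *iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_TCP)
		return XDP_PASS;

	struct tcphdr *tcph = (void *)iph + iph->ihl * 4;
	if ((void *)(tcph + 1) > data_end)
		return XDP_PASS;

	if (tcph->dest == bpf_htons(DSM_PORT)) {
		/* A real bypass would redirect the frame here, e.g. to an
		 * AF_XDP socket, instead of letting it climb the whole TCP
		 * stack. This sketch only classifies the packet. */
		return XDP_PASS; /* placeholder for real bypass logic */
	}
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Such a program would typically be compiled with `clang -O2 -target bpf` and attached to an interface with `ip link set dev <if> xdp obj prog.o`.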

Taeung commented 3 years ago

Well, BPF is essentially JIT-compiled bytecode, right?

I searched for some articles about TCP stack bypass with BPF and found this; if I understand correctly, they use BPF because they are in USERSPACE.

In my opinion, since we develop GiantVM in kernel space, BPF is not necessary. If we want a more lightweight TCP stack, we can just implement it in C, not BPF. I think a TCP stack written in C (and assembly) is much faster than one written in BPF.

I have no idea about using BPF in kernel space. I think BPF is for userspace, not the kernel.

BPF is a technology that can insert new code into the kernel at runtime, and it can be used not only for user space but also for kernel-space functionality. For example, XDP, or Katran (a load-balancer BPF program in the kernel): https://github.com/facebookincubator/katran

However, developing a more lightweight TCP stack seems like a good idea, but I think we need to stabilize the current TCP stack first, and then optimize it after stabilization. The current implementation also does not use the Linux kernel's TCP stack; it has its own TCP implementation in arch/x86/kvm/ktcp.c and ktcp.h.

I think we can choose one of two solutions:

  1. Remove ktcp.c and just use the standard TCP implementation in the kernel.
    • This is good for stabilization, but not good for performance.
  2. Stabilize the current implementation.

I do not know which way is better.

ChoKyuWon commented 3 years ago

@Taeung Well, I read a paper, and it says this:

While it’s lucrative for userspace programs to get access to network devices and improve on some data copies by bypassing the kernel, there are some problems with that approach as well.

So even with BPF or XDP, the purpose is to inject a piece of code into the kernel on behalf of userspace. Yes, the kernel can also use BPF programs, but why should it?

Userspace-injected code needs a sandbox because it can be malicious; that is why they use the BPF verifier. But we write our code as a kernel component (not even a kernel module!), so BPF is not necessary. As far as I know, most BPF programs are also written in C and then compiled to BPF bytecode by LLVM. Why can't we use pure C and put it in the kernel as a native binary?

BPF is a kind of virtual machine (like WASM), so I think it is slower than a native binary, and I don't see why we would need it.

In my opinion, as @solemnify mentioned, we should focus on RoCE or iWARP; an RDMA protocol over common Ethernet is more efficient.

Taeung commented 3 years ago

@Taeung Well, I read a paper, and it says this:

While it’s lucrative for userspace programs to get access to network devices and improve on some data copies by bypassing the kernel, there are some problems with that approach as well.

So even with BPF or XDP, the purpose is to inject a piece of code into the kernel on behalf of userspace. Yes, the kernel can also use BPF programs, but why should it?

Userspace-injected code needs a sandbox because it can be malicious; that is why they use the BPF verifier. But we write our code as a kernel component (not even a kernel module!), so BPF is not necessary.

Thanks for sharing a great paper, @ChoKyuWon. I agree that BPF is ultimately for userspace, but that is only one of many BPF use cases. There are many BPF use cases (a BPF program can even overwrite the return value of a specific kernel function): for example, FUSE, packet filtering for various purposes, IR decoding, and so on. So I think we can use BPF as a customized kernel component. What do you think?

As far as I know, most BPF programs are also written in C and then compiled to BPF bytecode by LLVM. Why can't we use pure C and put it in the kernel as a native binary?

BPF is a kind of virtual machine (like WASM), so I think it is slower than a native binary, and I don't see why we would need it.

After verification of the inserted BPF bytecode, it can be translated directly into the host system's native machine code (JIT). So a BPF program is not slower than a native binary.

In my opinion, as @solemnify mentioned, we should focus on RoCE or iWARP; an RDMA protocol over common Ethernet is more efficient.

ChoKyuWon commented 3 years ago

@Taeung

There are many BPF use cases (a BPF program can even overwrite the return value of a specific kernel function): for example, FUSE, packet filtering for various purposes, IR decoding, and so on. So I think we can use BPF as a customized kernel component. What do you think?

I agree it is possible, but is it mandatory, or is it better than modifying the kernel? I agree that BPF is the simplest way to inject some code into the kernel, but I am not sure it is the best way to achieve our goal: supporting TCP in GiantVM. Why shouldn't we just modify the kernel code and the TCP stack? Why shouldn't we modify the ktcp.c code? Why shouldn't we build a new kernel component to bypass the current TCP stack? That is why I can't see the need for BPF here.

If you know of examples of BPF code in the kernel tree, or of kernel modules that use BPF, please let me know. I'll check them and maybe change my mind. I read your IR-decoding article; it also seems to build the decode() function in userspace and then inject the BPF code into the kernel via the ir-keytable tool.

After verification of the inserted BPF bytecode, it can be translated directly into the host system's native machine code (JIT). So a BPF program is not slower than a native binary.

Yes, JIT is fast, but it is still slower than a native binary because of the compilation time. Maybe AOT (ahead-of-time) compilation would run at the same speed as a native binary, but I'm not sure BPF AOT exists.

Taeung commented 3 years ago

@Taeung

There are many BPF use cases (a BPF program can even overwrite the return value of a specific kernel function): for example, FUSE, packet filtering for various purposes, IR decoding, and so on. So I think we can use BPF as a customized kernel component. What do you think?

I agree it is possible, but is it mandatory, or is it better than modifying the kernel? I agree that BPF is the simplest way to inject some code into the kernel, but I am not sure it is the best way to achieve our goal: supporting TCP in GiantVM.

Yep, that is enough if we simply use BPF to improve the performance of the current KVM TCP path. I also don't think the BPF program needs to support all of TCP in GiantVM.

Why shouldn't we just modify the kernel code and the TCP stack? Why shouldn't we modify the ktcp.c code? Why shouldn't we build a new kernel component to bypass the current TCP stack? That is why I can't see the need for BPF here.

I don't think your approach is wrong; using BPF and modifying kernel code each have their advantages and disadvantages. But if we can easily improve TCP performance using BPF/XDP, I think we should try it. Why shouldn't we try BPF/XDP? Is modifying the kernel code and the TCP stack really the only way?

And as you know, arch/x86/kvm/ktcp.c is simple, like a userspace socket program, and it eventually calls functions in net/socket.c. How, concretely, would you modify ktcp.c and the kernel code?
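For context, the style of in-kernel socket code under discussion looks roughly like this (a from-memory sketch, not the actual ktcp.c source; the function name ktcp_demo_send is made up for illustration, while sock_create_kern, kernel_connect, kernel_sendmsg, and sock_release are real kernel APIs):

```c
/* Sketch only: NOT the actual GiantVM ktcp.c code. It illustrates the
 * in-kernel socket pattern; every byte sent here still traverses the
 * full kernel TCP stack via net/socket.c. */
#include <linux/net.h>
#include <linux/in.h>
#include <linux/socket.h>
#include <net/net_namespace.h>
#include <net/sock.h>

static int ktcp_demo_send(__be32 daddr, u16 dport, void *buf, size_t len)
{
	struct socket *sock;
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(dport),
		.sin_addr   = { .s_addr = daddr },
	};
	struct kvec vec = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = { };
	int ret;

	/* Create a kernel-space TCP socket in the initial netns. */
	ret = sock_create_kern(&init_net, AF_INET, SOCK_STREAM,
			       IPPROTO_TCP, &sock);
	if (ret < 0)
		return ret;

	ret = kernel_connect(sock, (struct sockaddr *)&addr,
			     sizeof(addr), 0);
	if (ret < 0)
		goto out;

	/* This call goes down through net/socket.c and the whole TCP
	 * stack; any bypass would have to replace this path. */
	ret = kernel_sendmsg(sock, &msg, &vec, 1, len);
out:
	sock_release(sock);
	return ret;
}
```

Because this ultimately lands in net/socket.c, a bypass (whether BPF/XDP or native) would have to replace the path below kernel_sendmsg().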

If you know of examples of BPF code in the kernel tree, or of kernel modules that use BPF, please let me know. I'll check them and maybe change my mind.

I already gave you BPF examples. There are many BPF use cases that leverage existing kernel functions through BPF helpers, and a BPF program can interact with kernel code, even if in a limited way. For example: Kernel -> BPF: https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L2210-L2238 BPF -> Kernel: https://github.com/torvalds/linux/blob/master/net/core/filter.c#L2417

I read your IR-decoding article; it also seems to build the decode() function in userspace and then inject the BPF code into the kernel via the ir-keytable tool.

After verification of the inserted BPF bytecode, it can be translated directly into the host system's native machine code (JIT). So a BPF program is not slower than a native binary.

Yes, JIT is fast, but it is still slower than a native binary because of the compilation time. Maybe AOT (ahead-of-time) compilation would run at the same speed as a native binary, but I'm not sure BPF AOT exists.

Only at the beginning does it take a little time to translate the BPF bytecode into native code; after the translation, the code runs at the same speed as a native binary.

ChoKyuWon commented 3 years ago

@Taeung

I don't think your approach is wrong; using BPF and modifying kernel code each have their advantages and disadvantages. But if we can easily improve TCP performance using BPF/XDP, I think we should try it. Why shouldn't we try BPF/XDP? Is modifying the kernel code and the TCP stack really the only way?

And as you know, arch/x86/kvm/ktcp.c is simple, like a userspace socket program, and it eventually calls functions in net/socket.c. How, concretely, would you modify ktcp.c and the kernel code?

I agree that using BPF is the simplest way to improve the connection protocol, but I don't agree that it is the best way to improve performance.

For example, why can't ktcp.c be modified to read/write data on the NIC driver directly? That is what we want to do with BPF, and I think we can achieve the same goal by modifying ktcp.c.

Userspace does it with BPF because it cannot inject code into the kernel directly, but we can, so BPF is not needed.

In my first comment, I introduced this article. In that article, they write this code as BPF:

__section("sk_msg")
int bpf_tcpip_bypass(struct sk_msg_md *msg)
{
    struct sock_key key = {};

    sk_msg_extract4_key(msg, &key);
    msg_redirect_hash(msg, &sock_ops_map, &key, BPF_F_INGRESS);
    return SK_PASS;
}

char ____license[] __section("license") = "GPL";

So this program's purpose is to shortcut the data path: it redirects a message from one socket straight to the peer socket's queue without going down through the TCP stack. But we can write kernel-level code, so we do not need that complex mechanism; we can take the same shortcut with a direct function call in the kernel!

To summarize my opinion briefly: yes, we can use BPF, but we can do the exact same thing with native code, so why should we choose BPF? Even if the JIT cost is near zero, it is not zero. BPF cannot deliver better performance than native code, so I am curious why we should use it.

Taeung commented 3 years ago

OK, how about sending a PR with your approach? @ChoKyuWon

ChoKyuWon commented 3 years ago

@Taeung Okay. I'll write some example code and leave it here as a comment. If it looks good, we can put more effort into it.

However, I think improving TCP performance is not that important if GiantVM is running on RoCE. RoCE is a network protocol that runs over a common Ethernet NIC rather than requiring an RDMA-aware NIC.


So I think that if the current GiantVM supports RoCE, TCP support is not needed. What do you think about it?

Taeung commented 3 years ago

I'm not sure how much it would improve performance, so if you send the change, we can run a performance comparison.

And if the change is only for our project, I'm not sure it is the best way, because it would be just a temporary measure for GiantVM, and our project would become increasingly isolated on a specific kernel version (e.g. v4.18.20).

IMHO it is better to change the kernel code in a way that stays flexible and convenient for rebasing onto more recent kernel versions.

solemnify commented 3 years ago

Thanks for your lively discussion so far. For now, it is not easy to conclude which approach is better, because we don't have much experience with either BPF/XDP or RoCE. The two approaches may also have different pros and cons, so a careful review will be needed. Please give us some time. Anyway, we are very impressed with you all. Thank you again!

lsahn-gh commented 3 years ago

@Taeung

I don't think your approach is wrong; using BPF and modifying kernel code each have their advantages and disadvantages. But if we can easily improve TCP performance using BPF/XDP, I think we should try it. Why shouldn't we try BPF/XDP? Is modifying the kernel code and the TCP stack really the only way?

And as you know, arch/x86/kvm/ktcp.c is simple, like a userspace socket program, and it eventually calls functions in net/socket.c. How, concretely, would you modify ktcp.c and the kernel code?

That sounds good to me; alternatively, we could use raw sockets, netfilter hooks at the ingress/egress points, and so on.
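For illustration, the netfilter-hook variant might look roughly like this (a hedged sketch, not tested against the GiantVM tree; struct nf_hook_ops and nf_register_net_hook are real kernel APIs, while the DSM port check and the function names are assumptions made for this example):

```c
/* Hypothetical sketch: a PRE_ROUTING netfilter hook that could intercept
 * DSM packets before the TCP stack sees them. DSM_PORT is made up. */
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/skbuff.h>
#include <net/net_namespace.h>

#define DSM_PORT 9999 /* hypothetical DSM port, illustration only */

static unsigned int dsm_prerouting_hook(void *priv, struct sk_buff *skb,
					const struct nf_hook_state *state)
{
	struct iphdr *iph = ip_hdr(skb);
	struct tcphdr *tcph;

	if (iph->protocol != IPPROTO_TCP)
		return NF_ACCEPT;

	tcph = (struct tcphdr *)((u8 *)iph + iph->ihl * 4);
	if (tcph->dest == htons(DSM_PORT)) {
		/* A real implementation would hand the skb to a DSM
		 * handler and return NF_STOLEN, so the rest of the
		 * stack never processes it. This sketch only matches. */
		return NF_ACCEPT; /* placeholder */
	}
	return NF_ACCEPT;
}

static struct nf_hook_ops dsm_ops = {
	.hook     = dsm_prerouting_hook,
	.pf       = NFPROTO_IPV4,
	.hooknum  = NF_INET_PRE_ROUTING,
	.priority = NF_IP_PRI_FIRST,
};

/* Registration, e.g. from module/component init:
 *     nf_register_net_hook(&init_net, &dsm_ops);
 * and nf_unregister_net_hook(&init_net, &dsm_ops) on teardown. */
```

Unlike the XDP approach, this runs after the driver and IP defragmentation, so it is simpler to integrate but intercepts packets later in the receive path.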

One more thing about the original proposal: the benefit of TCP is connection-oriented communication, which means we have to worry less about handling faults in request/response exchanges over the network. By faults I mean things like packet loss.

I don't think bypassing the TCP stack is a good idea... or at least it needs more consideration. We should ask ourselves whether it is really a good choice, for performance's sake, not to retransmit a lost packet, and instead to let GiantVM wait for a timeout and take responsibility for recovery itself.