eycorsican / go-tun2socks

A tun2socks implementation written in Go.
MIT License

Wasted layer of buffering and inefficiency #149

Closed: fortuna closed this issue 3 years ago

fortuna commented 3 years ago

tcpRecvFn reads only one packet at a time from the pbuf chain (p.tot_len bytes), passing it to Receive, which blocks on the application's conn.Read call. It takes a buffer from the pool if the packet spans multiple pbuf entries, and allocates a new buffer if the packet is larger than 2KiB.

Problem 1) The buffer is held for the duration of the Read, which in practice means all the time. That buffering layer is actually unnecessary.

Problem 2) Only one packet is read, even though there may be a lot more data available. That's a missed opportunity for higher efficiency.

You can fix the problem by switching the data flow from push to pull and writing directly to the application-provided buffer.
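A minimal self-contained sketch of that pull model (not the actual go-tun2socks types; pullConn, deliver, closeWith, and the acked hook are made-up names): the lwIP recv callback only queues views over the received payload and wakes readers, while the copy happens inside Read, straight into the application's buffer.

```go
package core // illustrative only; these are not the actual go-tun2socks types

import (
	"io"
	"sync"
)

// pullConn sketches the pull model: the lwIP recv callback queues views over the
// received payload via deliver(), and Read copies them straight into the
// application's buffer, so no 2KiB intermediate buffer is held per connection.
type pullConn struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending [][]byte    // payload views queued by tcpRecvFn, oldest first
	err     error       // set on close (io.EOF) or failure
	acked   func(n int) // hook where tcp_recved() would be called, under the lwIP lock
}

func newPullConn(acked func(int)) *pullConn {
	c := &pullConn{acked: acked}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// deliver is what tcpRecvFn would call for each received pbuf; the caller must keep
// the pbuf's memory alive (or copy it) until Read has drained the view.
func (c *pullConn) deliver(seg []byte) {
	c.mu.Lock()
	c.pending = append(c.pending, seg)
	c.mu.Unlock()
	c.cond.Signal()
}

// closeWith ends the stream; io.EOF signals a clean remote close.
func (c *pullConn) closeWith(err error) {
	c.mu.Lock()
	c.err = err
	c.mu.Unlock()
	c.cond.Broadcast()
}

// Read blocks until data is available, then fills p with as much queued data as
// fits, possibly spanning several segments, and acknowledges the consumed bytes.
func (c *pullConn) Read(p []byte) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for len(c.pending) == 0 && c.err == nil {
		c.cond.Wait()
	}
	if len(c.pending) == 0 {
		return 0, c.err
	}
	n := 0
	for n < len(p) && len(c.pending) > 0 {
		copied := copy(p[n:], c.pending[0])
		n += copied
		if copied == len(c.pending[0]) {
			c.pending = c.pending[1:]
		} else {
			c.pending[0] = c.pending[0][copied:]
		}
	}
	if c.acked != nil {
		c.acked(n) // reopen the lwIP receive window for what was consumed
	}
	return n, nil
}

var _ io.Reader = (*pullConn)(nil)
```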

There are possibly some concurrency nuances to handle closing and errors. The implementation of io.Pipe has some insights.

/cc: @bemasc @alalamav

fortuna commented 3 years ago

If you don't want tcpRecvFn to access private members of TCPConn, then you could implement a local Reader with the logic above, and pass that Reader to the newTCPConn constructor, to be used in place of tcpConn.sndPipeReader. You could then get rid of tcpConn.Receive.
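For illustration, the wiring could look roughly like this fragment (hypothetical signatures; the real newTCPConn takes different arguments), reusing the pullConn sketch above as the local Reader:

```go
// tcpRecvFn keeps a reference to r and calls r.deliver()/r.closeWith(), while
// TCPConn only ever sees an io.Reader, so none of its private members are
// exposed to the callback.
r := newPullConn(func(n int) {
	// tcp_recved(pcb, n) would be called here, under the lwIP lock.
})
conn := newTCPConn(pcb, handler, r) // hypothetically, r takes the place of tcpConn.sndPipeReader
// ...and tcpConn.Receive can then be removed.
```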

eycorsican commented 3 years ago

I think it's possible to avoid this buffer allocation and this copy by calling Receive on the pbufs one by one in a loop, in this manner. Dechaining a pbuf from a chain seems quite easy; lwIP provides a pbuf_dechain function for that.
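Roughly like this inside tcpRecvFn, as a cgo-flavoured fragment (not the actual callback; nil/FIN handling, locking, and error paths are simplified, and it assumes conn.Receive takes the payload as a []byte):

```go
// pbuf_dechain() detaches the head pbuf from its chain and returns the tail, so each
// Receive call gets exactly one pbuf's payload and no Go-side buffer is needed to
// flatten the chain.
totLen := int(p.tot_len) // remember how much to acknowledge once everything is consumed
for p != nil {
	tail := C.pbuf_dechain(p) // after this, p.tot_len == p.len
	payload := (*[1 << 30]byte)(unsafe.Pointer(p.payload))[:int(p.len):int(p.len)]
	err := conn.Receive(payload) // still blocks on the application's Read
	C.pbuf_free(p)
	if err != nil {
		if tail != nil {
			C.pbuf_free(tail)
		}
		return C.err_t(C.ERR_ABRT) // the pcb would typically be aborted on error
	}
	p = tail
}
C.tcp_recved(tpcb, C.u16_t(totLen))
return C.err_t(C.ERR_OK)
```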

fortuna commented 3 years ago

That would help remove the extra 2KiB per connection. However, each Receive call still blocks on a Read call, so we would be limiting the Reads to one pbuf, when we could have the application read as much as it can in one go if we switch from push to pull.

One thing I don't understand is how the stream or packet is split into pbufs. Is there a max pbuf size? Is it a configuration option?

fortuna commented 3 years ago

To give you some more context about the problem we are having: the Outline client uses go-tun2socks to send traffic to a Shadowsocks server. Shadowsocks puts data in encrypted frames. We can accumulate as much as 16KiB in a frame, but because we don't know whether the next Read will block, we are forced to emit a frame for every Read.

Ideally, I'd give tun2socks my 16KiB buffer and it would fill it with as much available data as possible, possibly from multiple packets.
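For example, the relay loop on the Outline side would then be a plain copy with one big buffer (illustrative only; ssWriter stands in for whatever writer emits one Shadowsocks frame per Write):

```go
import "io"

// relay copies from the tun2socks TCP connection into the Shadowsocks stream with a
// single 16KiB buffer. With a pull-based Read, each iteration can coalesce data from
// several pbufs into one encrypted frame instead of emitting one frame per pbuf.
func relay(conn io.Reader, ssWriter io.Writer) error {
	buf := make([]byte, 16*1024)
	for {
		n, err := conn.Read(buf)
		if n > 0 {
			if _, werr := ssWriter.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```

This is essentially io.CopyBuffer(ssWriter, conn, buf); the gain only materializes if Read may return more than one pbuf's worth of data per call.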

The Shadowsocks overhead is not a lot: 34 bytes per frame. Assuming a 1500-byte MTU minus TCP/IP overhead = 1420 bytes of payload, that means sending roughly 2.4% more bytes for every byte sent. Not bad, but that's the minimum overhead, and it becomes a real problem with small pbufs: the overhead is roughly 25% if we emit a frame per ~140-byte read.
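A quick back-of-the-envelope check of those figures (a small fragment; the ~140-byte case is just an illustration of a very small read):

```go
// Framing overhead from the numbers above: 34 bytes of Shadowsocks framing per frame,
// divided by the payload carried in that frame.
overheadPct := func(payloadBytes float64) float64 { return 34.0 / payloadBytes * 100 }
fmt.Printf("1420 B payload (one full packet): %.1f%%\n", overheadPct(1420))    // ~2.4%
fmt.Printf("16 KiB payload (one full frame):  %.2f%%\n", overheadPct(16*1024)) // ~0.21%
fmt.Printf("~140 B payload (a tiny read):     %.0f%%\n", overheadPct(140))     // ~24%
```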

eycorsican commented 3 years ago

This PR would solve the blocking issue, and if tcpRecvFn is called fast enough (cgo is slow), the application would get as much data as possible. But it introduces a buffer. It's not clear to me how we could solve both problems at the same time.

> One thing I don't understand is how the stream or packet is split into pbufs. Is there a max pbuf size? Is it a configuration option?

FWIW, it depends on how the pbuf is allocated and on the size of the input IP packets; the tcpRecvFn callback is mainly driven by packet input.

eycorsican commented 3 years ago

I see three issues here:

  1. Extra intermediate buffering
  2. Application Read blocks lwIP
  3. Application Reads almost always return small data chunks

I implemented your idea above (if I'm understanding it right) in https://github.com/eycorsican/go-tun2socks/commit/425fb2eb0eb166050d948408ecd498382bc5da82, and it should solve 1 and 2. Without intermediate buffering, I don't see any way to solve 3.
