Mellvik / TLVC

Tiny Linux for Vintage Computers

ktcp send is too slow #67

Closed Mellvik closed 2 months ago

Mellvik commented 3 months ago

This issue is described in detail in #64 under the heading TCP speed kludge, which essentially observes that introducing a delay in the ee16 driver's write select function will drastically increase send-speed.

What happens is that the delay prevents an extra round through select_wait/wakeup. A small delay causes a big speedup by 'disturbing' the transactional rhythm of the send process between ktcp and the driver. It's a complicated issue because of the design of the system, but still one of great interest because of the improvement potential.

The 3C509 driver was just updated (by accident directly to master) with a delay similar to the ee16 driver and the ftp-send speed improvement was a factor of 4, sometimes beating 100kbps. This is remarkable - not least because the max outgoing TCP packet size is 512 bytes, meaning a lot of packets are processed.

ghaerr commented 3 months ago

Interesting. I'm not looking at the source, but it seems what you're saying is that the NIC driver, after finishing the transmit of a packet, does not have access to another queued packet, so it stops sending. Meanwhile, ktcp is waiting in select for either a write OK from the driver or a read OK indicating another packet received (which is not the case here, or is it?). Ktcp then returns from select and writes another packet. How exactly does the driver's write delay (before or after sending a packet?) allow this to speed up?

If ktcp is actually waiting for an ACK from the driver before continuing to provide the next packet to write, then I suppose this is more complicated in that it opens an additional possibility that ktcp is busy processing the received ACK when an additional NIC write complete interrupt occurs. In this case perhaps the select wait code isn't correct and the write-available status isn't set properly, causing ktcp's next call to select to wait for yet another interrupt (from something?) in order to be allowed to proceed. In this case it is extremely important that the driver mutex read/write rules are followed to prevent an inadvertent race condition on accessing the write-OK variable. I believe it is correct, but it's been a while since I looked at it. Given that another NIC driver is showing the same symptoms, this likely isn't the cause, but I think all the drivers share much the same mutex access code.

With regard to select wait/wakeup: other than going into the kernel and back, which isn't exactly super fast but not really slow either, there could be several reasons a select wait/wakeup cycle takes a lot of time.

In general, I don't quite yet understand exactly what you've found that is allowing a NIC delay to speed up overall throughput, but nice finding. I could try to help debug but the NIC drivers aren't (yet?) compatible between TLVC and ELKS so I can't easily set up a test rig to duplicate the result.

Mellvik commented 3 months ago

Thanks @ghaerr - I was hoping this issue would catch your interest. As is often the case, just asking the questions differently causes entirely different chains of thought. And I suspect this may be complicated.

IMPORTANT (and missing) BACKGROUND: First let me add an important piece of information that I completely forgot in the PR notes (and the issue-report): This situation does not occur if both nodes are on the same wire/segment (the simple case). If on the same wire – and given the fact that the peer system, whether a Mac, Linux or Windows system, is an order of magnitude or two faster than the TLVC system – any outgoing transfer from TLVC is literally packet-synchronous: send packet, get ACK, send packet, get ACK. Indeed, the ACKs pop up almost before the sent packet leaves the NIC. There are rarely if ever any read-select-waits at all during the transfer and the speed is really good.

The problem (or situation, if you like) as described occurs only when there is a router between the TLVC system and the peer, usually my Mac. The router, while being a fast Ubiquiti EdgeRouter, adds sufficient dynamism to the picture to break the synchronous pattern of the wire-local setting. Timings differ, and packets are no longer strictly in send-ack-send-ack order. The TCP windowing gets to work, and that's seemingly where the problem arises.

From the PR notes, remember that while the delays that ultimately revealed this speedup trick were in the xmit interrupt handler, xmit has nothing to do with it. The delay was moved around until I found the 'optimal' placement - in read_select: If (nothing to be read) then add delay and check again. If we're in a send flow and the optimal delay value has been found, we catch the next ack as it arrives and avoid a call to select_wait.
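
To make the placement concrete, here's a rough sketch of where the delay sits in the driver's select path - names and the delay constant are illustrative only, not the actual TLVC source:

    /* sketch only - illustrative names, not the actual driver code */
    case SEL_IN:                        /* read select */
        if (!rx_packet_ready()) {       /* nothing to read yet */
            udelay(KLUDGE_DELAY);       /* brief pause: give the next ACK time to arrive */
            if (rx_packet_ready())
                return 1;               /* caught it - no select_wait/wakeup round trip */
            select_wait(&rxwait);       /* still nothing: wait for the receive interrupt */
            return 0;
        }
        return 1;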

A side note: Keep in mind that neither of the drivers with which this trick has been tested is using driver level buffering. The dataflow is still userspace to NIC and back. With buffers, the picture would presumably look quite different.

I put some tracers back into the code this morning and ran some more tests which I have yet to study. A cursory look points to ktcp, like before. Without looking at the code, it seems a read_wakeup will trigger repeated calls to select_read before even attempting the next write. I'll have more on that shortly.

Mellvik commented 2 months ago

@ghaerr, I finally got some screen time to look closer at this and ended up with a verdict, so to speak. (Among other things I found that your comment a while back - that the road from user process to NIC driver and back is really messy - was spot on.)

Here's the short version: when send gets too far ahead of the received ACKs (in practice 4 packets), ktcp throttles by returning -ERESTARTSYS to the kernel/writing process. Which it should do - but where is the balance point? How far ahead does it make sense to let send get? What are the actual tradeoffs, and which knobs are available for tuning?

The experience that led to this 'hunt' speaks for itself: a minor delay at the right point changes the send speed by a factor of 3 to 4. How? By ensuring that an ACK gets received and processed just in time to avoid exceeding TCP_SEND_WINDOW_MAX (in tcpdev_write).

If this limit IS exceeded, which is the normal (and slow) case, the sending process will be delayed by 0.1 seconds before trying again. That sounds minimal, but in a file transfer flow it's obviously big (and it always ends up being more than 0.1 seconds; how much more depends on the circumstances).
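
In outline - simplified and with made-up variable names, not the literal source - the two halves of the throttle look roughly like this:

    /* ktcp side (tcpdev_write, simplified): refuse data when too far ahead of the ACKs */
    if (cb->bytes_unacked + count > TCP_SEND_WINDOW_MAX)
        return -ERESTARTSYS;                /* tell the kernel to back off and retry */

    /* kernel side (inet_write in af_inet.c, simplified): the 0.1 second back-off */
    if (err == -ERESTARTSYS) {
        current->timeout = jiffies + HZ/10; /* nominally 0.1 second, in practice more */
        schedule();                         /* let something else run, then try again */
    }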

The quick fix is to increase TCP_SEND_WINDOW_MAX, which works at the cost of reducing the # of concurrent connections that can work efficiently. Empirical data are forthcoming.

A different fix which I haven't looked into yet, is the 0.1 second delay when the WINDOW_MAX value is exceeded. This is set in the af_inet kernel code (inet_write), and may be worthwhile experimenting with. Since there is always a schedule() involved, I'm unsure how much room there really is for adjustment.

Part of the challenge is that the effect and consequences of the changes made will be highly dependent on the speed of the NIC, driver and (in particular) the machine. It may actually be useful to introduce a speed metric a la BOGOMIPS in order to cover the span from 4.77MHz XTs to 40+MHz 386 machines.

To be continued.

ghaerr commented 2 months ago

I think I'm starting to understand what you are explaining, but it's been a while since I jumped into this section of code. At the upper level though it sounds like ultimately this major speed issue is the result of complications of the way ELKS handles send window flow control... ?

Looking at ee16.c, yes I see now the kluge. Definitely not something that should stay in, but serves as a marker post for at least allowing playing with the inter-system tuning. udelay isn't using a real time for a delay, so I don't really have any idea how much delay we're actually talking about here. Perhaps looking at the ASM source for udelay and the CPU and I/O bus speed would allow you to calculate how many microsecond delay is actually required. (Such a measurement capability would also be useful for seeing how long a reschedule takes and/or how long it takes to process a packet remotely, see below).

It may actually be useful to introduce a speed metric a la BOGOMIPS

BOGOMIPS is very outdated and CPU dependent; it would be better to calculate and use real time somehow to get a better long-term handle on what's happening with packet transmission and the problematic window management code. If the system is 386, the CPU supports the RDTSC opcode, which will return real time info (I have code, more details on request).

The quick fix is to increase TCP_SEND_WINDOW_MAX, which works at the cost of reducing the # of concurrent connections that can work efficiently.

This sounds like a potentially good idea. How are you thinking it reduces the # of concurrent connections, is this because of a fixed kernel buffer size or instead the ktcp buffer size per connection? I am also wondering whether reducing the window size by one packet might also cause better results... (Only because I think the bigger issue is a system-to-system-dependent throughput as you mentioned but also related to crude window management that becomes very inefficient only around the edge case where a single final packet is held back due to the total system throughput). (On that note, perhaps send window size could be calculated from actual packet trip time measured accurately, and just adjust window max accordingly. There is some RTT code already in ktcp, but IIRC its only used for retransmit backoffs?).
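
(As a sketch of that last idea - made-up names, nothing like this exists in ktcp today - the send window only needs to cover roughly one bandwidth-delay product:)

    /* sketch only: derive the send window from measured round-trip time */
    send_window = bytes_per_ms * srtt_ms;   /* classic bandwidth-delay product */
    if (send_window > TCP_SEND_WINDOW_MAX)  /* never exceed the compiled-in cap */
        send_window = TCP_SEND_WINDOW_MAX;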

A different fix which I haven't looked into yet, is the 0.1 second delay when the WINDOW_MAX value is exceeded.

Since there is always a schedule() involved, I'm unsure how much room there really is for adjustment.

This could work with probably a lot smaller delay, as the 0.1 second delay was completely arbitrary. Some delay before schedule is required, as otherwise the system would likely busy-loop and the idea was to let something else run. However, there's a huge difference between a schedule-based context switch and 100ms for sure (as the system handles a process time slice every clock tick which is every 100ms!) Of course, one must also consider a large delay allowing for the remote system to actually process the outstanding window before time-slicing back into ktcp for no reason...

Complicated, for sure. I'm thinking off the top of my head that removing all kluge delays and coming up with a solution that adjusts the outstanding window size based on multi-packet throughput could get close to max efficient, and possibly a variable delay for the 0.1 timeout based on CPU speed that allows for outstanding packets to be processed, although this is likely far less important once the send window has forced transmission to stop.

Mellvik commented 2 months ago

Thank you @ghaerr, this is becoming really useful on a path that may deliver something scalable - and efficient!

I think I'm starting to understand what you are explaining, but it's been a while since I jumped into this section of code. At the upper level though it sounds like ultimately this major speed issue is the result of complications of the way ELKS handles send window flow control... ?

That's right. The simplicity of the TCP implementation has side effects that pop up in certain circumstances - like this one, which I probably wouldn't have discovered if my setup were simpler. An interesting challenge: find a fix without adding complexity.

Looking at ee16.c, yes I see now the kluge. Definitely not something that should stay in, but serves as a marker post for at least allowing playing with the inter-system tuning. udelay isn't using a real time for a delay, so I don't really have any idea how much delay we're actually talking about here. Perhaps looking at the ASM source for udelay and the CPU and I/O bus speed would allow you to calculate how many microsecond delay is actually required. (Such a measurement capability would also be useful for seeing how long a reschedule takes and/or how long it takes to process a packet remotely, see below).

udelay is actually a loop of outb instructions so it's possible (but not necessarily useful in the bigger picture) to calculate the delay from the processor type and clock frequency.
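
In principle it boils down to something like this (a sketch of the idea only, not the actual udelay source):

    /* sketch of the principle, not the actual udelay() source */
    void udelay(unsigned int loops)
    {
        while (loops--)
            outb(0, 0x80);      /* dummy write to an unused port: roughly one ISA bus
                                   transaction, on the order of a microsecond */
    }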

It may actually be useful to introduce a speed metric a la BOGOMIPS

BOGOMIPS is very outdated and CPU dependent; it would be better to calculate and use real time somehow to get a better long-term handle on what's happening with packet transmission and the problematic window management code. If the system is 386, the CPU supports the RDTSC opcode, which will return real time info (I have code, more details on request).

BOGOMIPS may be outdated, but so are the processors we're dealing with, including the 386 and later in real mode. The usefulness of a BOGOMIPS value would be the ability to scale delays like udelay() to the actual processor. There are plenty of such delays in drivers around TLVC and ELKS, not least in the floppy driver, that would benefit from better scaling. That said, having real time would of course be better.
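
Something as simple as counting delay-loop iterations against the timer tick at boot would do - a sketch with made-up names, nothing implemented:

    /* sketch only - calibrate the delay loop against the timer tick at boot */
    unsigned long loops_per_tick;

    static void calibrate_delay(void)
    {
        unsigned long loops = 0;
        unsigned long start = jiffies;

        while (jiffies == start)    /* synchronize to a tick edge */
            ;
        start = jiffies;
        while (jiffies == start)    /* count iterations during one full tick */
            loops++;
        loops_per_tick = loops;     /* udelay() and friends can then scale by this */
    }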

The quick fix is to increase TCP_SEND_WINDOW_MAX, which works at the cost of reducing the # of concurrent connections that can work efficiently.

This sounds like a potentially good idea. How are you thinking it reduces the # of concurrent connections, is this because of a fixed kernel buffer size or instead the ktcp buffer size per connection?

Not intending to be pedantic, but I didn't say 'reduce the # of connections' but 'the # of concurrent connections that can work effectively '. Meaning that increasing the # of outstanding packets per connection would eat more buffer space per connection, leaving less for new connections. OTOH this may be somewhat esoteric given the environment: 1) how many concurrent high traffic connections are we likely to need/have, and 2) this is dynamic - it's not like the max outstanding is a reservation, it's a limitation. So it may not matter much. (To be tested).

I am also wondering whether reducing the window size by one packet might also cause better results... (Only because I think the bigger issue is a system-to-system-dependent throughput as you mentioned but also related to crude window management that becomes very inefficient only around the edge case where a single final packet is held back due to the total system throughput). (On that note, perhaps send window size could be calculated from actual packet trip time measured accurately, and just adjust window max accordingly. There is some RTT code already in ktcp, but IIRC its only used for retransmit backoffs?).

This is interesting - including reducing the window size, I'll test that. A reminder that this is not entirely deterministic. Running exactly the same code, transferring exactly the same file to exactly the same peer from a different TLVC/ELKS machine would likely deliver very different results. In the same way that running a ping alongside a file transfer may speed up the transfer because it (the ping) changes the rhythm of the packet flow. This points toward getting a handle on actual real time for some automatic tuning, as you refer to several times.

A different fix which I haven't looked into yet, is the 0.1 second delay when the WINDOW_MAX value is exceeded.

Since there is always a schedule() involved, I'm unsure how much room there really is for adjustment.

This could work with probably a lot smaller delay, as the 0.1 second delay was completely arbitrary. Some delay before schedule is required, as otherwise the system would likely busy-loop and the idea was to let something else run. However, there's a huge difference between a schedule-based context switch and 100ms for sure (as the system handles a process time slice every clock tick which is every 100ms!) Of course, one must also consider a large delay allowing for the remote system to actually process the outstanding window before time-slicing back into ktcp for no reason...

Again, interesting. And this one is easy to test. I pointed out before that the peer system most likely is one or several orders of magnitude faster than the TLVC machine, but there is always the case to consider that the other system may even be slower (like another TLVC or ELKS system). Also there is the case at hand where the router is talking Fast Ethernet, possibly GbE, to the peer and old-fashioned (10M) Ethernet to our host, forcing it to do buffering, which will affect delays.

Complicated, for sure. I'm thinking off the top of my head that removing all kluge delays and coming up with a solution that adjusts the outstanding window size based on multi-packet throughput could get close to max efficient, and possibly a variable delay for the 0.1 timeout based on CPU speed that allows for outstanding packets to be processed, although this is likely far less important once the send window has forced transmission to stop.

Agreed. For now - while experimenting - the kludge remains in place as a metric, with the ability to be turned on/off via bootopts. Some rough numbers as to the effect of increasing/decreasing the max window are next, and similarly for the timeout: how much of a difference does 50ms vs 100ms make? And finally, with 4 different hardware platforms available there is ample opportunity for algorithm testing/verification.

ghaerr commented 2 months ago

How are you thinking it reduces the # of concurrent connections, is this because of a fixed kernel buffer size or instead the ktcp buffer size per connection?

Not intending to be pedantic, but I didn't say 'reduce the # of connections' but 'the # of concurrent connections that can work effectively '. Meaning that increasing the # of outstanding packets per connection would eat more buffer space per connection, leaving less for new connections.

What I was trying to ask was how you think the buffer size is reduced, thus becoming less efficient? If we're talking kernel buffers, there are a few fixed buffers that aren't easily extended. If we're talking about the ktcp buffer per connection, then a small increase of buffer size, or possibly a realloc of the existing connection buffer (after the initial malloc) might solve this issue. I know that ktcp of course can run out of memory, but don't have info on how often this happens. I also seem to remember some thought about reducing the memory buffer size for listening connections, I can't remember if that got implemented or not. Ktcp has a lot of control over buffer sizes if that's where you see a potential speed/efficiency limiting factor.

Mellvik commented 2 months ago

I guess we're missing each other's point here, @ghaerr.

The TCP_SEND_WINDOW_MAX sets the max number of outstanding (non-acked) packets (actually bytes, not packets) per connection and is allocated from a buffer pool shared by all connections, currently 4k (in TCP_RETRANS_MAXMEM). What I was saying (actually speculating) was that allowing one connection to take a larger part of that buffer space may - theoretically - affect other connections competing for the same resource. Then I corrected myself admitting that that may be a very theoretical scenario indeed because of the environment. How many such concurrent connections are we likely to have?

I think we're fine with the current values and upping the WINDOW to 1536, which looks very promising so far.
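
For the record, the arithmetic behind that trade-off, using the numbers above:

    /* with the 4 KB shared pool (TCP_RETRANS_MAXMEM = 4096):                      */
    /*   TCP_SEND_WINDOW_MAX = 1024  ->  4096/1024 = 4 connections at full window  */
    /*   TCP_SEND_WINDOW_MAX = 1536  ->  4096/1536 = 2 (and change) at full window */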

ghaerr commented 2 months ago

The TCP_SEND_WINDOW_MAX sets the max number of outstanding (non-acked) packets (actually bytes, not packets) per connection and is allocated from a buffer pool shared by all connections, currently 4k

I guess we're missing each other's point here

Ah, I was missing your point by trying to think about all this without looking at the code. Thanks for the reminder explanation. So the potential issue and fix may be limited to the retransmit code and buffers only, rather than the general workings of the TCP stack prior to having transmitted the packets in question. I was thinking about the upper level TCP operation, not the specific retransmit dynamics. I remember that the lower level retransmit code has never been great at resyncing in certain edge cases, and that also may need to be added to the list of things to consider in a better solution. [EDIT: After reading some code, I understand better what you're saying - TCP_SEND_WINDOW_MAX is related to the retransmit buffer size, in that any sent packet uses TCP_RETRANSMIT_MAXMEM, but we're not talking about retransmit issues here at all - only that it might be better to increase TCP_SEND_WINDOW_MAX, which you stated initially as an easy solution.]

As I write this, being involved in another project using lots of memory, it occurs to me ktcp could possibly be fairly easily enhanced to use far memory for a lot larger retransmit buffer outside the ktcp process space if that kind of thing would help, given the current limitations of the retransmit design. This could also be done for the per-connection buffers as well. (I have just finished a major upgrade to ELKS that allows for multi-segment, large/medium/compact model binaries of any size to be loaded and executed, so have a bit changed perspective on real mode memory usage).

Mellvik commented 2 months ago

As I write this, being involved in another project using lots of memory, it occurs to me ktcp could possibly be fairly easily enhanced to use far memory for a lot larger retransmit buffer outside the ktcp process space if that kind of thing would help, given the current limitations of the retransmit design. This could also be done for the per-connection buffers as well. (I have just finished a major upgrade to ELKS that allows for multi-segment, large/medium/compact model binaries of any size to be loaded and executed, so have a bit changed perspective on real mode memory usage).

Very interesting indeed, @ghaerr - and incidentally something (far buffers, that is) I've been contemplating recently, although coming from a very different angle and not immediately related to retransmits.

Background: I started working on an AMD Lance/79C760 driver a few weeks back. This is a very different animal in that it does not have any NIC buffers whatsoever, but uses host memory and DMA directly. A complication in many ways - including the real time requirements (there is a 16 or 32 byte FIFO, that's it) - similar to direct floppy but at much higher data rates. Then there are the 64k physical boundary restrictions plus alignment requirements.

Heading over to an early Linux Lance driver, I discovered 'socket buffers', 'skbufs', which - as it turns out, and you may be familiar with this already - are to sockets what (I/O) buffers are to mass storage. Relatively easy to implement in a 32bit/virtual setting, less so in our environment. I decided to go with heap allocation instead, thinking it would be easier - and it probably would, but the skbufs keep coming back. Like when, while investigating the issue in this thread, I (re)discovered the number of buffer copies happening on the way from the application to the NIC and back. With (external) skbufs, the data could be passed along as pointers instead, just like in the buffer system, freeing up both application and kernel space and speeding things up. Possibly (I haven't thought this one completely through yet) even eliminating the retransmit buffer completely, keeping 'dirty' skbufs around until ack'ed out of the way. The implementation could also take care of special requirements like physical address boundaries and alignment. And finally, NICs would read/write data from/to skbufs, which would replace the buffer layer I've been experimenting with in the ne2k driver using heap buffers. I'd be interested in your thoughts on this.
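
For illustration, the kind of structure I have in mind would be only a fraction of what Linux carries around - something like this (a sketch only, nothing is implemented):

    /* sketch only - a minimal TLVC-scale 'skbuf', nothing like this exists yet */
    struct skbuf {
        struct skbuf __far *next;   /* free list or per-connection queue */
        seg_t seg;                  /* segment of the (far) buffer holding the frame */
        unsigned int offset;        /* start of frame within the segment */
        unsigned int len;           /* bytes in the frame */
        unsigned char flags;        /* e.g. 'dirty' (sent, not yet ACKed), DMA-safe */
    };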

Back to the case at hand - and for now - I think increasing the retransmit buffer space somewhat, allowing more data outstanding per connection takes care of the immediate requirements - I'm collecting more empirical data on that as we speak. As reported before, increasing the MAX_WINDOW to 1536 from 1024 fixed the problem in this particular setting, but I'm interested in whether reducing the wait timeout to say 50ms is useful too, and whether that usefulness prevails on very slow machines.

ghaerr commented 2 months ago

I discovered 'socket buffers', 'skbufs' ... With (external) skbufs, the data could be passed along as pointers instead, just like in the buffer system, freeing up both application and kernel space and speeding things up.

Yes, skbufs are pretty neat, mostly for their ability within a layered (network) system to remove copies associated with adding/removing TCP/IP/ethernet headers within the TCP/IP implementation, as well as integrating nicely within a (character) I/O subsystem built around readv/writev, where the network utilities themselves pass lists of data back and forth. A potential drawback is the need for almost everything to be converted to lists of data, rather than an array.

While it might seem like a big improvement to rewrite everything for efficiency, the data still has to go somewhere, and then we have the issue of whether skbufs would be automatically allocated anywhere in memory, which could result in memory checkerboarding - unless the skbufs are all allocated at once, in which case we may have some of the fixed-size buffer problems we're seeing now, lessening the advantage of skbufs.

And finally, NICs would read/write data from/to skbufs, which would replace the buffer layer I've been experimenting with in the ne2k driver using heap buffers.

My take is we're probably better off not rewriting everything to use skbufs, but instead use a modified idea of the heap allocation you're already using - use far rather than near heap allocation. Both the NIC kernel buffers and the ktcp retransmit buffer could likely be fairly easily converted to using main memory and far pointers. This would not increase any copying time between NIC buffer and user programs, as we're already using fmemcpy for that. If these buffers were all allocated at NIC open and ktcp startup, then they'd likely be packed together in one area of main memory and not introduce any checkerboarding. IMO this would go a long ways towards allowing large NIC buffering and ktcp retransmit buffering, all dynamically allocated by reading /bootopts or /etc/net.cfg. (ELKS added dynamic task array some time back, so the task array itself can now be dynamically adjusted to system requirements).

for now - I think increasing the retransmit buffer space somewhat, allowing more data outstanding per connection takes care of the immediate requirements

That sounds like a good fix to me, with the idea of then adding the ability to dynamically allocate both NIC and retransmit buffers first from the local heap (at open time) and later from far main memory.

Mellvik commented 2 months ago

Thank you @ghaerr, this is good.

My take is we're probably better off not rewriting everything to use skbufs, but instead use a modified idea of the heap allocation you're already using - use far rather than near heap allocation. Both the NIC kernel buffers and the ktcp retransmit buffer could likely be fairly easily converted to using main memory and far pointers. This would not increase any copying time between NIC buffer and user programs, as we're already using fmemcpy for that. If these buffers were all allocated at NIC open and ktcp startup, then they'd likely be packed together in one area of main memory and not introduce any checkerboarding. IMO this would go a long ways towards allowing large NIC buffering and ktcp retransmit buffering, all dynamically allocated by reading /bootopts or /etc/net.cfg. (ELKS added dynamic task array some time back, so the task array itself can now be dynamically adjusted to system requirements).

I completely agree. You might even say this is a roadmap towards skbufs light. And yes, converting the 'participants' to use an external heap implementation should be no big deal. What I suggest is that if you implement 'external heap' in ELKS, I'll port it over to TLVC and adapt ktcp and the network drivers to take advantage of it. The difference between net drivers in ELKS and TLVC should be minimal - primarily handling of /bootopts parameters plus the buffer implementation, currently limited to the ne2k driver. [Meaning that getting the ee0 driver up on ELKS should be easy.]

Thanks for the heads up on the dynamic task array in ELKS, I'll take a look at that right away.

for now - I think increasing the retransmit buffer space somewhat, allowing more data outstanding per connection takes care of the immediate requirements

That sounds like a good fix to me, with the idea of then adding the ability to dynamically allocate both NIC and retransmit buffers first from the local heap (at open time) and later from far main memory.

Right. The current TLVC setup with NIC buffers specified in /bootopts fits well into this scenario.

I've done a lot of testing on this and among other things found that changing (shortening) the sleep timeout will sometimes cause a deadlock - on fast machines and 2 concurrent outgoing transfers, possibly other scenarios as well. Slower systems, XT and 286AT, don't have this problem at all - the network being faster than the machine and never any waits.

ghaerr commented 2 months ago

What I suggest is that if you implement 'external heap' in ELKS, I'll port it over to TLVC and adapt ktcp and the network drivers to take advantage of it.

Allocating from main memory rather than the kernel local heap is already done, although a "SEG_FLAG_DRVR" flag should probably be implemented so that meminfo shows the buffers as DRVR buffers. Basically, instead of doing this in the (direct floppy) driver:

    char *floppy_buffer = heap_alloc(bytes, HEAP_TAG_DRVR);
    if (!floppy_buffer)
        return -ENOMEM;
    ...
    heap_free(floppy_buffer);

you could use something like this (in the open and close routines):

    segext_t paras = ((bytes + 15) & ~15) >> 4;
    segment *mseg = seg_alloc(paras, SEG_FLAG_DRVR); // new flag name
    if (!mseg) return -ENOMEM;
    char __far *floppy_buffer = _MK_FP(mseg->base, 0);
    ...
    seg_put(mseg); // free buffer

You'll find that the char __far * can get messy in a hurry unless you think clearly about exactly how/which functions will access the buffer, so that only a few far pointers are required elsewhere.

The difference between net drivers in ELKS and TLVC should be minimal [Meaning that getting the ee0 driver up on ELKS should be easy.]

It would be great if you could help with creating a TLVC & ELKS compatible NIC driver standard (either through an initial document of sorts or just providing an initial .c file for one NIC, so that other .c files might come over unmodified). I can do the ELKS kernel "upstairs" work if needed and there are likely some issues with /bootopts variables, but I don't have the capability of testing any NIC source outside of QEMU, so few NIC enhancements have been done since your TLVC fork regarding networking. Your NIC coding, testing and knowledge are missed over at ELKS! Driver compatibility between ELKS and TLVC will also help users move to TLVC or vice-versa if desired. The biggest issue is likely containing the "global" variable access and external variables required by a driver - ideally this should be very small and come through well-defined interfaces (yes, we need a /bootopts interface/struct, really).

found that changing (shortening) the sleep timeout will sometimes cause a deadlock

That is strange; I don't yet have any idea why that may be. A shortened timeout would seem to just repeat the kernel <-> ktcp transfer request over and over rather than hanging.

Mellvik commented 2 months ago

Thanks @ghaerr - very interesting!

What I suggest is that if you implement 'external heap' in ELKS, I'll port it over to TLVC and adapt ktcp and the network drivers to take advantage of it.

Allocating from main memory rather than the kernel local heap is already done, although a "SEG_FLAG_DRVR" flag should probably be implemented so that meminfo shows the buffers as DRVR buffers.

I didn't know that! Great - how do you suggest we access this from ktcp? Also, when moving some buffers from kernel heap to 'general memory', would it make sense to shrink the heap in order to balance the total memory availability?

The difference between net drivers in ELKS and TLVC should be minimal [Meaning that getting the ee0 driver up on ELKS should be easy.]

It would be great if you could help with creating a TLVC & ELKS compatible NIC driver standard (either through an initial document of sorts or just providing an initial .c file for one NIC, so that other .c files might come over unmodified). I can do the ELKS kernel "upstairs" work if needed and there are likely some issues with /bootopts variables, but I don't have the capability of testing any NIC source outside of QEMU, so few NIC enhancements have been done since your TLVC fork regarding networking. Your NIC coding, testing and knowledge are missed over at ELKS! Driver compatibility between ELKS and TLVC will also help users move to TLVC or vice-versa if desired. The biggest issue is likely containing the "global" variable access and external variables required by a driver - ideally this should be very small and come through well-defined interfaces (yes, we need a /bootopts interface/struct, really).

I agree on all points. The driver API is obviously the same since it is equal across all device drivers, but the globals are a different story. Actually, given that we're now dealing with 5 NIC types, it is time to restructure the globals to avoid the static, increasingly RAM-consuming and rigid structure we have today. Maybe even part with the fixed device number per NIC type regime. Of course that ends up being a bigger project, but the point here is that we share it so the drivers stay compatible. BTW - you may want to check out 86Box for access to more NICs. One reason I started on the Lance NIC is that it is supported by VirtualBox. It would be useful to have TLVC (and ELKS) on VirtualBox with networking.

Anyway, I'll start off by creating a wiki-entry on the globals and the boot options - and possibly the current simple setup for NIC buffering used by the ne2k driver. As new drivers are added, new requirements to boot parameters pop up, like DMA chan for the Lance driver. This falls under your statement about the need for a bootopts 'interface'.

BTW - side issue: The EtherExpress driver implements both shared-memory access and PIO. Testing shows no significant difference in performance between the two, regardless of system speed, and I'm tempted to ditch (#ifdef) the shmem code entirely in order to reduce the size of the driver. Any opinions?

found that changing (shortening) the sleep timeout will sometimes cause a deadlock

That is strange; I don't yet have any idea why that may be. A shortened timeout would seem to just repeat the kernel <-> ktcp transfer request over and over rather than hanging.

I haven't spent any time on this yet, and I'm tempted to just let it go for now, with a comment in the af_inet.c file. Remember though that this only happens when two or more transfers are running in parallel. What happens if both end up in the same (say) 50ms wait, at almost the same time? Different queues, but how would this interact with the regular scheduling, and why would 50ms be different from 100ms? Lots of questions - for another day.

ghaerr commented 2 months ago

how do you suggest we access this from ktcp?

That example code is only for the kernel or NIC drivers allocating main memory for their own use. For ktcp or other application programs, fmemalloc can be used to allocate from main memory:

    segext_t paras = ((bytes + 15) & ~15) >> 4;
    seg_t seg;
    if (fmemalloc(paras, &seg))
        return errno;
    char __far *p = _MK_FP(seg, 0);
    ...
    /* currently no way to free far application memory; it is freed by exit() */

This would be the mechanism suggested to rewrite ktcp's retransmit buffer to use far memory (if desired), and isn't used for directly sharing data with the kernel. The existing /dev/tcpdev method of moving data between applications and the kernel will remain the same; the only difference will be kernel or application buffers being allocated far to conserve the 64k max data segment size existing in both app and kernel.

Since the network drivers are opened and ktcp started sequentially usually during system startup, I would think that most of these buffers would be allocated from a non-checkerboarded main memory and remain unchecker-boarded. This could change slightly depending on what exactly is run during /etc/rc.sys, with details checked using meminfo after startup.

would it make sense to shrink the heap in order to balance the total memory availability?

No - in general we'll want max heap in both ktcp and the kernel to be usable for other things (like the dynamic task structure and ktcp connection buffers, etc). 64K itself isn't that much total system memory; it's the 8086 segment offset limit that's the primary restriction.

ghaerr commented 2 months ago

the globals are a different story. Actually, given that we're now dealing with 5 NIC types, it is time to restructure the globals to avoid the static

An easy way to accomplish this is by realizing we only need source, not binary, compatibility. This means that what were previously separate global variables, all possibly dependent on system configuration, can be stored inside a single structure, with the driver or bootopts code then referencing structure elements. ELKS and TLVC can have different configurations or options but keep source compatibility. Here's a possible example to highlight the idea:

options.h:

#include <netstat.h>
struct bootopts {
   int root_mountflags;  // various variables settable by /bootopts
   int running_qemu;
   struct netif_parms netif_parms[MAX_ETHS]; // per-NIC setup
#ifdef TLVC
   int some_variable; // TLVC-specific variable
#endif
#ifdef ELKS
   int someother_var; // an ELKS variable
#endif
   ...
};
extern struct bootopts bootopts;

elks/init/main.c:

    struct bootopts bootopts;
    ...
    bootopts.running_qemu = ...
    bootopts.netif_parms[ETH_NE2K].irq = ...
   ...

NE2K NIC driver:

#include <options.h>
...
struct netif_parms *netparm = &bootopts.netif_parms[ETH_NE2K];
...

    ... (later inside any function)
   if (netparm->irq == 3) ...
   ...
   netparm->flags = ...
   ...
   if (bootopts.running_qemu) ...

What happens is that the size and actual layout of struct bootopts (as well as struct netif_parms) don't actually matter, as the structure is compiled differently on different TLVC or ELKS systems. The drivers themselves only access the variables through the global structure, and the structure is generally opaque except for those drivers that actually use the member variables (as well as, of course, init/main.c which sets them).

A driver could remain compatible if it only includes options.h and accesses globals through struct bootopts, except that any new driver could easily add more NIC-specific configuration data by just adding lines to the struct definition. Ultimately, even the init/main.c code which reads /bootopts could call out to a function in elks/arch/i86/drivers/net/init.c which would parse the data. This then would keep all the NIC code in a single directory (almost) and allow for "porting" a driver by copying a driver file and init.c from one system to another, and possibly adding some lines into include/linuxmt/options.h.
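
The callout itself could be as small as this (purely illustrative - neither the function nor the option prefix exists yet):

    /* init/main.c, while parsing /bootopts lines (illustrative only): */
    if (!strncmp(line, "net", 3))
        net_parse_bootopt(line);    /* hand all net* options to drivers/net/init.c */

    /* elks/arch/i86/drivers/net/init.c: */
    void net_parse_bootopt(char *line)
    {
        /* decode e.g. irq, port, shared-mem address, DMA channel and fill in
           the matching bootopts.netif_parms[] slot */
    }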

That's the general idea, I'll think a bit more about it after I read your wiki thoughts.

Anyway, I'll start off by creating a wiki-entry on the globals and the boot options - and possibly the current simple setup for NIC buffering used by the ne2k driver. As new drivers are added, new requirements to boot parameters pop up, like DMA chan for the Lance driver.

Great, this should work well!

ghaerr commented 2 months ago

I'm tempted to ditch (#ifdef) the shmem code entirely in order to reduce the size of the driver. Any opinions?

I'll leave it up to you to decide whether to ifdef the driver or just ditch the code. I continue to like simplicity and understandability; sometimes fewer ifdefs and less code are better. If the previous shmem code is checked in and marked, it would be easy to bring it back should the need arise.

I do have increasing opinions on ASM source being used in the kernel and drivers: with my recent work looking at the fantastic OpenWatcom C compiler (and of course its ASM format being incompatible with GNU's), I believe at this point it's FAR better to have most code written in C, adding ASM code only after testing for speed, and then using ASM library routines rather than special .S source files. Important instructions like I/O can be placed in kernel header files (e.g. io.h for outb etc) and then routines that just must be speedy (e.g. fast programmed I/O to/from a port) could be installed as kernel library routines (just like fmemcpy etc) for all to use.
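
For example (a sketch; the exact constraint syntax differs between gcc and OWC), the io.h primitive could be as simple as:

    /* sketch of the io.h idea - the I/O primitive as an inline, no .S file needed (gcc syntax) */
    #define outb(value, port) \
        __asm__ volatile ("outb %%al,%%dx" \
            : /* no outputs */ \
            : "a" ((unsigned char)(value)), "d" ((unsigned int)(port)))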

Getting away from ASM also allows for moving to netparm->irq etc quickly (described above), especially when those variables are accessed within some non-standard ASM routine. It's the old story: everything's great until the C compiler is changed, then all the old porting no-nos come back to bite. While I don't think the kernel will likely be changed over to OWC soon, there are some huge benefits to its very complete implementation of all the original 8086 memory models and its ability to produce programs much larger than 128k, especially now that the new ELKS loader can load native 16-bit OS/2 large binaries produced without modification by OWC. (ktcp could be compiled in large model, for instance, which would much more easily eliminate many of the problems related to tight memory we're seeing now. And I removed all ASM from ktcp some time ago :)

ghaerr commented 2 months ago

@Mellvik, I made an error with the example using fmemalloc above, confusing the system call with the C library wrapper. The above is correct for _fmemalloc which is the actual system call. For using fmemalloc in application programs, the following simpler code is used, using the C lib wrapper:

    unsigned long bytes = 1024;       // 1K byte alloc
    char __far *p = fmemalloc(bytes);
    if (!p) return -ENOMEM;

Mellvik commented 2 months ago

Thanks @ghaerr, much appreciated - lots of food for thought and very helpful. We have a lot of common ground here, and your practical suggestions are very welcome.

I will push a PR with the fixes for the issue at hand and close this thread, then open a new issue/thread for the 'driver generalization/compatibility' issue.

I do agree with you as to the use of asm code, while admitting that I have an old-fashioned affection for it. Moving (and generalizing) NIC asm code to a library has been on the list forever. Now is the time.

Now that I'm deep into ktcp I wanted to track down and fix the retransmit timeout calculation that kills tlvc-to-tlvc/elks-to-elks performance. Probably still an open issue on elks, it would be good to get it out of the way.

I've also been tracing a 20% performance hit when uploading files to TLVC via a router compared to same segment transfers (somewhat akin to what got this thread started in the first place except in the opposite direction). At this point it looks like the (good old) double ack may be part of the 'problem' - disturbing the 'rhythm of the flow'. Do you remember why we put in the double ack in the first place?

ghaerr commented 2 months ago

I will push a PR with the fixes for the issue at hand and close this thread, then open a new issue/thread for the 'driver generalization/compatibility' issue.

Sounds great! My plan would be to let you do all the work (lol :) since you're way further into this than I am. I'm happy to comment on your proposed design as the source is posted as a TLVC PR for all the NIC changes as well as anything required in init/main.c. I could then take the NIC source either by copying or directly by a PR from you and make the ELKS init/main.c changes so that ELKS networking runs on QEMU.

At that point, we would have two OSes running networking using the same NIC drivers, which I think could be advantageous for another data point on system throughput, reliability, etc. between operating systems on similar hardware. I believe getting the NIC drivers to be compatible is much more important than, say, character or block devices, which are more closely tied to the OS, whereas networking by its nature is about getting more systems communicating with each other. If the drivers come up and say TLVC, that's fine, hopefully more people may learn about TLVC that way.

Once we get to a basic level of driver source sharing, it will be easier for me to look at or suggest various far-buffer NIC or kernel solutions, etc by looking at actual code, instead of old code.

Now that I'm deep into ktcp I wanted to track down and fix the retransmit timeout calculation that kills tlvc-to-tlvc/elks-to-elks performance. Probably still an open issue on elks, it would be good to get it out of the way.

Sounds good. I'll go with the changes that you figure out.

Do you remember why we put in the double ack in the first place?

No, only vaguely. If you point out the previous code change I can look further into it. I think it might have had to do with ACKs not being resent in some circumstances, but I don't really remember.

Mellvik commented 2 months ago

@ghaerr, there is a fresh writeup on NIC driver development in the Developer Notes Wiki.

You may also find this one amusing, possibly even interesting :-): https://github.com/Mellvik/TLVC/wiki/The-ultimate-dev%E2%80%90environment

ghaerr commented 2 months ago

Very nice writing and write-ups @Mellvik. Both articles are very informative. I never knew about your Dream System (or Dream System II)!!

During reading, I noticed your statement "No buffers: The driver is moving data directly to/from the requester (via far mem-move), i.e. ktcp" which gave me more perspective for your earlier question about how ktcp will access far memory buffers (assumedly via read or write), which assume a pointer into the local ("near") data segment. Interestingly, this very problem came up during the port of the ELKS C library to Open Watcom in large model, where a far pointer is required for read/write, but the kernel only passes a near pointer.

The solution to allowing ktcp to read/write far buffers directly will be to use a wrapper function (perhaps readf/writef?) that loads DS just before the system call with the far memory segment. This, in turn ends up pushing the far segment DS onto the kernel stack and is later accessed via current->t_regs.ds which will have the far memory segment, rather than the program's default data segment, and everything works. This method can be used by any other system call passing a DS-relative pointer into the kernel. (The OWC syscall wrappers are at libc/watcom/syscalls/ for more detail). We should be able to do this without using an ASM .S file and we'll do this for gcc-ia16-elf when required.

Mellvik commented 2 months ago

Very interesting @ghaerr. Could you by any chance create such a wrapper for gcc? I don't see just off the bat how to load (and unload) the DS as you describe...

Dream System (taking the cue from Ubiquiti product naming :-) ) - maybe you should consider getting one? The fast one that is ...

ghaerr commented 2 months ago

Could you by any chance create such a wrapper for gcc?

Yes, we'll have to use some GCC black magic using its asm directive in a macro, something like:

#define loadDS(farptr) __extension__ ({  \
        unsigned int seg = (unsigned long)farptr >> 16;  \
        asm volatile ("mov %%ax,%%ds"  \
            : /* no output */  \
            :"a" seg \
            "memory"); \
        seg; })

size_t readf(int fd, char __far *buf, size_t count)
{
    loadDS(buf);
    return read(fd, (char *)(unsigned)buf, count);
}

The loadDS "expression" uses two GCC extensions, statement expressions and asm statement with machine constraints. The "memory" constraint isn't proper for notifying the compiler that DS has been modified - I'll have to further look into how to do this without previously saving and restoring DS around the system call.

ghaerr commented 2 months ago

@Mellvik, I figured out a pretty straightforward way to create syscall wrappers for gcc which set DS to the segment value of a passed far pointer, as well as all the other required registers (AX, BX, CX, and DX in this case), see ELKS Parameter Passing.

Here's an example for a far read (which uses system call 3):

#include <sys/types.h>

size_t readf(int fd, char __far *buf, size_t count)
{
    unsigned int seg = (unsigned long)buf >> 16;
    size_t ret;
    asm volatile ("push %%ds\n"
                  "mov %%ax,%%ds\n"
                  "mov $3,%%ax\n"
                  "int $0x80\n"
                  "pop %%ds\n"
                  : "=a" (ret)
                  : "a" (seg), "b" (fd), "c" ((unsigned)(void *)buf), "d" (count)
                 );
    return ret;
}

And check out this slimmed down, fast code generated (!):

# ia16-elf-objdump -D -r -Mi8086 readf.o
Disassembly of section .text:

00000000 <readf>:
   0:   89 e3                   mov    %sp,%bx
   2:   8b 47 06                mov    0x6(%bx),%ax // seg buf
   5:   8b 4f 04                mov    0x4(%bx),%cx // offset buf
   8:   8b 57 08                mov    0x8(%bx),%dx // count
   b:   8b 5f 02                mov    0x2(%bx),%bx // fd
   e:   1e                      push   %ds
   f:   8e d8                   mov    %ax,%ds
  11:   b8 03 00                mov    $0x3,%ax     // read
  14:   cd 80                   int    $0x80
  16:   1f                      pop    %ds
  17:   c3                      ret

Couldn't be any smaller :)

Mellvik commented 2 months ago

Very nice @ghaerr - thank you. gcc asm can obviously do magic, but for sure it's a black art...

ghaerr commented 2 months ago

Once we get to a basic level of driver source sharing, it will be easier for me to look at or suggest various far-buffer NIC or kernel solutions, etc by looking at actual code, instead of old code.

After taking a quick look at ktcp/tcp_output.c and the retransmit buffers, I suggest we stay with the above idea of getting the driver source sharing first, then start diving deeper into the messier problem of adding far buffers into application programs. Adding far buffers to the NIC drivers will probably be easier than adding them to applications, but that also needs research.

I now see (and remember) that ktcp's retransmit buffers are allocated individually using malloc when needed, instead of out of a single (near or far) memory buffer. Also, the retransmitted data isn't directly transferred via write, but goes through ip_sendpacket and further downstairs until actually being written using a system call. Each of those layers may need modification for far pointers...

I'm happy to suggest ideas and prepare us both for the deeper dive, but I also agree with your idea that we get some of the present networking enhancements or bug fixes in first. Once things work well on your systems, we will both be working off very similar if not identical NIC drivers and ktcp source, which will allow easier duplication (possibly with Dream hardware and a working NIC card finally? :)

ghaerr commented 2 months ago

gcc asm can obviously do magic, but for sure it's a black art...

I'm happy to add whatever is needed when we get to that point...