Mellvik / TLVC

Tiny Linux for Vintage Computers

Another megaupdate - focus on direct block IO drivers #16

Closed - Mellvik closed this 1 year ago

Mellvik commented 1 year ago

This PR includes a number of smaller fixes, including an (almost) full revert of the previous 'fix' (simplification) to the timer routines in the scheduler. It turns out I had misunderstood the code and introduced some hard-to-track-down problems with my simplification. The code is now almost back to where it was, while keeping the original bug fix (NULL initialization of a pointer). Also, and just as important, a bug in the execve implementation has been fixed.

The floppy driver has had a number of small fixes and now seems stable and ready for optimizations. The direct HD driver has had minimal adjustments since the previous PR.

There are still a few debug statements compiled into the direct floppy code, related to floppy writes, while the hunt for optimizations continues.

ghaerr commented 1 year ago

An elusive bug in the execve() system call has been fixed.

When commands are read from the floppy and connected by a pipe, and the first (source) command is much bigger than the second (receiver), the two will be loading concurrently and the latter (smaller) will complete loading first.

Nice fix!

Looking deeper into this problem, I think the issue is related to exactly how a process' kernel task (when executing a system call) may now be interrupted and rescheduled, versus before interrupt-driven block driver I/O was implemented, and all block I/O was synchronous.

The kernel itself is never allowed to be re-entered during a system call (with the exception of some special kernel interrupt handlers). That is to say, if a hardware interrupt occurs that requires a task switch and the system is executing kernel code at the time of the interrupt, the kernel will always be allowed to complete whatever system call is being executed (with the exception of specially-coded EINTR handling), and then, when finished and exiting the system call, schedule will be called to task switch. However, this does not prevent kernel code from calling schedule on behalf of the currently executing application; sleep_on etc. do that all the time to pause execution and allow another task to continue running.

What is now happening, is that, should any kernel task need to perform I/O (e.g. any filesystem call like read, write, open, etc), the system buffers are searched, and if the appropriate disk block buffer is not found, an I/O request is created and sleep_on is called, sleeping the process, with the kernel task stack frame containing the function call stack necessary for continuation of the system call when the I/O is complete. Previously, the I/O was synchronous, and a kernel task was never put to sleep in the middle of a filesystem function.

All of the kernel routines handling open, directory read, read, write (and possibly others) end up with the in-use filesystem code calling getblk/getblk32, which find or allocate a buffer and, if needed, call ll_rw_blk to start I/O for the block contents, which in turn calls the appropriate block driver dev_request routine. For the direct FD and HD drivers, this routine starts the I/O, and since end_request isn't called before returning, readbuf/wait_on_buffer sleeps the task until end_request is called by the block driver.
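
In outline, the read path looks something like this sketch (illustrative only - readbuf and wait_on_buffer are routine names from the flow above, but the bodies here are a simplified assumption, not the actual TLVC source):

/* sketch: a filesystem block read sleeps the calling task until the
 * block driver's interrupt handler calls end_request() */
static struct buffer_head *readbuf(struct buffer_head *bh)
{
    if (!buffer_uptodate(bh)) {
        ll_rw_blk(READ, bh);    /* queue the request; async driver returns at once */
        wait_on_buffer(bh);     /* sleep here until end_request() unlocks the buffer */
    }
    return bh;
}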

In this bug, multiple applications were calling execve. sys_exec calls filp->f_op->read to read in the executable, and any block requiring a disk read allowed that kernel task to sleep. A previously slept kernel task would then continue in sys_exec with its own stack frame, but the exe header variables are static and shared rather than part of that frame - thus a big mess.

The reason the large exe header variables were declared static is that there can currently be an extreme shortage of kernel stack space, depending on the exact sequence of successive kernel routines called to execute the system call. One can only tell the exact kernel stack usage by running the system using CONFIG_STRACE and watching closely. It could be OK to place the exe header variables on the kernel stack, but it depends on the filesystem being used.

Unfortunately, the FAT filesystem driver is loaded with static declarations meant to reduce kernel stack usage. Many of these routines may likely ultimately end up calling getblk/getblk32 and the currently executing task slept. So any applications using FAT open/read/write/exec are not yet safe with async I/O. There may be other kernel routines outside of filesystem code that perform I/O, all these need inspection.

An obvious possibility would be to increase KSTACK_BYTES in limits.h. This would work well, except that the stack bytes used are "times 16" for the 16 static task structures, and the kernel is very close to running out of space in the kernel data segment. It is my recollection the kernel stacks are as small as possible, with ~100 bytes to spare from STRACE testing on multiple filesystems. Another possibility is to use up/down semaphores within the kernel, but this needs careful thought as it could quickly affect the current speedups. Another thought would be to dynamically allocate task kernel stack size individually, using some way of specifying what kinds of things the application might (or might not) do.
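
To make the "times 16" arithmetic concrete, a quick sketch with illustrative numbers (not the actual limits.h values):

#define MAX_TASKS     16     /* static task structures, as above */
#define KSTACK_GROWTH 128    /* hypothetical per-task stack increase in bytes */

/* every byte added to KSTACK_BYTES costs MAX_TASKS bytes of kernel data,
 * so a 128-byte bump costs 16 * 128 = 2048 bytes of the kernel data segment */
#define EXTRA_KDATA   (MAX_TASKS * KSTACK_GROWTH)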

read requests queue up - in no particular order (at this time)

The MULTI_BH code will queue read and write requests in block number order in add_request using the "elevator algorithm".
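
For reference, the insertion amounts to roughly the sketch below (the rq_blocknr/rq_next field names follow this thread; the IN_ORDER macro and list handling here are simplified assumptions, not the actual add_request code):

/* sketch: insert a request into a device queue kept in block number order,
 * so the head sweeps the disk in one direction ("elevator") */
#define IN_ORDER(r1, r2)  ((r1)->rq_blocknr < (r2)->rq_blocknr)

static void elevator_insert(struct request **head, struct request *req)
{
    struct request *tmp = *head;

    if (!tmp || IN_ORDER(req, tmp)) {       /* empty queue or belongs at front */
        req->rq_next = tmp;
        *head = req;
        return;
    }
    while (tmp->rq_next && IN_ORDER(tmp->rq_next, req))
        tmp = tmp->rq_next;                 /* skip requests with lower block numbers */
    req->rq_next = tmp->rq_next;            /* splice into the sorted position */
    tmp->rq_next = req;
}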

Mellvik commented 1 year ago

Thank you @ghaerr, a very useful analysis indeed. Unsurprisingly I'm seeing more bugs/race conditions/whatever is the right term caused by this scenario. One of them is inodes being mixed up (like: mv linux .. ; ls / and autosync happening concurrently => /linux (the dir entry) ends up pointing to ls's inode).

Using KSTACK_TRACE for debugging this is a great tip, thanks. And changing the kernel stack size in limits.h to see what happens, too.

It seems to me there is only one path forward with this: Increase the kernel stack space. Which may in turn reduce the number of practically available task slots. Which brings up another ambition for TLVC: swapping. Most of the 'infrastructure' for swapping is already in place, and swapping to RAMdisk or XMS memory (EMS even) would be reasonably fast.

But first things first: Find and fix the places where reentrancy is required. Thanks for the hint about the FAT FS implementation.

BTW: Is the kernel capable of detecting its own stack overflow?

--M

PS: The directhd driver is not yet interrupt driven, which is a blessing until this is fixed. PS2: The raw driver obviously goes directly to ll_rw_blk, but that doesn't affect your analysis.

ghaerr commented 1 year ago

Using KSTACK_TRACE for debugging this is a great tip, thanks.

Use CONFIG_STRACE. It will show every system call executed, along with its arguments, return value, and kernel stack usage, as well as the max kernel stack ever used. It's quite verbose, so serial console to a scrollback window works well.

Is the kernel capable of detecting its own stack overflow?

Each application stack is checked during a system call entry by calling stack_check in arch/i86/kernel/irqtab.S just before strace is called (if CONFIG_STRACE is enabled). When strace is enabled, stack_check also fills in the kernel stack with 0x55, and then when the system call is exiting, strace counts how many bytes aren't 0x55 to calculate the amount of kernel stack used. In addition, the word at the kernel stack depth limit is set to a magic number (0x5476), and when strace is called, if that value isn't there, the kernel will panic with stack overflow.

So this works reasonably well when strace is enabled, but the kernel stack isn't checked when it's not enabled, nor while the kernel task is slept (i.e. set to a non-running state but not having exited the current system call), even when it is enabled. I suppose it could be a good idea to move the kernel stack-limit checking from strace into stack_check, which would then check for kernel stack overflow at syscall entry. (Better yet would be to call a kstack_check function on syscall exit or manually in places a developer thinks stack overflow may be occurring).
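
In outline, the fill-and-count mechanism amounts to something like this sketch (illustrative names, not the actual irqtab.S/strace symbols; only the 0x55 fill and 0x5476 magic follow the description above):

#define KSTACK_MAGIC 0x5476

/* syscall entry: the kernel stack is empty, so fill it with 0x55 and plant
 * the magic word at the depth limit (the stack grows down toward it) */
void kstack_fill(unsigned char *kstack, unsigned int size)
{
    memset(kstack, 0x55, size);
    *(unsigned short *)kstack = KSTACK_MAGIC;
}

/* syscall exit: any byte no longer reading 0x55 has been touched */
unsigned int kstack_used(unsigned char *kstack, unsigned int size)
{
    unsigned int i = 2;                     /* skip the magic word itself */

    if (*(unsigned short *)kstack != KSTACK_MAGIC)
        panic("kernel stack overflow");
    while (i < size && kstack[i] == 0x55)
        i++;                                /* count untouched fill bytes */
    return size - i;                        /* bytes actually used */
}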

And changing the kernel stack size in limits.h to see what happens, too.

If CONFIG_STRACE shows an execution path getting close to the limits, definitely. Strace also uses up even more kernel data segment space, so sometimes drivers need to be removed to make room, and the local kernel heap can get tight as well.

It seems to me there is only one path forward with this: Increase the kernel stack space. Which may in turn reduce the number of practically available task slots.

Ultimately, this would allow a more responsive system, although semaphores might be able to solve the FAT filesystem problem, with the caveat that users know "responsiveness might suffer, depending on system activity".

Over at ELKS, there's been discussion of dynamic allocation of task structures. This has a number of possibly very good benefits, essentially allowing the system to adapt a most precious resource, the kernel data segment, to whatever the system is trying to do, whether it's more tasks for a short period of time, or more local heap (think serial or pipe buffers) available. A possible big downside would be kernel heap fragmentation that wouldn't allow a full task struct to be allocated, thus freezing forks. But this is probably unlikely.

(Speaking of which - the TLVC network drivers and some other code I was looking at - they could allocate their local buffers from the kernel heap at open time, deallocate at close, so that kernel data segment buffers were only allocated when needed. The tty/pty/serial/pipe/ide drivers do this, use heap_alloc and heap_free).

I think ELKS could be converted to dynamic task allocation without compromising system stability, even though there are quite a few changes involved. I'd like to get a v0.7.0 out, then perhaps implement it. A further (nonstandard) enhancement could really help - allowing the application to specify its own kernel stack size (or having fork specify it) - if the programmer knew what kinds of things an application may be used for, then perhaps its kernel stack size could be made significantly smaller (such as the kernel idle task, or applications that never did filesystem I/O). This would allow for more simultaneous tasks and/or certain daemons to be made smaller.
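
Purely as a hypothetical sketch of what that could look like, building on the existing heap_alloc/heap_free primitives (the tag constant, the per-task stack size field and the function names are assumptions, not existing ELKS/TLVC code):

/* sketch: allocate a task struct plus a caller-sized kernel stack from the
 * kernel heap instead of from the static 16-entry task[] array */
struct task_struct *task_alloc(unsigned int kstack_bytes)
{
    struct task_struct *t;

    t = heap_alloc(sizeof(struct task_struct) + kstack_bytes, HEAP_TAG_TASK);
    if (!t)
        return NULL;                 /* heap full or fragmented: fork fails cleanly */
    memset(t, 0, sizeof(*t));
    t->kstack_size = kstack_bytes;   /* hypothetical per-task stack size field */
    return t;
}

void task_free(struct task_struct *t)
{
    heap_free(t);                    /* return task struct + stack to the heap */
}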

The directhd driver is not yet interrupt driven, which is a blessing until this is fixed.

I was thinking that another way to bring your system under control for close inspection might be to have a dynamic option to set NR_REQUESTS (the max size of the block driver request queue) to 1. Although that would still allow for async I/O, that would immediately stop any secondary applications from getting rescheduled/reentering the kernel, as there'd only be one outstanding I/O request at a time.

Which brings up another ambition for TLVC: swapping. Most of the 'infrastructure' for swapping is already in place, and swapping to RAMdisk or XMS memory (EMS even) would be reasonably fast.

Wow, you do have big plans! :) I'll have to take another look at good old UNIX V6 to see how it was implemented, I remember its implementation as being pretty small.

ghaerr commented 1 year ago

I'm seeing more bugs/race conditions/whatever is the right term caused by this scenario.

You mean the problems caused by async I/O? Agreed, but don't think this is a kernel stack size or reentrancy issue. I'm a little worried issues like that described below could be a buffer problem, and the buffer code is very complicated (not to mention that the BH_MULTI buffer code is entirely untested as I think it was grabbed from Linux and never compiled in).

One of them is inodes being mixed up (like: mv linux .. ; ls / and autosync happening concurrently => /linux (the dir entry) ends up pointing to ls's inode).

You mean mv /linux ..? That's a move to the same location, right? Not sure that mv should do anything in that case. Can you describe more exactly the kinds of problems you're seeing? How is auto sync turned on? I remember it used to be run as part of the idle task but that proved very problematic... is it just a sync that's run through cron or another process every 30 seconds?

Also, you mention elsewhere that TLVC is "running out of its request queue" when doing a 50k floppy copy... What is the exact error message occurring? I'm wondering what exactly is happening. Surely the sector writes get queued in the low level block code quickly, but after one third of NR_REQUEST (24?) = 8 the kernel should sleep the requesting process waiting for an I/O completion in the request queue before continuing. I'm looking at the buffer and block driver code and don't see where a message might be printed for this kind of error.

Mellvik commented 1 year ago

Interesting read, @ghaerr, thank you. The stack_check mechanism is actually pretty smart. I agree with you, moving the stack-limit check into stack_check is a good idea!

Over at ELKS, there's been discussion of dynamic allocation of task structures. This has a number of possibly very good benefits, essentially allowing the system to adapt a most precious resource, the kernel data segment, to whatever the system is trying to do, whether it's more tasks for a short period of time, or more local heap (think serial or pipe buffers) available. A possible big downside would be kernel heap fragmentation that wouldn't allow a full task struct to be allocated, thus freezing forks. But this is probably unlikely.

Interesting - although it does sound like quite an undertaking, meaning quite a bit of code and footprint...

(Speaking of which - the TLVC network drivers and some other code I was looking at - they could allocate their local buffers from the kernel heap at open time, deallocate at close, so that kernel data segment buffers were only allocated when needed. The tty/pty/serial/pipe/ide drivers do this, use heap_alloc and heap_free).

Agreed, I actually had the same idea after using heap_alloc in the directhd driver. BTW there is a lot of work pending to optimize the buffer usage in the ethernet drivers. Right now far down the list though.

I think ELKS could be converted to dynamic task allocation without compromising system stability, even though there are quite a few changes involved. I'd like to get a v0.7.0 out, then perhaps implement it. A further (nonstandard) enhancement could really help - allowing the application to specify its own kernel stack size (or having fork specify it) - if the programmer knew what kinds of things an application may be used for, then perhaps its kernel stack size could be made significantly smaller (such as the kernel idle task, or applications that never did filesystem I/O). This would allow for more simultaneous tasks and/or certain daemons to be made smaller.

I'll keep an eye on this development.

The directhd driver is not yet interrupt driven, which is a blessing until this is fixed.

I was thinking that another way to bring your system under control for close inspection might be to have a dynamic option to set NR_REQUESTS (the max size of the block driver request queue) to 1. Although that would still allow for async I/O, that would immediately stop any secondary applications from getting rescheduled/reentering the kernel, as there'd only be one outstanding I/O request at a time.

That's an interesting idea indeed. And it's an easy thing to switch back and forth. Thanks.

Which brings up another ambition for TLVC: swapping. Most of the 'infrastructure' for swapping is already in place, and swapping to RAMdisk or XMS memory (EMS even) would be reasonably fast.

Wow, you do have big plans! :) I'll have to take another look at good old UNIX V6 to see how it was implemented, I remember its implementation as being pretty small.

Spot on, Lion's Commentary on v6 source is on my desk ...

--M

Mellvik commented 1 year ago

Thanks for the comments, @ghaerr.

I'm seeing more bugs/race conditions/whatever is the right term caused by this scenario.

You mean the problems caused by async I/O? Agreed, but don't think this is a kernel stack size or reentrancy issue. I'm a little worried issues like that described below could be a buffer problem, and the buffer code is very complicated (not to mention that the BH_MULTI buffer code is entirely untested as I think it was grabbed from Linux and never compiled in).

OK, that's possible (BTW, BH_MULTI is not in use). While researching the execve problem, I carefully traced through the entire buffer system w/o finding anything suspicious, so reentrancy is currently my primary suspect, followed by stack issues.

One of them is inodes being mixed up (like: mv linux .. ; ls / and autosync happening concurrently => /linux (the dir entry) ends up pointing to ls's inode).

You mean mv /linux ..? That's a move to the same location, right? Not sure that mv should do anything in that case. Can you describe more exactly the kinds of problems you're seeing? How is auto sync turned on? I remember it used to be run as part of the idle task but that proved very problematic... is it just a sync that's run through cron or another process every 30 seconds?

The example is correct, moving the file 'linux' (could be anything, but it has significant size so it's a good target) up one level. Sync is unchanged from ELKS, governed by the SYNC= bootopts directive, but how sync is done doesn't seem to matter, it may be manual for that matter. It's when it happens that matters, and the problem, while occurring frequently, is not repeatable in any exact way. And it rarely happens on QEMU, more often as the hardware gets slower. And only when running off of floppy as root.

BTW, here is another interesting one:

-M

ghaerr commented 1 year ago

BTW, here is another interesting one:

  • boot from directfd, root on directfd
  • mount Fat partition on /mnt
  • ls /mnt (from serial console) - only a few files show
  • echo /mnt/* — all files show
  • ls /mnt (still serial console) again, more files show (but not all)

  -> The number of files listed by ls is random, sometimes all, sometimes a few, and it's not restricted to a particular directory on the fat fs.
  -> Debug output from ls shows that it always gets all files from directory reads.
  -> The problem happens only when logged in via serial, not telnet, not console.
  -> For a while it seems to not affect system operation, but eventually the file system on the floppy becomes totally screwed up and non-repairable.

It's a rabbit hole I've avoided so far - assuming we're likely seeing 2nd or 3rd order consequences of reentrancy problems or stack problems. First on my list is to take a look at what will be overwritten if the stack grows too large. But this comes after finding and fixing the problem above - and may actually have the same root cause.

With the above, the problem is most definitely caused by kernel routines being reentered in the FAT filesystem driver, with the same screwy effects likely seen before sys_exec was fixed. Some time ago, when STRACE showed the system had kernel stack overflow (by crashing or some such) when the FAT filesystem was used, I purposely added a bunch of static declarations to all (even small) buffer declarations in fs/msdos/*.c. These are all clearly marked as:

./dir.c:    /* static not reentrant: conserve stack usage*/
./dir.c:            static char longname[14]; /* static not reentrant: conserve stack usage*/
./dir.c:    /* static not reentrant: conserve stack usage*/
./namei.c:  /* static not reentrant: conserve stack usage*/
./namei.c:  /* static not reentrant: conserve stack usage*/
./namei.c:  /* static not reentrant: conserve stack usage*/
./namei.c:  /* static not reentrant: conserve stack usage*/

In order to test this, consider doing the following:

Since write requests are being queued and the call returning, they will be piling up fast, something that could never happen with the BIOS drivers. Which mandated introducing a wait and wake_up regime in the make_request routine. A small thing really, although it’s on my list to experiment with the size of the request queue (currently increased from 20 to 24).

I see - so you're saying the kernel runs well and does sleep the task doing all the writes after 1/3 NR_REQUESTS? That seems to be what __get_request_wait is supposed to do. It might be interesting to display the current size of the read and write queues in get_request on each I/O request. I would think that the idea isn't to increase the queue size too much, but rather to increase it just enough so that the normal mix of applications allows high responsiveness without allowing excess I/O to queue up. That is, the queue size would be just large enough to allow all I/O-bound tasks to queue a single I/O before the first request finishes, and allows the app CPU / I/O cycle to continue...
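
For context, the throttling being discussed follows roughly this pattern (a sketch loosely modeled on the Linux 2.0 routine of the same name; the wait queue variable and the get_request signature are assumptions):

/* sketch: block the caller until a request slot frees up */
static struct request *__get_request_wait(int max_req, kdev_t dev)
{
    struct request *req;

    for (;;) {
        req = get_request(max_req, dev);   /* NULL when too few slots are free */
        if (req)
            return req;
        sleep_on(&wait_for_request);       /* woken from end_request() */
    }
}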

(BTW, BH_MULTI is not in use).

Thanks, I missed that. I will take a deeper look - it had seemed to me that this option would want to be defined for async I/O.

Mellvik commented 1 year ago

That's a lot of interesting input @ghaerr.

Since write requests are being queued and the call returning, they will be piling up fast, something that could never happen with the BIOS drivers. Which mandated introducing a wait and wake_up regime in the make_request routine. A small thing really, although it’s on my list to experiment with the size of the request queue (currently increased from 20 to 24).

I see - so you're saying the kernel runs well and does sleep the task doing all the writes after 1/3 NR_REQUESTS? That seems to be what __get_request_wait is supposed to do.

That's right, it still does - adapted to the current setting.

It might be interesting to display the current size of the read and write queues in get_request on each I/O request. I would think that the idea isn't to increase the queue size too much, but rather to increase it just enough so that the normal mix of applications allows high responsiveness without allowing excess I/O to queue up. That is, the queue size would be just large enough to allow all I/O-bound tasks to queue a single I/O before the first request finishes, and allows the app CPU / I/O cycle to continue...

That's an interesting angle. There are many conditions affecting this, not least the speed of the I/O device. Thanks, it's now on my list to investigate.

(BTW, BH_MULTI is not in use).

Thanks, I missed that. I will take a deeper look - it had seemed to me that this option would want to be defined for async I/O.

Only partially true, I admit. Elements of MULTI_BH are being used, such as the __get_request_wait discussed above. The major part of MULTI_BH is about speed optimization, which was deemed too bloated when I adapted the driver. Interesting ideas though, with the plug/unplug concept, effectively blocking I/O to build up a queue and then letting go.

I might decide to look closer at it later, but it's useful for floppies only.

Thanks.

-M

ghaerr commented 1 year ago

I was thinking that another way to bring your system under control for close inspection might be to have a dynamic option to set NR_REQUESTS (the max size of the block driver request queue) to 1.

Your work has inspired me to clean up some of the source in the block I/O system on ELKS. While doing that, I realized that ELKS itself had allocated a request queue with NR_REQUESTS = 20, but with synchronous I/O could save memory using only 1. However, I found that when set to 1, the make_request code handling WRITEs calculating max_request at 2/3 NR_REQUEST set the value to 0, which ended up with an array out of bounds error. So more code needs to be changed for NR_REQUEST = 1 to work to force synchronous I/O, FYI! I ended up using NR_REQUESTS = 2 for the time being.

Also, I was looking at the IN_ORDER macro, and noticed that ELKS was using an older elevator algorithm where it was thought that prioritizing reads over writes helped filesystem throughput. In Linux 2.0, this was changed to sector order only with no read/write prioritization, with a statement that this improved throughput. I noticed your statement in TLVC that the algorithm has been updated to "standard" Linux, and your algorithm prioritizes writes over reads - did you find that Linux changed this yet again in versions after v2.0?

Mellvik commented 1 year ago

Very interesting @ghaerr.

I was thinking that another way to bring your system under control for close inspection might be to have a dynamic option to set NR_REQUESTS (the max size of the block driver request queue) to 1.

Your work has inspired me to clean up some of the source in the block I/O system on ELKS. While doing that, I realized that ELKS itself had allocated a request queue with NR_REQUESTS = 20, but with synchronous I/O could save memory using only 1. However, I found that when set to 1, the make_request code handling WRITEs calculating max_request at 2/3 NR_REQUEST set the value to 0, which ended up with an array out of bounds error. So more code needs to be changed for NR_REQUEST = 1 to work to force synchronous I/O, FYI! I ended up using NR_REQUESTS = 2 for the time being.

Thanks for sharing, that helps - and actually makes sense. With synchronous io only, it would make sense to scale the size of the request queue with the number of devices, since there may be parallel activities that should not have to wait in line. Possibly even per partition, I haven't thought that one through yet. Pending that, 16 or 20 may not be all that bad a choice. Like I (think I) said before, 24 was experimental.

Also, I was looking at the IN_ORDER macro, and noticed that ELKS was using an older elevator algorithm where it was thought that prioritizing reads over writes helped filesystem throughput. In Linux 2.0, this was changed to sector order only with no read/write prioritization, with a statement that this improved throughput. I noticed your statement in TLVC that the algorithm has been updated to "standard" Linux, and your algorithm prioritizes writes over reads - did you find that Linux changed this yet again in versions after v2.0?

No, my research didn't go that far. It has been a while, but I remember looking at the elks variant vs the current (which was commented out in the code IIRC), and chose this because it made more sense. I also made a note to come back later when the floppy driver was done.

Since then I found that the buffer system is doing the opposite. Sync_buffers is writing out all dirty buffers backwards, giving the request code a lot of work to do. An easy fix, but I haven't had time to evaluate the results yet. Not much time to work on this recently, but the elusive problem I've been chasing - which seemed like reentrancy causing buffer pollution - is in the request code in ll_rw_blk. One or more writes are waiting for available requests, and when released they sometimes pick the data from a recent read buffer. Race condition - hope to get to look at it tomorrow.

Actually, the stack tracing turned out to be really helpful in chasing this. Not the data delivered (stacks are ok for now) but the way it slows down the system. Finally a predictable, repeatable error situation - although only on my slowest machine.

Thanks, M

ghaerr commented 1 year ago

With synchronous io only, it would make sense to scale the size of the request queue with the number of devices, since there may be parallel activities that should not have to wait in line.

Actually, I want to make sure we're on the same page here - with synchronous I/O, and the way that add_request works, there will never be more than one request structure in use at any time, regardless of the number of block devices. That is, add_request always first checks the head of the block device's request queue, (which will be necessarily empty, since all I/O is synchronous), and if empty immediately calls the driver's request function (do_bioshd_request, etc). In ELKS, that function won't/can't return until I/O is complete, at which point end_request is called, which removes the request from the driver's request queue. So only a single request buffer is ever used, period, even if 2 are declared.

I mention all this because I had originally thought that TLVC could be throttled back to eliminate non-race problems by setting NR_REQUESTS to 1 (as well as my attempt at setting it to 1 on ELKS), except that I found the following line of code in get_request will cause an array out-of-bounds, even on the first write request:

        req = ((req > all_requests) ? req : limit) - 1;

That is, when NR_REQUEST is 1, two thirds of 1 is 0, limit is set to &all_requests, and the above line resets the req pointer to a value one less than the start of the all_requests array (and subsequently scribbles kernel memory)! That's why I had to set NR_REQUESTS to 2. I plan on looking into a better fix in make_request so that this doesn't happen. I think changing the following line in make_request should solve the problem (when NR_REQUEST is set to 1):

max_req = (NR_REQUEST * 2) / 3;
(change to):
max_req = NR_REQUEST;

With this mod in place, things should work with fully async I/O, except there'd only be a single request outstanding at a time (but still allowing for the I/O complete interrupt callback to release its buffer, thus testing that aspect of the new system).

One or more writes are waiting for available requests, and when released they sometimes pick the data from a recent read buffer. Race condition

Yes, I am very interested to learn whether this is a problem with kernel buffer code, or something else, like a driver bug. That's why if the request buffer could be narrowed down to only one, the system could still run async I/O but eliminate multiple race issues - and it would be interesting to see if the bug is still seen or not.

The request structure has fields for the buffer head, as well as the block number. Perhaps displaying req->rq_blockno and req->rq_bh->b_blocknr as well as the (supposedly matching) unlock_buffers values for the same block number would show more information as to where/when this is occurring.

the stack tracing turned out to be really helpful in chasing this. Not the data delivered (stacks are ok for now) but the way it slows down the system.

After thinking through these issues, I got kind of excited and ended up pulling together a bunch of kernel consistency checks for buffer code (e.g. guaranteeing the buffer free count is accurate), scheduler checks (e.g. checking for schedule being erroneously called from an interrupt handler), block driver checks, as well as rewriting strace completely to allow it to be dynamically turned on and off, and adding full-blown kernel stack and interrupt stack overflow and near overflow checking. (PR #1637). I plan on using it to look deeper into the kernel stack and interrupt stack usage with the FAT (and for ELKS, sys_execv) buffers declared statically versus not. The new stack tracing will show the max stack usage per system call per process, to get a very accurate feel. I guess you could say your work is motivating me to figure out what still might be wrong with the ELKS kernel, and otherwise verify correctness during development :)

I remember looking at the elks variant vs the current (which was commented out in the code IIRC), and chose this because it made more sense. I also made a note to come back later when the floppy driver was done. Since then I found that the buffer system is doing the opposite. Sync_buffers is writing out all dirty buffers backwards, giving the request code a lot of work to do.

Yes, it seems sync_buffers traverses the buffer LRU list and calls ll_rw_blk on each one of them that's marked dirty. It's possible this could be rewritten to traverse the list in the opposite order, if from what you've been seeing you think that would help.

I have some more thoughts on the syncing, gathered after studying the bdflush mechanisms of Linux and other OSes. It seems that Linux has taken up the idea of limiting the number of buffers written each cycle, along with some other nice changeable parameters, in order to not bog down the system too much.
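
A sketch of that idea, as a hypothetical variant of sync_buffers with a write-count limit (the list-walking helpers and field names are illustrative, not the actual buffer code):

/* sketch: write back at most 'ndirty' dirty buffers per pass so a sync
 * doesn't flood the request queue all at once */
static int sync_buffers_limited(kdev_t dev, int ndirty)
{
    struct buffer_head *bh;
    int written = 0;

    for (bh = bh_lru_tail; bh && written < ndirty; bh = bh_lru_prev(bh)) {
        if (dev && bh->b_dev != dev)
            continue;                      /* syncing a specific device only */
        if (buffer_dirty(bh) && !buffer_locked(bh)) {
            ll_rw_blk(WRITE, bh);          /* queue the write, don't wait here */
            written++;
        }
    }
    return written;                        /* caller can schedule another pass later */
}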

If you have other test shell scripts that heavily exercise the system, I'd like to try some of them out. I remember a few years back we saw an ELKS buffer problem, which I can't remember how it was fixed; currently I'm not able to get anything to break (synch I/O only, of course). It would be nice to completely eliminate the possibility of any remaining buffer problems occurring with synchronous I/O. (It comes to mind that the L1 cache may also be an issue with async I/O, I'll have to think about that a bit more).

Mellvik commented 1 year ago

Actually, I want to make sure we're on the same page here - with synchronous I/O, and the way that add_request works, there will never be more than one request structure in use at any time, regardless of the number of block devices. That is, add_request always first checks the head of the block device's request queue, (which will be necessarily empty, since all I/O is synchronous), and if empty immediately calls the driver's request function (do_bioshd_request, etc). In ELKS, that function won't/can't return until I/O is complete, at which point end_request is called, which removes the request from the driver's request queue. So only a single request buffer is ever used, period, even if 2 are declared.

Sorry, my bad. My head was on a different stack I've been studying. To be revisited. There may be a race condition right there, in add_request.

I mention all this because I had originally thought that TLVC could be throttled back to eliminate non-race problems by setting NR_REQUESTS to 1 (as well as my attempt at setting it to 1 on ELKS), except that I found the following line of code in get_request will cause an array out-of-bounds, even on the first write request:

        req = ((req > all_requests) ? req : limit) - 1;

That is, when NR_REQUEST is 1, two thirds of 1 is 0, limit is set to &all_requests, and the above line resets the req pointer to a value one less than the start of the all_requests array (and subsequently scribbles kernel memory)! That's why I had to set NR_REQUESTS to 2. I plan on looking into a better fix in make_request so that this doesn't happen. I think changing the following line in make_request should solve the problem (when NR_REQUEST is set to 1):

max_req = (NR_REQUEST * 2) / 3;
(change to):
max_req = NR_REQUEST;

With this mod in place, things should work with fully async I/O, except there'd only be a single request outstanding at a time (but still allowing for the I/O complete interrupt callback to release its buffer, thus testing that aspect of the new system).

Agreed, without the req queue there is no problem. Also, using kernel buffers only - which means just a few, I forget the number off hand, seemingly eliminates the problem. I do suspect though that I can force it back by giving a sync command at exactly the right moment. What triggers the problem is a sync while still writing more data.

One or more writes are waiting for available requests, and when released they sometimes pick the data from a recent read buffer. Race condition

Yes, I am very interested to learn whether this is a problem with kernel buffer code, or something else, like a driver bug. That's why if the request buffer could be narrowed down to only one, the system could still run async I/O but eliminate multiple race issues - and it would be interesting to see if the bug is still seen or not. The request structure has fields for the buffer head, as well as the block number. Perhaps displaying req->rq_blockno and req->rq_bh->b_blocknr as well as the (supposedly matching) unlock_buffers values for the same block number would show more information as to where/when this is occurring.

The buffer code is in the clear. I've verified that the data passed down to the req level is correct.

the stack tracing turned out to be really helpful in chasing this. Not the data delivered (stacks are ok for now) but the way it slows down the system.

After thinking through these issues, I got kind of excited and ended up pulling together a bunch of kernel consistency checks for buffer code (e.g. guaranteeing the buffer free count is accurate), scheduler checks (e.g. checking for schedule being erroneously called from an interrupt handler), block driver checks, as well as rewriting strace completely to allow it to be dynamically turned on and off, and adding full-blown kernel stack and interrupt stack overflow and near overflow checking. (PR #1637).

Well, how 'bout that, exciting indeed! I'll have a good look at that. In addition to the actual data we get from new tools I'm always curious about the 'load' they represent - as in how much they affect the target. It's incredibly convenient to debug on qemu, but on qemu everything is synchronous thanks to the speed. Even floppy io with stack tracing.

I plan on using it to look deeper into the kernel stack and interrupt stack usage with the FAT (and for ELKS, sys_execv) buffers declared statically versus not. The new stack tracing will show the max stack usage per system call per process, to get a very accurate feel. I guess you could say your work is motivating me to figure out what still might be wrong with the ELKS kernel, and otherwise verify correctness during development :)

Glad to hear that :-)

I remember looking at the elks variant vs the current (which was commented out in the code IIRC), and chose this because it made more sense. I also made a note to come back later when the floppy driver was done. Since then I found that the buffer system is doing the opposite. Sync_buffers is writing out all dirty buffers backwards, giving the request code a lot of work to do.

Yes, it seems sync_buffers traverses the buffer LRU list and calls ll_rw_blk on each one of them that's marked dirty. It's possible this could be rewritten to traverse the list in the opposite order, if from what you've been seeing you think that would help.

I think I changed 4 letters in 2 or 3 lines to reverse it :-). 

I have some more thoughts on the syncing, gathered after studying the bdflush mechanisms of Linux and other OSes. It seems that Linux has taken up the idea of limiting the number of buffers written each cycle, along with some other nice changeable parameters, in order to not bog down the system too much.

I've been considering adding a 3rd arg to sync_buffers and experimenting with that, now that I have the rigging in place.

If you have other test shell scripts that heavily exercise the system, I'd like to try some of them out. I remember a few years back we saw an ELKS buffer problem, which I can't remember how it was fixed; currently I'm not able to get anything to break (synch I/O only, of course). It would be nice to completely eliminate the possibility of any remaining buffer problems occurring with synchronous I/O. (It comes to mind that the L1 cache may also be an issue with async I/O, I'll have to think about that a bit more).

I've put some extra protection in a couple of places in the buffer code during the process. The plan is to eliminate one after the other when things are smooth. No scripts, just command sequences found to trigger the problems. For a while incoming ftp was a sure hit, now copy from hd to floppy works to trigger the problem. Which - by the way - is that some blocks in the file written have bad data, data that occur later in the file.

Thanks, M

ghaerr commented 1 year ago

There may be a race condition right there, in add_request.

Looking at add_request, I find that interrupts are enabled before calling the device request function:

static void add_request(struct blk_dev_struct *dev, struct request *req)
{
    struct request *tmp;

    clr_irq();
    mark_buffer_clean(req->rq_bh);
    if (!(tmp = dev->current_request)) {
        dev->current_request = req;
        set_irq(); // <- try commenting this out
        (dev->request_fn) ();
    }

In reading through the kernel source, I've found conflicting information as to whether the device request function is called with interrupts enabled, or disabled. It might be worth commenting out the interrupt enable above. This could possibly result in interrupts remaining disabled for longer than desired (at least in the short term), but that could be figured out later if this were to fix the race problem.

I do suspect though that I can force it back by giving a sync command at exactly the right moment. What triggers the problem is a sync while still writing more data.

Now that is very interesting! Are you saying that the current race problem can usually be triggered by a sync during or near I/O? If so, then perhaps the following could be done: the current sync loops twice through the buffers, the second time waiting for I/O to complete using wait_on_buffer. If the following line were to be commented out, this would stop the second loop through, at the cost of not syncing an I/O buffer that's in progress at the time of sync:

void fsync_dev(kdev_t dev)
{
    sync_buffers(dev, 0);
    sync_supers(dev);
    sync_inodes(dev);
    sync_buffers(dev, 1); // <- try commenting out this line
}

If this works, then it would seem there's a problem in wait_on_buffer. Even with the fix in, removing the last line above would cause sync problems if I/O were ongoing during an unmount, but that also could be figured out later if this turns out to be the culprit.

Also, using kernel buffers only - which means just a few, I forget the number off hand, seemingly eliminates the problem.

When you say "kernel buffers only" - do you mean non-XMS buffers, or also non-EXT (main memory) buffers, that is, only using the L1 buffers and no XMS or EXT buffers?

I'm always curious about the 'load' they represent - as in how much they affect the target. It's incredibly convenient to debug on qemu, but on qemu everything is synchronous thanks to the speed. Even floppy io with stack tracing.

QEMU actually implements an asynchronous HW timer interrupt, but yes, everything is so fast that the system operates basically synchronously, at least as far as disk I/O is concerned. I was thinking that one could change that by slightly modifying the floppy read/write command complete interrupt: instead of immediately calling end_request(1) which dequeues the request and wakes up any waiting tasks (for either the buffer or the request queue), a kernel timer could be set for a typical disk read/write time and having the kernel callback a routine that would perform the end_request after the timeout. This should work to allow QEMU to debug async I/O.
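
Something like the sketch below could do it (the timer_list field names and the floppy routine names are assumptions modeled on the kernel timer interface, not the actual directfd code):

/* sketch: complete the request from a timer callback instead of directly in
 * the FDC interrupt handler, so QEMU behaves like a slow drive */
static struct timer_list fd_delay;

static void fd_delayed_done(int unused)
{
    end_request(1);                          /* dequeue + wake waiters, as before */
}

static void fd_rw_interrupt(void)            /* FDC read/write complete */
{
    del_timer(&fd_delay);
    fd_delay.tl_expires = jiffies + HZ / 10; /* pretend ~100 ms of disk time */
    fd_delay.tl_function = fd_delayed_done;
    add_timer(&fd_delay);
}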

ghaerr commented 1 year ago

then it would seem there's a problem in wait_on_buffer

I looked hard at wait_on_buffer, and it is unsafe for async I/O (although I'm not sure if this is the exact problem you're seeing). The problem is that in the current version, bh->b_locked is checked, and if it's locked, meaning I/O is in progress, then the process is put to sleep. The problem is (a classic "test a variable but it's changed by an interrupt handler") that unlock_buffer, and its wake_up, is called by end_request from the interrupt handler, which can change the value back to unlocked between the test and the process actually going to sleep - so the wakeup is lost and the process is never woken up. This process could be /bin/init, which happens to be handling the auto-sync mechanism, so in certain race cases, the buffer would never be written out to disk, and the user wouldn't ever know it.

I just fixed this in ELKS (PR 1640), and suggest something like the following for TLVC:

void wait_on_buffer(struct buffer_head *bh)
{
    ext_buffer_head *ebh = EBH(bh);
    INR_COUNT(ebh);
    wait_set((struct wait_queue *)bh);       /* use bh as wait address */
    for (;;) {
        current->state = TASK_UNINTERRUPTIBLE;  /* set state BEFORE testing b_locked */
        if (!ebh->b_locked)                     /* so a wake_up from end_request() */
            break;                              /* can't slip in between test and sleep */
        schedule();
    }
    wait_clear((struct wait_queue *)bh);
    current->state = TASK_RUNNING;
    DCR_COUNT(ebh);
}

Mellvik commented 1 year ago

Thanks again, @ghaerr - There is a lot of good reasoning here and - unsurprisingly - we’re frequently thinking along the same lines.

I've been experimenting with irq protection at various places including the one you suggested (and tested this particular one again just to make sure) with no success. I also tested a small value of NR_REQUESTS (4), which changes the block # where the erroneous data start, but nothing else. Even the sync change was tested a while back, and for a short moment it seemed to fix it, but it turned out it was the timing of the sync that made a difference, not the sync itself.

Since then the test case has been narrowed down so it’s completely repeatable (not the least helped by keeping the stack trace going) - and I’ve finally been able to pinpoint the exact problem - to be detailed in a separate message.

Also, using kernel buffers only - which means just a few, I forget the number off hand, seemingly eliminates the problem.

When you say "kernel buffers only" - do you mean non-XMS buffers, or also non-EXT (main memory) buffers, that is, only using the L1 buffers and no XMS or EXT buffers?

That's right, L1 buffers only. And it eliminates the problem - seemingly with that few buffers the cause of the problem never arises (as I found out later, L1 buffers are involved in the problem, but in a very peculiar way).

I'm always curious about the 'load' they represent - as in how much they affect the target. It's incredibly convenient to debug on qemu, but on qemu everything is synchronous thanks to the speed. Even floppy io with stack tracing.

QEMU actually implements an asynchronous HW timer interrupt, but yes, everything is so fast that the system operates basically synchronously, at least as far as disk I/O is concerned. I was thinking that one could change that by slightly modifying the floppy read/write command complete interrupt: instead of immediately calling end_request(1) which dequeues the request and wakes up any waiting tasks (for either the buffer or the request queue), a kernel timer could be set for a typical disk read/write time and having the kernel callback a routine that would perform the end_request after the timeout. This should work to allow QEMU to debug async I/O.

Actually, this is a very interesting idea. Using the QEMU flag in /bootopts could enable this delay. And very uncomplicated…

—M

Mellvik commented 1 year ago

@ghaerr,

the elusive 'data pollution' problem has finally been located and there is a crude fix in place; the real fix is pending an actual root-cause discovery.

The problem originates in this piece of code from make_request:

#if defined(MULTI_BH) || defined(FLOPPYDISK)
        req = __get_request_wait(max_req, buffer_dev(bh));
#else
        panic("Can't get request.");
#endif

    }

    /* fill up the request-info, and add it to the queue */
    clr_irq();
    req->rq_cmd = (__u8) rw;
    req->rq_bh = bh;
#ifdef CONFIG_BLK_DEV_CHAR
    if (blks) {         /* raw device access, blocks = sectors */
        req->rq_buffer = bh->b_data; /* pointing to process space */
        req->rq_seg = ebh->b_seg;
        req->rq_nr_sectors = blks;
        req->rq_blocknr = ebh->b_blocknr;
    } else
#endif
    {
        req->rq_blocknr = buffer_blocknr(bh) * BLOCK_SIZE / 512;
        req->rq_seg = buffer_seg(bh);         
        req->rq_buffer = buffer_data(bh);      
    }

#ifdef BLOAT_FS
    req->rq_nr_sectors = count;
    req->rq_current_nr_sectors = count;
#endif

    req->rq_next = NULL;
    set_irq();
    printk("|seg%lx|", (unsigned long)req->rq_seg); 
    add_request(&blk_dev[major], req);
}

CONFIG_BLK_DEV_CHAR is off for simplicity. What happens is that the data in req->rq_seg and req->rq_buffer sometimes have L1 values instead of the correct L2 values. All other data being copied into the req are fine, including the bh pointer - which evidently has correct data. So while req->rq_bh->b_seg is correct, req->rq_seg is not - in spite of the assignment and the irq protection.

Here's an excerpt from redo_fd_floppy with some messy debug output that demonstrates the case:

        if (MAJOR(req->rq_dev) != MAJOR_NR)
            panic("%s: request list destroyed", DEVICE_NAME);
        if (req->rq_bh && !EBH(req->rq_bh)->b_locked)
                panic("%s: block not locked", DEVICE_NAME);
    }
    seek = 0;
    probing = 0;
    device = MINOR(CURRENT_DEVICE) >> MINOR_SHIFT;
    drive = device & 3;
    if (device > 3)
        floppy = (device >> 2) + floppy_type;
    else {                      /* Auto-detection */
        floppy = current_type[drive];
        if (!floppy) {
            probing = 1;
            floppy = base_type[drive];
            if (!floppy) {
                request_done(0);
                goto repeat;
            }
            if (CURRENT_ERRORS & 1)
                floppy++;
        }
    }
/*DEBUG STUFF*/
    content = _MK_FP(req->rq_seg, (unsigned int) req->rq_buffer);
    content2 = _MK_FP(buffer_seg(req->rq_bh), (unsigned int) buffer_data(req->rq_bh));
    if (*content != *content2)
        printk("Data error blk %d(%lx/%04x:%lx/%04x)\n", (int)req->rq_blocknr,
        (unsigned long)req->rq_seg, (unsigned) req->rq_buffer,
        (unsigned long) buffer_seg(req->rq_bh), (unsigned) buffer_data(req->rq_bh));
    printk("flpio-%c (%04x) b/c %u/%04x\n", req->rq_cmd == WRITE? 'W':'R', req,
                (int)req->rq_blocknr, *content);
/*DEBUG END*/
    if (format_status != FORMAT_BUSY) {
        unsigned int tmp;
        if (current_drive != drive) {
            current_track = NO_TRACK;
            current_drive = drive;
        }

And sure enough, the bufheads always deliver correct data, while the reqs sometimes fail (please ignore all the other clutter from the debug output excerpt; 0x13f2 is the kernel DS):

|seg2400|AQdc78|e8e8;ERdc66|e9c0;flpio-W (dc54) b/c 1962/83ff
|seg2400|AQdc66|ea68;ERdc54|e9d8;Data error blk 1964(13f2/ab54:2400/3800)
flpio-W (dc42) b/c 1964/0901
|seg2400|AQdc54|e828;ERdc42|e900;Data error blk 1966(13f2/b354:2400/2000)
flpio-W (dc30) b/c 1966/03cc
|seg2400|AQdc42|e858;ERdc30|e870;Data error blk 1968(13f2/8b54:2400/4c00)
flpio-W (dc1e) b/c 1968/fe4e
|seg2400|AQdc30|e990;ERdc1e|e978;Data error blk 1970(13f2/9754:2400/2800)
flpio-W (dc0c) b/c 1970/9d56
|seg2400|AQdc1e|e840;ERdc0c|e8a0;flpio-W (dcd2) b/c 6/ffff
|seg2400|AQdc0c|e8d0;ERdcd2|e7f8;flpio-W (dcc0) b/c 1942/03cc

[legend: AQ means add_request ('AD' if the queue was empty), the preceding 'seg????' is the seg value set in the req, the following # is the req address. ER is end_request with the req addr and the bufhead addr. ´flpio´ is from the request routine in the driver followed by blk# and the first 2 bytes of content.]

More or less obvious - as mentioned above, the band aid is to use pointers from the bufhead instead of the req in the floppy driver, but it doesn't solve the underlying problem.

How do we get to the point where this happens? Some important details:

The remaining challenge is to find what's triggering the error. Observations:

Given this mixup between L1 and L2, and with no other suspects at the time, I intend to take a close look at map_buffer/unmap_buffer next. Even though the actual payload data doesn't use L1, the metadata does, and I'm frequently seeing a block containing indirect block pointers (belonging to the actual file at hand) in the erroneous data.

Close but no cigar.

-M

BTW - the observations indicate a need for a sleep_on in the waiting-for-buffer-available loop, after say 5 loops, when it's obvious it will take some time before the resource becomes available again. Whether 20 req's are useful remains to be seen. I've decided to keep it that way until the async IDE driver has been tested with 2 floppy drives concurrently.

ghaerr commented 1 year ago

Wow, @Mellvik you are an amazing debugger!!! Super job finding and describing the exact problem.

From your descriptions, and looking at map_buffer, I think I know (finally) what is going on. First, let me describe the map/unmap buffer process in the context of your descriptions:

When the kernel filesystem code needs to access file data, it calls map_buffer to copy otherwise inaccessible data (i.e. far data for ext buffers or XMS-addressed data) into L1 so that a near pointer can be used to access the buffer contents, using a hard-coded bh->b_data[] etc. When done, unmap_buffer is called to release it. A b_mapcount field is incremented on map and decremented on unmap. Both these functions can be called repeatedly before any block I/O is later actually attempted, based on the b_dirt bit and syncing.

When b_mapcount is nonzero, the b_data field points to L1, otherwise b_L2data points to the buffer. For code that doesn't need to access buffer data directly (e.g. only whole blocks are being looked at), the buffer_data function is used for access:

char *buffer_data(struct buffer_head *bh)
{
    return (bh->b_data? bh->b_data: EBH(bh)->b_L2data);
}
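
For the copy-back direction, the shape is roughly the following sketch (based on the unmap/copy fragment quoted further down in this thread; the mapcount handling shown here is an assumption, not the actual routine):

/* sketch: when the last mapper releases the buffer, copy the L1 copy back
 * out to its external (L2/XMS) home and restore the far seg/data pointers */
void unmap_buffer(struct buffer_head *bh)
{
    ext_buffer_head *ebh = EBH(bh);

    if (ebh->b_mapcount && --ebh->b_mapcount == 0) {
        xms_fmemcpyw(ebh->b_L2data, ebh->b_ds, bh->b_data, kernel_ds, BLOCK_SIZE/2);
        bh->b_data = (char *)0;     /* no longer mapped in L1 */
        ebh->b_seg = ebh->b_ds;     /* I/O address reverts to the far buffer */
    }
}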

When an ll_rw_blk request is made, the following code is used:

    req->rq_seg = buffer_seg(bh);
    req->rq_buffer = buffer_data(bh);

So, what we're talking about is that an I/O request's buffer seg/buffer has been found to be different than the actual buffer's seg/data values... here is how that could happen:

The problem (seems to) affect rq_seg and rq_buffer only, which is strange.

If an I/O request is made while a buffer is in L1 (i.e. after map_buffer is called), then the request buffer address will be L1.

When the req data is wrong, it's always L1 replacing L2.

This means an I/O request is being made after a map_buffer call, but before unmap_buffer is called, which may copy the L1 buffer back, and change the seg/data addresses.

So what we have here is the classic problem of trying to keep the value of a variable in two different places: one in the request, the other in the buffer head.

Looking at map_buffer, it is not checking whether a buffer is locked. So what is likely happening is that map_buffer ends up being called and the problem buffer is being "copied back" to L2 but meanwhile old values of L1 remain in the request.

That's the root cause I think, but the fix is going to take a little thought. We could try a quick fix where both map and unmap buffer do a wait_on_buffer if the buffer is locked, but this could lead to deadlock issues given the shortage of L1 buffers, and I'm not sure that actually solves the original problem of an L1 buffer address being copied into the request buffer.

The real question is why map_buffer is being called on a buffer that is already scheduled for I/O. Does the writing application call write in smaller than 1K increments? If so, then when the system's buffers are all full, it could be that the same dirty buffer the application was writing to is caught up in a sync-initiated I/O while the application calls map_buffer again to continue writing to it.

For this last case, you might try waiting for the I/O to complete as the first step in map_buffer:

void map_buffer(struct buffer_head *bh)
{
    ext_buffer_head *ebh = EBH(bh);
    int i;

    if (ebh->b_locked) wait_on_buffer(bh); // <--- add this line

    /* If buffer is already mapped, just increase the refcount and return */
    if (bh->b_data /*|| ebh->b_seg != kernel_ds*/) {
        if (!ebh->b_mapcount)
            debug("REMAP: %d\n", buf_num(bh));
        goto end_map_buffer;
    }
...

I am thinking though the general solution is that we will need to rethink exactly how and when map/unmap buffer changes the buffer address, and exactly which address should be used for I/O. Since all I/O was previously synchronous, we didn't need to worry about changing buffer addresses in the middle of sync.

ghaerr commented 1 year ago

@Mellvik: This is indeed a serious rabbit hole... I'm trying to think through and understand the exact way the buffer system works, using getblk/brelse as well as now in addition map_buffer/unmap_buffer.

In the meantime, I've just recommended a classic mistake above, where a variable is tested that can be set by an interrupt handler, which can lead to a race condition (in map_buffer):

     if (ebh->b_locked) wait_on_buffer(bh); // <--- add this line

this should be changed to:

   wait_on_buffer(bh); // <--- add this line

The question remains exactly why/how map_buffer is being called on a buffer that may be dirty and is currently already in an I/O request queue. Nonetheless, I'm hoping that the above changes the test results you're getting.

The issue that is bothering me is how the kernel could have worked otherwise, BEFORE map_buffer was added to ELKS (originally because there were no far pointers and we had to have L1 buffers). Something doesn't make sense as to how a buffer might get dirtied without waiting for an I/O complete...

SO, I decided to look into Linux 2.0 and see what, if anything was different. And guess what the comment at the top of sync_buffers looks like: (!!!)

/* Godamity-damn.  Some buffers (bitmaps for filesystems)
   spontaneously dirty themselves without ever brelse being called.
   We will ultimately want to put these in a separate list, but for
   now we search all of the lists for dirty buffers */

static int sync_buffers(kdev_t dev, int wait)
{

What I think this is referring to (other than a much deeper rabbit hole than we've been expecting) has to do with the "Z-maps" that the MINIX filesystem keeps. Long story short, Z-maps (and I-maps) are the MINIX bitmaps for the free inode and free block lists. These maps are kept in buffers which are allocated at mount time and released at unmount; this allows the free maps to be accessed very quickly when writing files, without possibly having to do another I/O to get the map bits.

What Linus is saying in his comment is that the MINIX Z-map code isn't calling brelse after dirtying the buffer(s). This means that the standard way of waiting for I/O to complete isn't being handled (brelse calls wait_on_buffer as its first function call). So the Linux 2.0 sync_buffers routine is VERY complicated.

The possibly great news for us is that the MINIX Z-map code DOES call map_buffer and unmap_buffer before accessing the Z-map bits (this allows the buffers to at least be copied in/out from L2, since they're always allocated and not released as described). Thus, we do have an indication of when the Z-map code is gaining access to or releasing a buffer, and this call should be able to be used to sleep the calling task until I/O completes (at least that's what I'm hoping!).

So - I'm hoping that the wait_on_buffer fixes our problem without having to resort to the very long and ugly code that Linux 2.0 has for managing this situation.

What I still don't understand is why wait_on_buffer is also called at buffer release time (it is always called by getblk when getting a buffer). SO - we may also want to have a wait_on_buffer as the first thing in unmap_buffer.
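For illustration, the top of unmap_buffer would then look something like this (a sketch only, assuming unmap_buffer takes the same buffer_head pointer as map_buffer; the rest of the function is unchanged and elided):

void unmap_buffer(struct buffer_head *bh)
{
    wait_on_buffer(bh); // <--- add this line: don't unmap/copy back while I/O
                        //      may still be scheduled or in progress on this buffer
...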

The cost of this approach has yet to be determined - that is, we may be sleeping a process more than it has to be - except that the wait_on_buffer call is only used to protect sections from I/O in progress. We shall see!

ghaerr commented 1 year ago

The above may not solve the whole problem - in map_buffer, it is seen that it will try to use an L1 buffer for mapping if the b_mapcount field is zero, without checking if that same buffer might also be in the middle of I/O. Thus, the following change might also need to be added:

        if (!ebmap->b_mapcount && !ebmap->b_locked) { // <--- change this line to add locked test before using L1 buffer
            debug("UNMAP: %d <- %d\n", buf_num(bmap), i);

            /* Unmap/copy L1 to L2 */
            xms_fmemcpyw(ebmap->b_L2data, ebmap->b_ds, bmap->b_data, kernel_ds, BLOCK_SIZE/2);    
            bmap->b_data = (char *)0;
            ebmap->b_seg = ebmap->b_ds;
            break;              /* success */
        }

At this point, I'm just looking hard at how to solve the problem, not necessarily the most efficient way. I am hoping that with the above three fixes in place (in addition to the race-free wait_on_buffer fix), the problem will go away. If that works, then understanding how to most efficiently fix the issues with a minimum of task sleeping might want to be considered.

Finally, if none of this works, perhaps #define DEBUG 1 at the top of buffer.c so that the long list of MAP/UNMAP/REMAP printk's can be checked.

Mellvik commented 1 year ago

Thank you @ghaerr -

Much appreciated. This is a lot of useful thinking towards understanding both the problem and the solution. Test results coming up.

In parallel I'm working on completely understanding exactly what happens - by meticulously following each and every 'movement' in the buffer&req space towards the error. I’m thinking the ultimate fix may depend on (or at least benefit from) such understanding. It’s also a very educational journey - if time consuming and frustrating at times.

The question haunting me is this: How can a buffer that has been allocated, filled, marked dirty and not subject to any other activity on the system become involved in mapping? The problem (the change of buf and seg in the bh) happens before the buffer gets attached to the req (as evidenced by the trace), then gets reset some time after entering the req queue. The req queue is moving slowly so there is ample time to change before it gets to the point of doing the actual IO. The data may or may not have been copied into an L1 buffer, but fortunately not copied back.

Focusing on the origin, why would a buffer with data that never was @ L1 and never should get there, get mapped in the first place? That said, sometimes a buffer containing metadata IS involved, which certainly has been to L1 level and back. To me this reasoning indicates a possible race condition in which the mapping/remapping is actually supposed to be done on a different buffer. Also notable - the data survives in its L2 buffer - it doesn’t get copied down (we don’t know that for sure) and it doesn’t get copied back up to L2, overwriting the original.

BTW - debugging has become a lot easier with the band aid in place - I don’t have to create new (physical) floppies for every compile, new kernels can be ftp’d in. Even though there may still be bad consequences of the mess-up that haven’t been discovered, it seems the boot floppy is now stable, at least for the simple operations the test case entails.

The hunt continues.

-M

ghaerr commented 1 year ago

Focusing on the origin, why would a buffer with data that never was @ L1 and never should get there, get mapped in the first place?

You're talking about the processes that are writing 16k of data, right? Agreed, I don't quite yet understand how file data contents are streaming through L1.

I had thought and originally stated that those file contents would be mapped using the (renamed) block_write or blk_rw routines, but looking again at them, they call getblk but never actually map the buffer to L1, instead using xms_fmemcpy to write data less than 1K bytes directly into/from a buffer.

However, MINIX directory search namei routines all use map_buffer, so any file open or create will cause a directory inode to pass through L1. This could answer why a file ended up renamed to ls in an earlier test. I'm continuing to try to figure out how a data buffer might go through L1.

One big problem for sure is that map_buffer only checks for b_mapcount == 0 before copying an L1 buffer out for use by another process. Since it does not check b_locked, any request header with an associated L1 buffer will end up reading or writing the wrong data if the L1 buffer gets remapped before I/O scheduled on it completes. That's what my above fixes were trying to target.

A long time ago, there was a bug to do with the buffer mapping, which is why I added the MAP/REMAP/UNMAP debug statements. It is possible that bug is still there, I can't recall exactly what was done to fix it. From your comments, it sounds like this still might be a possibility.

The problem (the change of buf and seg in the bh) happens before the buffer gets attached to the req (as evidenced by the trace)

Yes, a buffer is being mapped to L1 prior to an I/O write request on it (likely from sync_buffers)... right? If so and you're running STRACE, perhaps you can see which process and system call is being executed that causes the data buffer to be mapped to L1 in the first place. That should answer the questions you're pondering. Once we know the system call that's causing the problem, we can do more source tracing.

On another note, I was thinking about creating a test harness of some kind to do full async I/O testing. Perhaps converting the SSD-TEST or ramdisk driver, along with a kernel timer for delayed I/O results, which would allow for QEMU testing.

Mellvik commented 1 year ago

Again thanks for the loud thinking, @ghaerr. This is getting us there!

Focusing on the origin, why would a buffer with data that never was @ L1 and never should get there, get mapped in the first place?

You're talking about the processes that are writing 16k of data, right? Agreed, I don't quite yet understand how file data contents are streaming through L1.

Turns out it doesn't but I have more tracing to do to completely wrap my head around it. There's still a lot of noise in my traces.

I had thought and originally stated that those file contents would be mapped using the (renamed) block_write or blk_rw routines, but looking again at them, they call getblk but never actually map the buffer to L1, instead using xms_fmemcpy to write data less than 1K bytes directly into/from a buffer.

Matches what I (thought I) observed earlier today!

However, MINIX directory search namei routines all use map_buffer, so any file open or create will cause a directory inode to pass through L1. This could answer why a file was renamed ls from an earlier test. I'm continuing to try to figure out how a data buffer might go through L1.

Yes, it seems metadata pass through L1, now: what should end up in the req queue, the L1 or L2 buffer? (Not that it matters for the result, but it's important for understanding the process).

One big problem for sure is that map_buffer only checks for b_mapcount == 0 before copying an L1 buffer out for use by another process. Since it does not check b_locked, any request header with an associated L1 buffer will end up reading or writing the wrong data if the L1 buffer gets remapped before I/O scheduled on it completes. That's what my above fixes were trying to target.

I really need to reread this one a couple of times, to make sure you're saying what I think you're saying (it's getting late), but this one eliminates the problem (!!!!!). Possibly even fixes it, I just have to understand it completely...

A long time ago, there was a bug to do with the buffer mapping, which is why I added the MAP/REMAP/UNMAP debug statements. It is possible that bug is still there, I can't recall exactly what was done to fix it. From your comments, it sounds like this still might be a possibility.

That would be something if we'd torpedoed that one too.

The problem (the change of buf and seg in the bh) happens before the buffer gets attached to the req (as evidenced by the trace)

Yes, a buffer is being mapped to L1 prior to an I/O write request on it (likely from sync_buffers)... right? If so and you're running STRACE, perhaps you can see which process and system call is being executed that causes the data buffer to be mapped to L1 in the first place. That should answer the questions you're pondering. Once we know the system call that's causing the problem, we can do more source tracing.

That's the thing, there are no other processes active than the cp (or ftp)... I'm in the middle of comparing rather voluminous traces, and yes, this is where the treasure is hiding. Btw, why would there be an L1 buffer in the req queue (unless we have a pure L1 config)?

On another note, I was thinking about creating a test harness of some kind to do full async I/O testing. Perhaps converting the SSD-TEST or ramdisk driver, along with a kernel timer for delayed I/O results, which would allow for QEMU testing.

This is a great idea, let's talk more about that!!

-M

ghaerr commented 1 year ago

Yes, it seems metadata pass through L1, now: what should end up in the req queue, the L1 or L2 buffer? (Not that it matters for the result, but it's important for understanding the process).

What is happening is that everything is seemingly proceeding normally, with dirty metadata in an L1 buffer, when the sync_buffer process is run (by /etc/init). That starts I/O on all dirty buffers, and right now the kernel has to use the current valid L1 buffer address from the buffer head pointer, not its L2 buffer, since the L2 buffer address doesn't hold the valid data (L1 must always override L2, that's its whole purpose). So the request for I/O from sync_buffers has the L1 buffer address set. Time goes on, and since the I/O hasn't completed yet, another process wakes up needing metadata, which happens to nudge an unmapped but dirty buffer out of L1. Since map_buffer doesn't check for I/O in progress using b_locked, the L1 buffer is reused and different contents are fmemcpy'd into it, while the I/O request still hasn't started. When it finally does start, the I/O is performed on the previous L1 buffer address, which now has the wrong data in it.

My above four source changes should eliminate all of that from happening, but certainly mostly the added check for buffer not locked in the main loop in map_buffer. (map_buffer will then never release even an unmapped buffer if I/O is scheduled; b_locked indicates I/O is scheduled, b_mapcount just indicates whether the L1 buffer can be reused; L1 buffers are kept as long as possible since no point in copying to L2 then back to L1 again every time).

Btw, why would there be an L1 buffer in the req queue (unless we have a pure L1 config)?

I'm thinking that an L1 buffer is in the request queue because it's dirty for any reason and sync_buffers (in the final pass with wait=1) called ll_blk_rw on it. That is - everything is OK until the sync is run, and because of async I/O as described above, the sync starts I/O on an L1 buffer without stopping it from being reused!! The L1 buffer then gets reused before the I/O is started, but the req->rq_buffer address isn't changed - BOOM.

When a buffer is copied into L1, it stays there until pressure from other processes forces it to be written back out to L2. The whole problem is that the L1 buffer cache isn't protected against async I/O: the L1 buffer can be replaced at any time before I/O is started, since the only check made is b_mapcount == 0. This is definitely buggy (although not certain this is the only problem, just like wait_on_buffer is definitely buggy, all described above).

Another thought would be to stop sync_buffers from writing out L1 buffers. This would mean a full sync wouldn't be done, but we could use this to see whether it eliminates any errors. Change the following in sync_buffers:


        /*
         *      Locked buffers..
         *
         *      If buffer is locked; skip it unless wait is requested
         *      AND pass > 0.
         */
        if (ebh->b_locked) {
            if (!wait) continue;
            continue; // <--- add this line
            wait_on_buffer(bh);
        }
ghaerr commented 1 year ago

@Mellvik:

I was able to convert the existing SSD driver to run async I/O on ELKS, using an I/O completion delay of 100ms to simulate a slow device. Running with NR_REQUEST=4 I then ran a test script which creates a MINIX filesystem on /dev/ssd, then tries to run simultaneous cp and sync commands to try to duplicate your issue, all on QEMU.

While I was not able to get corrupted data, I was able to easily reproduce the problem of the request segment and the buffer head segment changing before I/O started. When this happens, the SSD (or in your case FD) driver will use the L1 buffer address, rather than the EXT memory address for the actual I/O. It would appear that on your system only, sometimes the SAME L1 buffer is reused for other data, while on mine, the L1 data isn't overrun (yet), so no data corruption. I need a script that will duplicate your problem.

Nonetheless - I was able to verify that adding a single line change to map_buffer below is able to stop the above address change from ever occurring:

    if (!ebmap->b_mapcount && !ebmap->b_locked) { // <-- change this line by adding the b_locked check

I would suggest you try this and see what happens. The other suggestions I made don't seem to have much effect (except for wait_on_buffer, which was fixed as suggested above for race conditions).

Here's the test script I used:

set -x
set -e

ramdisk /dev/ssd make 96
mkfs /dev/ssd 96
fsck -lvf /dev/ssd
mount /dev/ssd /mnt
cp /bin/p* /mnt &
sync &
cp /bin/p* /mnt &
sync &
cp /bin/p* /mnt &
sync &
ls -l /mnt
df /dev/ssd
sync
cmp /bin/partype /mnt/partype
cmp /bin/passwd /mnt/passwd
cmp /bin/poweroff /mnt/poweroff
cmp /bin/printenv /mnt/printenv
cmp /bin/ps /mnt/ps
cmp /bin/pwd /mnt/pwd
umount /dev/ssd
fsck -lvf /dev/ssd

The script may fail due to out of memory, in which case the cmp will have to be moved to another script for copy verification.

Here's the revised elks/arch/i86/drivers/block/ssd.c driver in full, to show the kernel timeout callback method used:

/*
 * ELKS Solid State Disk block device driver
 *      Use subdriver for particular SSD device
 *      ssd_test.c - test driver using allocated main memory
 *
 * Rewritten June 2020 Greg Haerr
 * Rewritten for async I/O Aug 2023 Greg Haerr
 */
#define DEBUG 1
#include <linuxmt/config.h>
#include <linuxmt/rd.h>
#include <linuxmt/major.h>
#include <linuxmt/kernel.h>
#include <linuxmt/mm.h>
#include <linuxmt/errno.h>
#include <linuxmt/debug.h>

#define MAJOR_NR SSD_MAJOR
#define SSDDISK
#include "blk.h"
#include "ssd.h"

static sector_t NUM_SECTS = 0;          /* max # sectors on SSD device */

static int ssd_open(struct inode *, struct file *);
static void ssd_release(struct inode *, struct file *);

static struct file_operations ssd_fops = {
    NULL,                       /* lseek */
    block_read,                 /* read */
    block_write,                /* write */
    NULL,                       /* readdir */
    NULL,                       /* select */
    ssddev_ioctl,               /* ioctl */
    ssd_open,                   /* open */
    ssd_release                 /* release */
};

void ssd_init(void)
{
    if (register_blkdev(MAJOR_NR, DEVICE_NAME, &ssd_fops) == 0) {
        blk_dev[MAJOR_NR].request_fn = DEVICE_REQUEST;
        NUM_SECTS = ssddev_init();
    }
    if (NUM_SECTS)
        printk("ssd: %ldK disk\n", NUM_SECTS/2UL);
    else printk("ssd: initialization error\n");
}

static int ssd_open(struct inode *inode, struct file *filp)
{
    debug("SSD: open\n");
    if (!NUM_SECTS)
        return -ENXIO;
    inode->i_size = NUM_SECTS << 9;
    return 0;
}

static void ssd_release(struct inode *inode, struct file *filp)
{
    debug("SSD: release\n");
}

#define IODELAY     (10*HZ/100)  /* 10/100 sec = 100msec */

extern jiff_t ssd_timeout;

/* called by timer interrupt */
void ssd_io_complete(void)
{
    struct request *req;
    int ret;

    ssd_timeout = 0;        /* stop further callbacks */
    req = CURRENT;
    if (!req || req->rq_dev == -1U || req->rq_status != RQ_ACTIVE) {
        printk("ssd_io_complete: NULL request\n");
        return;
    }

    for (;;) {
        char *buf;
        ramdesc_t seg;
        sector_t start;

        req = CURRENT;
        if (!req) return;
        if (req->rq_dev == -1U || req->rq_status != RQ_ACTIVE) {
            printk("ssd_io_complete: bad request\n");
            end_request(0);
            continue;
        }

        struct buffer_head *bh = req->rq_bh;
        if (req->rq_buffer != buffer_data(bh) || req->rq_seg != buffer_seg(bh)) {
           printk("SSD: ***ADDR CHANGED*** req seg:buf %04x:%04x bh seg:buf %04x:%04x\n",
                req->rq_seg, req->rq_buffer, buffer_seg(bh), buffer_data(bh));
        }

        buf = req->rq_buffer;
        seg = req->rq_seg;
        start = req->rq_blocknr * (BLOCK_SIZE / SD_FIXED_SECTOR_SIZE);

        /* all ELKS requests are 1K blocks = 2 sectors */
        if (start >= NUM_SECTS-1) {
            debug("SSD: bad request block %lu\n", start/2);
            end_request(0);
            continue;
        }
        if (req->rq_cmd == WRITE) {
            debug("SSD: writing block %lu\n", start/2);
            ret = ssddev_write_blk(start, buf, seg);
        } else {
            debug("SSD: reading block %lu\n", start/2);
            ret = ssddev_read_blk(start, buf, seg);
        }
        if (ret != 2) {
            end_request(0);         /* I/O error */
            continue;
        }
        end_request(1);             /* success */
        if (CURRENT) {
            ssd_timeout = jiffies + IODELAY;    /* schedule next completion callback */
        }
    }
}

/* called by add_request to start I/O after first request added */
static void do_ssd_request(void)
{
    printk("do_ssd_request\n");
    for (;;) {
        struct request *req = CURRENT;

        if (!req || req->rq_dev == -1U) {
            printk("do_ssd_request: no requests\n");
            return;
        }
        INIT_REQUEST(req);

        if (!NUM_SECTS) {
            end_request(0);
            continue;
        }
        ssd_timeout = jiffies + IODELAY;    /* schedule completion callback */
        return;
    }
}

With the following change in elks/arch/i86/kernel/timer.c:

diff --git a/elks/arch/i86/kernel/timer.c b/elks/arch/i86/kernel/timer.c
index 3bee6520..b806e2c0 100644
--- a/elks/arch/i86/kernel/timer.c
+++ b/elks/arch/i86/kernel/timer.c
@@ -57,6 +57,9 @@ static void calc_cpu_usage(void)
 }
 #endif

+jiff_t ssd_timeout;
+extern void ssd_io_complete();
+
 void timer_tick(int irq, struct pt_regs *regs)
 {
     do_timer(regs);
@@ -69,6 +72,10 @@ void timer_tick(int irq, struct pt_regs *regs)
     rs_pump();         /* check if received serial chars and call wake_up*/
 #endif

+    if (ssd_timeout && jiffies >= ssd_timeout) {
+        ssd_io_complete();
+    }
+

The test driver can be enabled by using CONFIG_BLK_DEV_SSD_TEST=y.

I'm running out of time but should have more information tomorrow. I've added a number of debug printk's to sync_buffer to see exactly what is going on.

Mellvik commented 1 year ago

Thanks @ghaerr,

you may have missed my point about the fix working...

Yes, it seems metadata pass through L1, now: what should end up in the req queue, the L1 or L2 buffer? (Not that it matters for the result, but it's important for understanding the process).

What is happening is that everything is seemingly proceeding normally, with dirty metadata in an L1 buffer, when the sync_buffer process is run (by /etc/init).

Not that it matters for the result, but I'm not using autosync and not bin/sync either. Just letting get_free_buffer() call sync when necessary.

That starts I/O on all dirty buffers, and right now the kernel has to use the current valid L1 buffer address from the buffer head pointer, not its L2 buffer, since the L2 buffer address doesn't hold the valid data (L1 must always override L2, that's its whole purpose). So the request for I/O from sync_buffers has the L1 buffer address set. Time goes on, and since the I/O hasn't completed yet, another process wakes up needing metadata, which happens to nudge an unmapped but dirty buffer out of L1. Since map_buffer doesn't check for I/O in progress using b_locked, the L1 buffer is reused and different contents are fmemcpy'd into it, while the I/O request still hasn't started. When it finally does start, the I/O is performed on the previous L1 buffer address, which now has the wrong data in it.

My above four source changes should eliminate all of that from happening, but certainly mostly the added check for buffer not locked in the main loop in map_buffer. (map_buffer will then never release even an unmapped buffer if I/O is scheduled; b_locked indicates I/O is scheduled, b_mapcount just indicates whether the L1 buffer can be reused; L1 buffers are kept as long as possible since no point in copying to L2 then back to L1 again every time).

Btw, why would there be an L1 buffer in the req queue (unless we have a pure L1 config)?

I'm thinking that an L1 buffer is in the request queue because it's dirty for any reason and sync_buffers (in the final pass with wait=1) called ll_blk_rw on it. That is - everything is OK until the sync is run, and because of async I/O as described above, the sync starts I/O on an L1 buffer without stopping it from being reused!! The L1 buffer then gets reused before the I/O is started, but the req->rq_buffer address isn't changed - BOOM.

When a buffer is copied into L1, it stays there until pressure from other processes forces it to be written back out to L2. The whole problem is that the L1 buffer cache isn't protected against async I/O: the L1 buffer can be replaced at any time before I/O is started, since the only check made is b_mapcount == 0. This is definitely buggy (although not certain this is the only problem, just like wait_on_buffer is definitely buggy, all described above).

This is spot on, and as pointed out in my previous message, these two changes (actually the wait_on_buffer addition doesn't do anything in this case but it's definitely a bug as you point out) eliminate what I have now come to call the 'buffer pollution problem'.

The quest to understand exactly what's going on continues though (stubborn? yes). I'm observing L1 buffers in the REQ queue with metadata (they've worked just fine all along) and then the erroneous blocks where the L1 and L2 data are unrelated and cause the error.

What I'm trying to do is to ascertain that what's happening is in fact the scenario you describe above - the missing b_locked test (possibly the wait) and nothing else - just to make sure no stones are left unturned now that we're in the rabbit hole anyway.

-M

Mellvik commented 1 year ago

This is some really great work, @ghaerr.

I was able to convert the existing SSD driver to run async I/O on ELKS, using an I/O completion delay of 100ms to simulate a slow device. Running with NR_REQUEST=4 I then ran a test script which creates a MINIX filesystem on /dev/ssd, then tries to run simultaneous cp and sync commands to try to duplicate your issue, all on QEMU.

Actually you may be able to get corrupted data if you skip the manual syncs and let the buffer cache fill up, then autosync. The interplay between the out-of-buffers situation, the rapid fill of the REQ queue (which I've kept at 20 in order to get a more detailed spread [and adding more noise of course]) and the oscillations that commence when one freed REQ immediately gets reused all seem to play a part.

It may also simplify debugging if you don't have several processes to keep track of - in my case they're not required to trigger the problem.

While I was not able to get corrupted data, I was able to easily reproduce the problem of the request segment and the buffer head segment changing before I/O started. When this happens, the SSD (or in your case FD) driver will use the L1 buffer address, rather than the EXT memory address for the actual I/O. It would appear that on your system only, sometimes the SAME L1 buffer is reused for other data, while on mine, the L1 data isn't overrun (yet), so no data corruption. I need a script that will duplicate your problem.

Nonetheless - I was able to verify that adding a single line change to map_buffer below is able to stop the above address change from ever occurring:

    if (!ebmap->b_mapcount && !ebmap->b_locked) { // <-- change this line by adding the b_locked check

I would suggest you try this and see what happens. The other suggestions I made don't seem to have much effect (except for wait_on_buffer, which was fixed as suggested above for race conditions).

This is my experience too. The b_locked test eliminates the problem. BTW I have also temporarily irq-protected the entire put_last_lru routine for good measure, but that may not be useful.

The very simple script I've been using goes like this:

#!/bin/sh
mount /dev/dhda4 /mnt
cd /mnt/tmp
# xx is the first 40k of a /linux file
cp xx /root
cmp xx /root/xx

And a run:

TLVC 0.6.0

login: root
tlvc15# ./test1.sh
Files differ at byte position 17409
tlvc15# 

The HD partition is FAT, but that does not matter. As pointed out before, the source may as well be FTP. Root and boot are on FD.

Anyway - AFAIK we do have a solution, the problem has been located and fixed, and unless the final cleanup reveals more 'leads', another one has bitten the dust.

Thanks for the help - and BTW I'll copy the timer and SSD changes to TLVC, and use the idea in the directFD driver to see if I can get the problems repeated in QEMU! That would be very helpful for future debugging. The directFD raw IO implementation is just around the corner.

-- M

ghaerr commented 1 year ago

Hello @Mellvik!

you may have missed my point about the fix working...

Yes, I did, LOL! I think the reason is that GitHub doesn't always format email-based responses well. Looks like you've corrected that though :)

AFAIK we do have a solution, the problem has been located and fixed,

Wow, very nice indeed!! Another big kernel problem in ELKS and TLVC fixed!!! Thank you for your very detailed debugging and identification of the specifics. I must say diving deep into the Linux buffer system was really helpful in fully understanding how block I/O really works, and enabled writing an async SSD driver that amazingly enough worked the first time :)

I plan on using your script as well as cataloging a few others to fully exercise the block I/O system for regression testing. Given the new understanding of exactly how buffers and async I/O works, I am now also considering how one might be able to mix async I/O with synchronous (BIOS) I/O, particularly for the case of a BIOS HD driver. The idea would be that while an async FD driver quickly improves system responsiveness (and many FD controllers are "IBM PC" compatible), a non-BIOS HD driver has many more compatibility issues due to the type of drive connected. I am thinking of a way that could allow a BIOS HD driver with async I/O for everything else, but that's off subject and discussion can be continued elsewhere...

The quest to understand exactly what's going on continues though (stubborn? yes).

What I'm trying to do is to ascertain that what's happening is in fact the scenario you describe above - the lacking b_locked test (possibly the wait) and nothing else

Totally agree. And now I realize we may in fact have one more lurking issue, now that the kernel buffer management is better understood: In a way, the L1 and L2 buffer caches could be thought of as hierarchical caches, the same way a CPU uses them. That is, this most recent bug was basically that when the L1 cache was in use, get_free_buffer (or sync) caused a buffer shuffle/cache flush and the L1 buffer got reused, but the I/O request pointers weren't updated, because the buffer's locked state wasn't checked. HOWEVER - I am thinking that when get_free_buffer shuffles the buffer system looking for another available buffer (which could kick out an old buffer and REUSE a buffer head to contain a completely different block number), the L1 cache (which is an array of buffer head pointers that are in L1) is not being explicitly invalidated. We need to make sure that any mapped L1 buffer doesn't get shuffled out (even/especially in the non-I/O-in-progress case) when more buffers are needed. get_free_buffer does check b_data, which is almost the same as b_mapcount != 0, but I intend to verify this is always the case. Bottom line is that it needs to be ensured that get_free_buffer won't reuse an L1 buffer that other processes may be using but asleep.
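To make that concrete, the kind of guard I have in mind inside get_free_buffer's candidate loop would be roughly the following (a sketch only - the actual loop structure and variable names in buffer.c may differ):

        /* sketch: inside get_free_buffer's search for a reusable buffer head */
        ext_buffer_head *ebh = EBH(bh);

        if (bh->b_data || ebh->b_mapcount || ebh->b_locked)
            continue;   /* still mapped into L1, held by a (possibly sleeping) process,
                         * or with I/O scheduled - never steal this buffer head */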

ghaerr commented 1 year ago

The HD partition is FAT, but that does not matter.

As mentioned previously, the FAT filesystem uses static stack variables to allow for a lower kernel stack size. This is buggy for async I/O and will eventually cause problems for any task that happens to sleep while any function using static is in its kernel stack frame. FYI, all these functions were identified and fixed in ELKS in PR 1641. (ELKS is currently being maintained with an optional CONFIG_ASYNCIO setting to allow for system stability while async I/O is being figured out).

That same PR determined the increased stack usage for FAT when using non-static buffers, and it amounts to approx 100 bytes. The max I've seen, measured with the new kstack mechanism, is 598 bytes kernel stack. The current max stack is 640, so we're getting very close, but still usable. Using strace, it appears max stack usage is during the stat system call.

ghaerr commented 1 year ago

@Mellvik:

Just for kicks (and I'm also stubborn), I took your test script, commented out the L1 fix in map_buffer, and ran it using the async ssd driver. While it does show the ***ADDR CHANGED*** error, I am not seeing any data corruption using cmp, FYI.

I am running NR_REQUEST=4, no XMS, EXTBUFFERS=64. I'll try changing NR_REQUEST=20 to try to match your situation. It would be really nice to get a data corruption case using QEMU, for regression testing. I am running MINIX, not FAT for the test, so that could be a very big difference, since FAT uses map_buffer quite differently.

If you end up doing more testing yourself, it might be interesting to perhaps change the IODELAY in the ssd driver to better match the FD delay (I have no idea how long a floppy takes to read a sector lol), to see if that allows us to get a reproducible error condition on QEMU. Having a reproducible error on QEMU will help us fully know whether this buffer problem is fixed, given the possible remaining L1 invalidate issue described above.

Thank you!

Mellvik commented 1 year ago

@ghaerr, there is indeed a lot of good news here, and I'll be back with more comments.

Just a quick one: reduce the buffers to 32, the # of buffers MUST be lower than the size of the file you're testing with. 

 -M

Mellvik commented 1 year ago

Also, like I've mentioned before, FAT or not makes no difference. I've followed the stack carefully and I'm using BIG fat directories. No problem.

And (yeah, I know I'm repeating myself), I've been using ftp more often than FAT, it's the same. So a MINIX source is fine. BTW, this is on my Venix machine, it takes the other 3 partitions.  --M

ghaerr commented 1 year ago

reduce the buffers to 32, the # of buffers MUST be lower than the size of the file you're testing with.

That worked, the async SSD driver on QEMU now errors with my script (/bin/partype copied and 2nd block is bad without fix). With fix, everything OK. However, ***ADDR CHANGED*** is still being shown by the SSD driver with the fix, indicating the request seg/buffer is still different from the buffer head seg/buffer address at times. Investigating further, even though no error.

[EDIT: Actually I had to set EXT_BUFFERS=28 to get the above. Using 32 gets an error and trashes /dev/ssd, resulting in a bit harder-to-trace problem since I think one of the /dev/ssd MINIX Z/I maps got zapped.]

(BTW, another cool thing about testing with the async SSD driver is that the filesystem on /dev/ssd is re-created each time the system is booted, no need to rebuild a messed-up MINIX/FAT image). I'm adding ramdisk /dev/ssd kill at the start of the script, but that errors when there is no /dev/ssd inited, I'm fixing that right now as well.

Thanks!

ghaerr commented 1 year ago

However, ***ADDR CHANGED*** still being shown by SSD driver with fix, indicating the request seg/buffer is still different than the buffer head seg/buffer address at times.

OK, figured out the (hopefully) last reason the request and buffer head segment/addresses may become unpaired, using the SSD driver and script: there needs to be a wait_on_buffer at the beginning of map_buffer.

The way I started thinking about all this is to think in terms of buffers coming "in" to the L1 cache, and those going "out" (both of which switch buffer addresses), during a possible I/O in progress. Our existing fix in the middle of map_buffer, testing for !b_locked, stops any buffer from going "out" when there is unstarted/uncompleted I/O scheduled, and the new wait_on_buffer at the entry of the function stops any buffer from being switched "in" to L1 if there might be I/O in progress. Both cause the process to wait until an I/O completes before trying again. In this way, the request and buffer header segment/buffer addresses won't change, but remain "paired".
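Putting the two fixes together, the relevant parts of map_buffer end up roughly like this (assembled from the fragments quoted earlier in this thread, with the unchanged parts elided):

void map_buffer(struct buffer_head *bh)
{
    ext_buffer_head *ebh = EBH(bh);
    int i;

    wait_on_buffer(bh);     /* don't switch a buffer "in" to L1 while I/O
                               may be scheduled or in progress on it */

    /* If buffer is already mapped, just increase the refcount and return */
    if (bh->b_data) {
        ...
    }
    ...
        /* when looking for an L1 slot to reuse: */
        if (!ebmap->b_mapcount && !ebmap->b_locked) {   /* never push a buffer "out" of L1
                                                           while I/O is scheduled on it */
            /* Unmap/copy L1 to L2 */
            xms_fmemcpyw(ebmap->b_L2data, ebmap->b_ds, bmap->b_data, kernel_ds, BLOCK_SIZE/2);
            bmap->b_data = (char *)0;
            ebmap->b_seg = ebmap->b_ds;
            break;              /* success */
        }
    ...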

The reason that things are getting complicated with sync_buffer and the ELKS/TLVC kernel is that, in most cases, the execution path through the kernel is pretty straightforward, from a process' viewpoint - they make a system call, possibly sleep for I/O required for their function, then continue. But when sync_buffers is called (and insidiously through get_free_buffer), all dirty buffers are marked for I/O, and the calling process may end up potentially even busy-looping inside get_free_buffer and sync_buffers. That whole process unleashes a large number of I/O completes which allow other processes to go forward, all in the middle of a single innocent system call, even though the kernel itself is carefully not reentrant. Wow!

We need both fixes in order to eliminate the ADDR CHANGED error - for the two different reasons. I'm going to continue testing with lots of debug output and see whether this fully fixes this particular buffer problem.

I plan on posting fixes for all this after some more testing, but here's the script I'm using for SSD that should enable you to see the ***ADDR CHANGED*** message until things are fully working.

set -x
set -e

#ramdisk /dev/ssd kill
ramdisk /dev/ssd make 96
mkfs /dev/ssd 96
fsck -lvf /dev/ssd

mount /dev/ssd /mnt
cp /bin/p* /mnt &
sync &
cp /bin/p* /mnt &
sync &
#cp /bin/p* /mnt &
#sync &
#ls -l /mnt
df /dev/ssd
sync
cmp /bin/partype /mnt/partype
cmp /bin/passwd /mnt/passwd
cmp /bin/poweroff /mnt/poweroff
cmp /bin/printenv /mnt/printenv
cmp /bin/ps /mnt/ps
cmp /bin/pwd /mnt/pwd
umount /dev/ssd

fsck -lvf /dev/ssd
echo "TEST COMPLETE"
ghaerr commented 1 year ago

Hello @Mellvik,

(Final post for today...) Unfortunately, after not being able to get the system to fail in any way by running the above script, I removed all messages by converting printk to debug_blk, and guess what? The script failed with data corruption, in the second block of /bin/partype. Arggh!!! I can only imagine how many times you've been through this before.

I then tried many times to get a failure, and it all worked (with no messages). Almost in disbelief that I ever even saw the failure in the first place, I finally changed EXT_BUFFERS=20 and now it fails every time, with debug messages on or off. This time, though, the SSD driver never shows the ***ADDR CHANGED message, it's just corrupt data. Two down, one to go.

[EDIT: Update - suddenly, for some reason, without changing anything, I can't get QEMU to reproduce the failure, at all. This is a nasty timing related issue for sure!].

So, while I think we accomplished something, we're not there yet. I am beginning to think that, although the above fixes were absolutely necessary to prevent request and buffer header address unpairing, they ultimately are hiding the deeper real problem, which I think is map_buffer/unmap_buffer. Since the scripts are also running fsck /dev/ssd and sometimes the MINIX fs is corrupted, I believe that metadata passing through L1 using map_buffer is somehow getting mismanaged, and buffer data is being assigned to the wrong buffer head (or L1 cache). In other words, it's the same problem you originally described.

I'm also guessing that this problem may be related to an issue I saw years ago in map_buffer, when the MAP/REMAP/UNMAP code was added, and apparently a fix was made that seemingly put it to bed. But async I/O has awakened it; I believe it's likely not actually an async problem, just a buffer problem that occurs frequently when there's lots of buffer activity.

I was going to post a PR over at ELKS with all of this for review, but since it's not actually fixed... The Hunt Continues!!!!

ghaerr commented 1 year ago

(Final post #2 for the day...)

After chasing down the final problem from the last post, it was finally found that the test script performed two background cp copy commands by launching them via &. What was happening is that in certain cases, the 2nd cp command hadn't finished by the time sync was executed and was possibly still in the middle of creating/truncating/copying files when the cmp commands started and failed. Adding a wait command before or after sync made the problem go away completely.

I then modified the script to perform a synchronous cp without wait and the problem is still gone. In fact, I am finally (again) completely unable to produce any errors - so I think the buffer system is in fact working correctly with the fixes explained above, after all.

I've gone ahead and posted PR 1649 at ELKS to show the changes, as a work in progress. Let's hope that the issues tracked down here are in fact solved!!

BTW, with regards to QEMU acting differently sometimes on startup, I have found what I think is a bad power-related optimization - it sometimes stops firing the hardware timer tick - and the emulated OS just sits there without doing anything. Typing on another console usually starts it up again. When this problem is happening, the date is not advancing either. So I am attributing some of the irregularities in the long regression testing results for today as associated with QEMU's sometimes apparently slowing down or turning off the HW timer (possibly to save power?).

Mellvik commented 1 year ago

Wow, there's a lot to digest here, @ghaerr - and a lot of great debugging delivered.

However, ***ADDR CHANGED*** still being shown by SSD driver with fix, indicating the request seg/buffer is still different than the buffer head seg/buffer address at times.

OK, figured out the (hopefully) last reason the request and buffer head segment/addresses may become unpaired, using the SSD driver and script: there needs to be a wait_on_buffer at the beginning of map_buffer.

Interesting - because this is actually the first fix you suggested far up in this thread. I've had it in the code since, taken it out and back in a couple of times and not seen any differences, but its usefulness seemed obvious. And with a new set of reasons right here, it's definitely a stayer :-)

The way I started thinking about all this is to think in terms of buffers coming "in" to the L1 cache, and those going "out" (both of which switch buffer addresses), during a possible I/O in progress. Our existing fix in the middle of map_buffer, testing for !b_locked, stops any buffer from going "out" when there is unstarted/uncompleted I/O scheduled, and the new wait_on_buffer at the entry of the function stops any buffer from being switched "in" to L1 if there might be I/O in progress. Both cause the process to wait until an I/O completes before trying again. In this way, the request and buffer header segment/buffer addresses won't change, but remain "paired".

The reason that things are getting complicated with sync_buffer and the ELKS/TLVC kernel is that, in most cases, the execution path through the kernel is pretty straightforward, from a process' viewpoint - they make a system call, possibly sleep for I/O required for their function, then continue. But when sync_buffers is called (and insidiously through get_free_buffer), all dirty buffers are marked for I/O, and the calling process may end up potentially even busy-looping inside get_free_buffer and sync_buffers. That whole process unleashes a large number of I/O completes which allow other processes to go forward, all in the middle of a single innocent system call, even though the kernel itself is carefully not reentrant. Wow!

Indeed - and my thoughts exactly as I recorded the sequence of events required to trigger the problem. At first an occasional file screwup (new kernel being wrong and not booting) or the floppy fs got screwed after the first write. Then new kernel transfers via ftp turned out wrong almost every time. Then the current setting which appeared when Stracing was turned on. And in between, zillions of stupid and a few smart tests/changes/whatever.

We need both fixes in order to eliminate the ADDR CHANGED error - for the two different reasons. I'm going to continue testing with lots of debug output and see whether this fully fixes this particular buffer problem.

I think - even after having been through your next two messages - that we need to find out why data buffers sometimes enter the req queue with L1 seg/buf addresses. More about that later.

--M

Mellvik commented 1 year ago

Some thoughts, @ghaerr:

I plan on using your script as well as cataloging a few others to fully exercise the block I/O system for regression testing. Given the new understanding of exactly how buffers and async I/O works, I am now also considering how one might be able to mix async I/O with synchronous (BIOS) I/O, particularly for the case of a BIOS HD driver. The idea would be that while an async FD driver quickly improves system responsiveness (and many FD controllers are "IBM PC" compatible), a non-BIOS HD driver has many more compatibility issues due to the type of drive connected. I am thinking of a way that could allow a BIOS HD driver with async I/O for everything else, but that's off subject and discussion can be continued elsewhere...

This is what I had going when developing the direct FD driver. The two - BIOS HD and direct FD - work well together (with the caveat about when XMS buffers may or may not be used, as mentioned in the code and developer notes) and the system can be menuconfig-ured for the combination.

About the driver combinations and compatibility, I disagree. For FD: You're right, every PC since the 5150 has the NEC765 chip, uses a hardwired DMA channel and IRQ, and is likely to work with the direct driver - with the exception of 2.88M drives, which aren't supported and, frankly, are very rare.

HD: The IDE support introduced with the AT makes every PC compatible with just about any drive from then on. Some adjustments in the driver may be needed when new scenarios are hit (as I experienced with the old 20M drive), but even that will almost go away when I introduce interrupts in the driver, eliminating the dependency on more or less exact delays (or generally very long ones to be on the safe side). So BIOS independence is easy, with the exception of the CHS values - which may have to be managed manually if maintaining compatibility with BIOS-based OSes is important.

For older (pre-IDE) PCs, using the BIOS is mandatory; there is a lot of magic needed to get those old beasts to work. That said, the chances of hitting drives that old that actually work are minute. Further, these systems always have ISA based controllers (as in 'not on the motherboard'), and the general case would be an XT-IDE controller, which will work with the new direct HD driver.

The quest to understand exactly what's going on continues though (stubborn? yes).

What I'm trying to do is to ascertain that what's happening is in fact the scenario you describe above - the lacking b_locked test (possibly the wait) and nothing else

Totally agree. And now I realize we may in fact have one more lurking issue, now that the kernel buffer management is better understood: In a way, the L1 and L2 buffer caches could be thought of as hierarchical caches, the same way a CPU uses them. That is, this most recent bug was basically that when the L1 cache was in use, get_free_buffer (or sync) caused a buffer shuffle/cache flush and the L1 buffer got reused, but the I/O request pointers weren't updated, because the buffer's locked state wasn't checked. HOWEVER - I am thinking that when get_free_buffer shuffles the buffer system looking for another available buffer (which could kick out an old buffer and REUSE a buffer head to contain a completely different block number), the L1 cache (which is an array of buffer head pointers that are in L1) is not being explicitly invalidated.

This is an interesting observation - which MAY explain why L2 buffers that contain data and have no need for mapping at all end up with L1 buffers attached.

We need to make sure that any mapped L1 buffer doesn't get shuffled out (even/especially in the non-I/O-in-progress case) when more buffers are needed. get_free_buffer does check b_data, which is almost the same as b_mapcount != 0, but I intend to verify this is always the case. Bottom line is that it needs to be ensured that get_free_buffer won't reuse an L1 buffer that other processes may be using but asleep.

I need to go back and compare the behavior of the test case - pre and post the fix, to map out the exact difference.

--M

The HD partition is FAT, but that does not matter.

As mentioned previously, the FAT filesystem uses static stack variables to allow for a lower kernel stack size. This is buggy for async I/O and will eventually cause problems for any task that happens to sleep while any function using static is in its kernel stack frame. FYI, all these functions were identified and fixed in ELKS in PR 1641. (ELKS is currently being maintained with an optional CONFIG_ASYNCIO setting to allow for system stability while async I/O is being figured out).

That same PR determined the increased stack usage for FAT when using non-static buffers, and it amounts to approx 100 bytes. The max I've seen, measured with the new kstack mechanism, is 598 bytes kernel stack. The current max stack is 640, so we're getting very close, but still usable. Using strace, it appears max stack usage is during the stat system call.

Thanks. I'm aware of this one since you mentioned it a while back (which was the reason STRACE was turned on and 'changed everything'). I removed most (probably all) of the statics in the FAT code and never got above 519 on the stack with my usage. Adding the ASYNCIO_REENTRANT is a good idea, I'll copy that.

--M

ghaerr commented 1 year ago

About the driver combinations and compatibility, I disagree.

Thanks for your comments @Mellvik. You're much more familiar with PC hardware than I am - perhaps the async HD driver will work for a majority of systems, which would be great. I am curious how you're going to change that from polling to interrupt-driven. Can all the IDE controllers send an interrupt on I/O complete? I'm also curious as to your thoughts on "gather-read/gather-write" capability, where the controller can take a separate memory address for each sector in a multi-sector read... that would allow for some possibly big speed ups and cool kernel implementations of updated buffer code.

we need to find out why data buffers some times enter the req queue with L1 seg/buf addresses.

You're running FAT for these tests, right? I just found the reason: msdos_file_read (and write) call msdos_sread for file I/O, and msdos_sread calls map_buffer. It appears to do this only for getting a near address for use with memcpy_tofs/memcpy_fromfs. So ALL FAT file I/O is going through L1!!! No wonder FAT is so slow!! There may be a way to re-code all of this using xms_fmemcpy, like the MINIX file read/write does.
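As a rough sketch of that recode (userbuf/userseg are placeholders for the destination the read() caller supplied, and the real msdos_file_read loop bookkeeping is omitted), the per-block copy could become a single far copy out of the L2 buffer instead of a map_buffer/memcpy_tofs pair:

    /* sketch: copy one 1K block straight from the (unmapped) L2 buffer to user space */
    struct buffer_head *bh = bread(inode->i_dev, block);
    if (bh) {
        xms_fmemcpyw(userbuf, userseg, buffer_data(bh), buffer_seg(bh), BLOCK_SIZE/2);
        brelse(bh);
    }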

So, it might be a good idea to try to run your test scripts on MINIX to see if L1 data is seen (which I'll bet won't be).

Mellvik commented 1 year ago

About the driver combinations and compatibility, I disagree.

Thanks for your comments @Mellvik. You're much more familiar with PC hardware than I am - perhaps the async HD driver will work for a majority of systems, which would be great.

It should - all systems with an IDE connector. I'm collecting IDE ID profiles from as many drives as I can get my hands on to learn more (that's the ide_dump routine in the driver), but having mastered the old 40M drive, I think the TLVC direct HD driver's coverage is good.

I am curious how you're going to change that from polling to interrupt-driven. Can all the IDE controllers send an interrupt on I/O complete?

When interrupts are enabled, IDE drives/controllers will (generally) interrupt on command completion, even for simple commands like 'get drive status' and 'set drive parameters'. That means removing the delays and wait-for-ready-bit loops, which are very short on CF cards and (relatively) very long on a 40M drive. The driver mechanism will be akin to the floppy driver, but much simpler. Still PIO, no DMA.
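
Roughly the shape I have in mind - a hypothetical sketch only (register and field names like HD_COMMAND, HD_DATA, WIN_READ, rq_buffer and the insw/outb helpers are placeholders, not the actual TLVC directhd code):

/* The request routine programs the drive and returns without polling; the
 * IRQ 14 handler moves the data with PIO and completes the request, so the
 * wait-for-ready loops disappear. */
static void do_directhd_request(void)
{
    struct request *req = CURRENT;

    if (!req)
        return;
    /* ... program sector count, CHS/LBA and drive select registers here ... */
    outb(WIN_READ, HD_COMMAND);         /* drive raises IRQ 14 when a sector is ready */
    /* return now: readbuf/wait_on_buffer sleeps the task until end_request() */
}

static void directhd_intr(int irq, struct pt_regs *regs)
{
    struct request *req = CURRENT;

    if (!req) {                         /* spurious or mismatched interrupt */
        printk("directhd: unexpected interrupt\n");
        return;
    }
    insw(HD_DATA, req->rq_buffer, 256); /* one 512-byte sector = 256 words, PIO */
    end_request(1);                     /* wake the sleeping task and advance the queue */
}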

I'm also curious as to your thoughts on "gather-read/gather-write" capability, where the controller can take a separate memory address for each sector in a multi-sector read... that would allow for some possibly big speed ups and cool kernel implementations of updated buffer code.

I'm not going to touch that. The old drives (pre 1995, maybe even later) don't support it, and those that do are likely to be faster than the system itself, so I think the speed improvement potential is very limited. Keeping the driver really simple (and lean) is the goal.

we need to find out why data buffers sometimes enter the req queue with L1 seg/buf addresses.

You're running FAT for these tests, right? I just found the reason: msdos_file_read (and write) call msdos_sread for file I/O, and msdos_sread calls map_buffer. It appears to do this only to get a near address for use with memcpy_tofs/memcpy_fromfs. So ALL FAT file I/O is going through L1!!! No wonder FAT is so slow!! There may be a way to re-code all of this using xms_fmemcpy, like the MINIX file read/write does.

So, it might be a good idea to try to run your test scripts on MINIX to see if L1 data is seen (which I'll bet won't be).

That's an important discovery indeed! Thanks. I'll cave in and botch one of the Venix partitions after all to check it out. The weird thing about the current scenario (which is the same with and without the fixes) is the L1/L2 mix. When added to the req queue, some block headers point to L1, some to L2, like this (printks from make_request (the seg stuff) and add_request (the AD/AQ stuff), plus ER as in end_request):

DEBUG: no free bufs, syncing <--- from get_free_buffer, which triggers sync_buffers(),
                                                           which hits the req queue
|seg13db|blk3;ADd604|   <---------bitmap, OK
|seg2400|blk1139;AQd5f2|
|seg2400|blk1140;AQd5e0|
|seg2400|blk1141;AQd5ce|
|seg2400|blk1142;AQd5bc|
|seg2400|blk1143;AQd5aa|
|seg2400|blk1144;AQd598|
|seg2400|blk1145;AQd586|
|seg13db|blk1146;AQd574|  <---- indirect block, OK
|seg2400|blk1147;AQd562|
|seg2400|blk1148;AQd550|
|seg2400|blk1149;AQd53e|
|seg2400|blk1150;AQd52c|ERd604|0200 <--- at this point the end_requests start pouring in,
|seg2400|blk1151;AQd604|ERd5f2|0200   <--- matching req addresses from the AD/AQ entries above
|seg2400|blk1152;AQd5f2|ERd5e0|0200
|seg2400|blk1153;AQd5e0|ERd5ce|0200
|seg2400|blk1154;AQd5ce|ERd5bc|0200
|seg2400|blk1155;AQd5bc|ERd5aa|0200
|seg2400|blk1156;AQd5aa|ERd598|0200
|seg2400|blk1157;AQd598|ERd586|0200
|seg13db|blk1158;AQd586|ERd574|0200 <--- just another data block, why the L1 reference suddenly?
|seg13db|blk1159;AQd574|ERd562|0200
|seg13db|blk1160;AQd562|ERd550|0200
|seg13db|blk1161;AQd550|
...  (at this point, the freed buffers start filling up again, entailing activity from the disk drive)

The first L1 entry (seg13db) is OK - it's metadata; the second is also OK, likely the indirect block; but the 4-in-a-row later are pure data (and this is where the errors normally would appear).

Looking forward to seeing what a copy from Minix looks like - with your discovery in mind.

Thanks! --M

ghaerr commented 1 year ago

Hello @Mellvik,

Interesting layered debugging printk output, I'll have to take a closer look at how you're doing that. I'm using plain voluminous printk to the serial console, but at least one can scroll backwards to find everything. All my testing is currently on SSD with lots of debug added to the driver.

A further update on my testing: with the fixes we've talked about (and no background cps), I'm not seeing any errors or corruption at all - but the test script has no background file copying. However, when cping to the same file in the background, there is data corruption. I've noted that as a potentially completely separate kernel filesystem bug, to be dealt with at another time. BUT, when I change the test script to copy multiple different files at the same time, I am seeing some data corruption on occasion. I don't know whether this is the same or a different buffer problem, or another filesystem bug. I'm using MINIX only (on SSD) to keep things simple. So there is still a big problem when multiple processes are writing data, IMO.

We need more test scripts for regression testing! :)

Here's my test script that occasionally shows problems:

set -x
set -e

ramdisk /dev/ssd kill
ramdisk /dev/ssd make 96
mkfs /dev/ssd 96
fsck -lvf /dev/ssd

mount /dev/ssd /mnt
cp /bin/p* /mnt &
sync &
cp /bin/l* /mnt &
sync &
df /dev/ssd
sync
wait
cmp /bin/partype /mnt/partype
cmp /bin/passwd /mnt/passwd
cmp /bin/poweroff /mnt/poweroff
cmp /bin/printenv /mnt/printenv
cmp /bin/ps /mnt/ps
cmp /bin/pwd /mnt/pwd
cmp /bin/ln /mnt/ln
cmp /bin/ls /mnt/ls
cmp /bin/login /mnt/login

umount /dev/ssd

fsck -lvf /dev/ssd
echo "TEST COMPLETE"

After running the script on serial console many times, I finally got:

+ cmp /bin/passwd /mnt/passwd
Files differ at byte position 3073

I really don't know where to start debugging this, so I thought to ask what you're thinking and seeing. I haven't yet determined whether removing the async sync might help or not.

There is one other bug I'm investigating, and that is I occasionally see the following:

ssd_io_complete: NULL request

This is from a new addition to my SSD driver I/O complete routine, where the CURRENT request is checked immediately upon entry:

/* called by timer interrupt */
void ssd_io_complete(void)
{
    struct request *req;
    int ret;

    ssd_timeout = 0;        /* stop further callbacks */

#ifdef CHECK_BLOCKIO
    req = CURRENT;
    if (!req) {
        printk("ssd_io_complete: NULL request\n");
        return;
    }
#endif

I just added the above code on a whim: there should never be a case where an I/O complete finds itself with no currently valid request. (end_request sets CURRENT to req->rq_next, which would be NULL if it's the last I/O in the queue, and end_request is called only by this function, not before it.) On the SSD driver, seeing NULL would mean that the current I/O request is dropped, because the I/O occurs at interrupt time. For the FD driver, such a thing would indicate that the next I/O request would be dropped, since the I/O occurred prior to the interrupt.

I have been studying the add_request and end_request code hard, and I think there may be a race condition where a request gets silently dropped, which this seems to indicate. I'm wondering whether you've seen anything like this in the data file corruption. What it could look like (if occurring in the middle of a file) would be junk disk data from the dropped file I/O request, followed by correct data. The inode would be correct. Or it could manifest where the fsck fails if the dropped I/O is in the filesystem inode area. More investigation happening.

Finally - I will have a fix shortly for the FAT filesystem running data through L1. This should be an easy fix. I may be able to get most of the metadata out of L1 as well. I'll keep you posted on that.

Mellvik commented 1 year ago

You've been busy @ghaerr - good work! And I believe you're right - there is still some buffer related bug lurking around. I'm getting some unexplainable errors when moving between machines. To be investigated.

Hello @Mellvik,

Interesting layered debugging printk output, I'll have to take a closer look at how you're doing that. I'm using plain voluminous printk to the serial console, but at least one can scroll backwards to find everything. All my testing is currently on SSD with lots of debug added to the driver.

Not much magic to my debug output really - I cleaned it up a bit for readability, that's all. Also, I'm constantly minimizing output, down to the character level, because the printks affect the sequence of things in the kernel and certainly affect timing a lot - very noticeable on real serial (mostly 57600 bps). I also make sure they're searchable (I use macOS Terminal search a lot to trace events). To me, using buffer head addresses and req addresses (instead of numbers 1 thru x) is key.

A further update on my testing: with the fixes we've talked about (and no background cps), I'm not seeing any errors or corruption at all - but the test script has no background file copying. However, when cping to the same file in the background, there is data corruption. I've noted that as a potentially completely separate kernel filesystem bug, to be dealt with at another time. BUT, when I change the test script to copy multiple different files at the same time, I am seeing some data corruption on occasion. I don't know whether this is the same or a different buffer problem, or another filesystem bug. I'm using MINIX only (on SSD) to keep things simple. So there is still a big problem when multiple processes are writing data, IMO.

As indicated above, I think this is a different but related issue. It may be worthwhile to skip the syncs and let the buffer reservoir fill up completely, then trigger a sync by itself. And also to experiment with the number of buffers available.

After running the script on serial console many times, I finally got:

+ cmp /bin/passwd /mnt/passwd
Files differ at byte position 3073

I really don't know where to start debugging this, so I thought to ask what you're thinking and seeing. I haven't yet determined whether removing the async sync might help or not.

Until you get a predictable test case, this is like the wild goose chase I had until enabling strace finally stabilized my situation. Try that - not because you need the info, but to change the scheduling. Also, it may be an idea to add an IOCTL to the ssd driver to enable changing the delay (or switch to real hardware... to see what happens...)

There is one other bug I'm investigating, and that is I occasionally see the following:

ssd_io_complete: NULL request

This is from a new addition to my SSD driver I/O complete routine, where the CURRENT request is checked immediately upon entry:

/* called by timer interrupt */
void ssd_io_complete(void)
{
    struct request *req;
    int ret;

    ssd_timeout = 0;        /* stop further callbacks */

#ifdef CHECK_BLOCKIO
    req = CURRENT;
    if (!req) {
        printk("ssd_io_complete: NULL request\n");
        return;
    }
#endif

I just added the above code on a whim: there should never be a case where an I/O complete finds itself with no currently valid request. (end_request sets CURRENT to req->rq_next, which would be NULL if it's the last I/O in the queue, and end_request is called only by this function, not before it.) On the SSD driver, seeing NULL would mean that the current I/O request is dropped, because the I/O occurs at interrupt time. For the FD driver, such a thing would indicate that the next I/O request would be dropped, since the I/O occurred prior to the interrupt.

I have been studying the add_request and end_request code hard, and I think there may be a race condition where a request gets silently dropped, which this seems to indicate. I'm wondering whether you've seen anything like this in the data file corruption. What it could look like (if occurring in the middle of a file) would be junk disk data from the dropped file I/O request, followed by correct data. The inode would be correct. Or it could manifest where the fsck fails if the dropped I/O is in the filesystem inode area. More investigation happening.

This is interesting indeed. I haven't seen this one. I've been juggling different machines these past days with varying results, and I need to organize my notes to see if there is anything real there.

Finally - I will have a fix shortly for the FAT filesystem running data through L1. This should be an easy fix. I may be able to get most of the metadata out of L1 as well. I'll keep you posted on that.

YES, this would be very useful. I haven't completed my test case without FAT yet, but the first looks were suspicious (on Minix only).

--M

ghaerr commented 1 year ago

when I change the test script to copy multiple different files at the same time, I am seeing some data corruption on occasion.

there is still some buffer related bug lurking around. I'm getting some unexplainable errors when moving between machines.

but the first looks were suspicious (on Minix only).

Went 8+ hours deep into a MINIX fs rabbit hole yesterday... and finally caught the rabbit! Wow, what an amazing journey. And realized there are many nuances to exactly how the kernel operates when async I/O is allowed. While the kernel is never re-entered, one needs to realize that the current task may get put to sleep when definitely not expecting it. Take for instance unmap_brelse - the reason for the latest major MINIX fs bug fix I've finally found - who would think the current task may get scheduled out during that call? unmap_brelse calls unmap_buffer, which essentially just decrements b_mapcount, and then calls brelse, which calls wait_on_buffer. Although I don't really understand why brelse needs to wait for any I/O to complete, if that buffer happens to be in any request queue, the current task will lose the CPU.
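
In code terms, the chain looks roughly like this (paraphrased shape, not the exact source):

void unmap_brelse(struct buffer_head *bh)
{
    unmap_buffer(bh);   /* essentially just decrements b_mapcount (wakes L1 waiters at 0) */
    brelse(bh);         /* calls wait_on_buffer(): if this buffer sits on a request queue,
                         * the current task sleeps right here, and any L1 pointer it was
                         * holding may be remapped to a different block before it wakes */
}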

After perfecting the SSD driver, for which I've added even more request queue checking, and changing my test script to a bash function that can easily be called hundreds of times, I finally found and nailed the (lol, hopefully last, again) MINIX fs bug, with now no failures. This most recent bug occurs whenever two processes might be writing to new (and separate) files at the same time and extending them. The specific bug is in the map_iblock code, which is responsible for adding a new block number to a file via the first indirect block. Here's the original source - see if you can spot the error (if you don't, no problem, it took me 8 hours to find it):

static unsigned short map_iblock(struct inode *inode, block_t i,
                                 block_t block, int create)
{   
    register struct buffer_head *bh;
    register block_t *b_zone;

    if (!(bh = bread(inode->i_dev, i))) {
        return 0;
    } 
    map_buffer(bh);
    b_zone = &(((block_t *) (bh->b_data))[block]);
    if (create && !(*b_zone)) {
        if ((*b_zone = minix_new_block(inode->i_sb))) {
            mark_buffer_dirty(bh);
        }
    }
    unmap_brelse(bh);
    return *b_zone;
}

Spoiler alert: the function returns the new block number via pointer indirection. That pointer is to the indirect block, which sits in a system buffer. That buffer was mapped to the kernel data segment (and thus L1) in order to be addressable. The problem is, unmap_brelse may cause a task switch, and what was happening is that other process pressure resulted in the same L1 buffer being reused, so that the b_zone pointer pointed not to an indirect block, but random data based on whatever happened to refill L1.
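
The fix is simply to copy the zone number into a local while the indirect block is still safely mapped, so nothing is read through the pointer after unmap_brelse may have slept - along these lines (the idea only; the committed patch may differ in detail):

static unsigned short map_iblock(struct inode *inode, block_t i,
                                 block_t block, int create)
{
    register struct buffer_head *bh;
    register block_t *b_zone;
    block_t zone;                       /* local copy of the result */

    if (!(bh = bread(inode->i_dev, i))) {
        return 0;
    }
    map_buffer(bh);
    b_zone = &(((block_t *) (bh->b_data))[block]);
    if (create && !(*b_zone)) {
        if ((*b_zone = minix_new_block(inode->i_sb))) {
            mark_buffer_dirty(bh);
        }
    }
    zone = *b_zone;     /* read it while bh is still mapped and valid */
    unmap_brelse(bh);   /* may sleep; the L1 slot can be reused from here on */
    return zone;        /* no longer dereferences the (possibly stale) pointer */
}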

The second change required to get the MINIX fs functioning without error after a thousand runs of async file creates/writes was to move wait_on_buffer to be the very first thing map_buffer does:

void map_buffer(struct buffer_head *bh)
{
    ext_buffer_head *ebh = EBH(bh);
    int i;

    /* wait for any I/O complete before mapping to prevent bh/req buffer unpairing */
    wait_on_buffer(bh); // <--- new location required for no errors

    /* If buffer is already mapped, just increase the refcount and return */
    if (bh->b_data /*|| ebh->b_seg != kernel_ds*/) {
        if (!ebh->b_mapcount)
            debug("REMAP: %d\n", buf_num(bh));
        goto end_map_buffer;
    }

    //wait_on_buffer(bh); // <-- old location

What was happening is that I had attempted an "optimization": waiting for I/O to complete only if the buffer wasn't already mapped, as an attempt at speedup. But the problem - likely close to Linus' comment in the v2.0 sync code - is that the permanently-kept MINIX Z/I-map system buffers are specially coded for speed, and they may get dirtied without the usual brelse call, which is what synchronizes with the physical disk via its wait_on_buffer. Now, any call to map_buffer guarantees physical I/O sync, even if the buffer is already mapped!

I will be cleaning up all this and posting a PR to ELKS shortly. It contains the above two fixes required for error-free MINIX fs operation, plus some tweaks to the SSD driver. A big note about the async I/O complete callback routine: in the first version of the SSD driver, ssd_io_complete would loop through requests until there were no more, instead of only handling the next request. This had the effect of increasing the speed of I/O, which as you know changes all sorts of things. Most of the block drivers are coded to operate in for(;;) loops. The net learning from all this is that I now recommend using return after end_request(1) and letting the request queue be emptied normally. This also fixed the occasional ssd_io_complete: NULL request messages seen. This is something to keep an eye on in the FD driver, as it must be ensured there is in fact another interrupt coming (that is, EOIs must match each request in the request queue).
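
In code terms, the completion shape I now recommend is roughly this (a sketch using the names from this discussion; dev_io_complete is a made-up name):

/* Service exactly one request per (real or simulated) interrupt, then return
 * and let the queue drain normally.  Do NOT loop here emptying the whole
 * queue unless it's certain no further interrupt is coming for the requests
 * being completed. */
void dev_io_complete(void)
{
    struct request *req = CURRENT;

    if (!req) {                 /* interrupt with nothing queued: report it, don't crash */
        printk("dev_io_complete: NULL request\n");
        return;
    }
    /* ... move or verify the data for this one request ... */
    end_request(1);             /* marks req done, advances CURRENT to req->rq_next */
    /* return here - the next request gets its own start and its own interrupt */
}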

There remain two problems. First, with MINIX, we still have a problem when writing to THE SAME FILE at the same time: while the file contents end up correct with my latest fix, fsck afterwards reports an unused file block entry on the filesystem. I'm not sure whether this is preventable or not. The other problem is the FAT driver running all file data through L1, as previously discussed. While not a bug, I'm still planning a fix for that.

Mellvik commented 1 year ago

Well, that's a cork popper if I ever saw one, @ghaerr! Congrats - a remarkable catch!

And no, I did not catch the bug from the first listing :-) - that's subtle indeed!

I'm very much looking forward to the fix. It may just be the reason for the sporadic FS screwups I've been seeing these last few days - as the previous bugs have been weeded out.

Spoiler alert: the function returns the new block number via pointer indirection. That pointer is to the indirect block, which sits in a system buffer. That buffer was mapped to the kernel data segment (and thus L1) in order to be addressable. The problem is, unmap_brelse may cause a task switch, and what was happening is that other process pressure resulted in the same L1 buffer being reused, so that the b_zone pointer pointed not to an indirect block, but random data based on whatever happened to refill L1.

This is indeed a hard one to see even when served on a platter. Really great work.

The second change required to get the MINIX fs functioning without error after a thousand runs of async file creates/writes was to move wait_on_buffer to be the very first thing map_buffer does:

void map_buffer(struct buffer_head *bh)
{
    ext_buffer_head *ebh = EBH(bh);
    int i;

    /* wait for any I/O complete before mapping to prevent bh/req buffer unpairing */
    wait_on_buffer(bh); // <--- new location required for no errors

    /* If buffer is already mapped, just increase the refcount and return */
    if (bh->b_data /*|| ebh->b_seg != kernel_ds*/) {
        if (!ebh->b_mapcount)
            debug("REMAP: %d\n", buf_num(bh));
        goto end_map_buffer;
    }

    //wait_on_buffer(bh); // <-- old location

What was happening is that I had attempted an "optimization": waiting for I/O to complete only if the buffer wasn't already mapped, as an attempt at speedup. But the problem - likely close to Linus' comment in the v2.0 sync code - is that the permanently-kept MINIX Z/I-map system buffers are specially coded for speed, and they may get dirtied without the usual brelse call, which is what synchronizes with the physical disk via its wait_on_buffer. Now, any call to map_buffer guarantees physical I/O sync, even if the buffer is already mapped!

In a way this one is even harder - because you have to know (and remember) that oddity about brelse being bypassed.

I will be cleaning up all this and posting a PR to ELKS shortly. It contains the above two fixes required for error-free MINIX fs operation, plus some tweaks to the SSD driver. A big note about the async I/O complete callback routine: in the first version of the SSD driver, ssd_io_complete would loop through requests until there were no more, instead of only handling the next request. This had the effect of increasing the speed of I/O, which as you know changes all sorts of things.

Interesting, it has been my intention to change the floppy driver to do this - clean out the req queue while at it.

Most of the block drivers are coded to operate in for(;;) loops. The net learning from all this is that I now recommend using return after end_request(1) and letting the request queue be emptied normally. This also fixed the occasional ssd_io_complete: NULL request messages seen. This is something to keep an eye on in the FD driver, as it must be ensured there is in fact another interrupt coming (that is, EOIs must match each request in the request queue).

Hmmm, I'm not sure I get this. What's actually the benefit of the 'normal' way of emptying the queue?

There remain two problems. First, with MINIX, we still have a problem when writing to THE SAME FILE at the same time: while the file contents end up correct with my latest fix, fsck afterwards reports an unused file block entry on the filesystem. I'm not sure whether this is preventable or not.

The problem is the unused block, right? The thing is (my perspective) that this is a 'not supposed to happen' situation, in that the result (unless app-level locking is used) is unpredictable as far as content is concerned. That said, the extra block is definitely a bug, and AFAIK one to put on the issues list for a rainy day - unless it's likely to have other consequences as well.

The other problem is the FAT driver running all file data through L1, as previously discussed. While not a bug, I'm still planning a fix for that.

I'm very interested in this one. I'm still looking into the issue with the data buffers more or less randomly using L1 or L2 addresses when entering the REQ queue. Using MINIX fs as the source for the copy operation instead of FAT exacerbated the 'problem' (symptom rather), not the opposite. So I'm heading into this one to understand exactly what happens.

Possibly related, although I don't think so, is the observation that unmap_buffer quite frequently is called many times on the same BH address with the map count not changing and map_buffer not called in between. E.g. running ps with a printk in unmap_buffer looks remarkable - and really triggered my curiosity:

void unmap_buffer(struct buffer_head *bh)
{
    if (bh) {
        ext_buffer_head *ebh = EBH(bh);

        printk("UB:%x/%d;", bh, ebh->b_mapcount);
        if (ebh->b_mapcount <= 0) {
...
tlvc15# ps
UB:e1b0/1;UB:e270/1;UB:e2d0/1;UB:e1b0/1;UB:e210/1;  PID   GRP  TTY USER STAT CPU  HEAP  FREE   SIZE COMMAND
UB:e1b0/1;UB:e378/1;    1     0      root    S   0  3072  2010  12928 /bin/init 3 
UB:e1b0/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;   13    13    1 root    S   0  1188 13332  71344 -/bin/sh 
UB:e1b0/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;   14    14   S0 root    S   0  1188 13410  71344 -/bin/sh 
UB:e1b0/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;   21    14   S0 root    R   0  1024  1182  11360 ps 
    7     7      root    S   0  1024 34750  75824 ktcp -b -p ne0 10.0.2.15 10.0.2.2 255.255.255.0 
    9     9      root    S   0     0  1994   9760 telnetd 
UB:e1b0/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;
UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;
UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;
UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;UB:e210/1;UB:e1b0/1;UB:e210/1;   11     2   S0 root    S   0     0  7318  28720 ftpd -d 
tlvc15# 
ghaerr commented 1 year ago

Hmmm, I'm not sure I get this. What's actually the benefit of the 'normal' way of emptying the queue?

I suppose I got convinced that the 'normal' way was better since, when I had the SSD driver coded to clean out the queue during a single interrupt, the MINIX fs bug surfaced in a repeatable way. I became very concerned that the SSD driver was "losing" a request, and went to great lengths reviewing the queue/dequeue code, but was never able to show any I/O being dropped (it would be quite insidious, as the I/O would disappear and never be marked complete, possibly just leaving the sync_buffer task asleep). Then I realized that, ONLY with the SSD driver, it didn't actually matter, as long as the next faked interrupt was matched to another I/O request. The I/O would be quicker by cleaning out the queue, and that ended up triggering the latest bug. But the whole purpose of adding an IODELAY to the driver was to simulate floppy I/O. With the SSD driver, the SSD 'interrupts' are actually HW timer callbacks made up by the driver itself. The FD driver is much more restricted in its allowed mode of operation, due to real HW interrupts; see below.

Interesting, it has been my intention to change the floppy driver to do this - clean out the req queue while at it.

Bottom line is that the driver can only "clean the request queue out" as long as the newly current request won't have a future interrupt coming for it, in which case it has to stay as the current request.

For the SSD and FD drivers, one has to be absolutely sure that each expected 'next' interrupt is matched with the current request, and that the next request, if found, stays queued if another interrupt is expected for its completion. Basically, any incoming interrupt needs to match the head of the current request queue at all times. I ended up adding explicit code in the SSD driver to display "request NULL" etc. to make sure I finally got it exactly right. If they get mismatched, there's the possibility of an I/O request being dropped entirely or not serviced correctly by the next interrupt.

unmap_buffer quite frequently is called many times on the same BH address with map count not changing and map_buffer not called in between.

No, unmap_buffer definitely decrements b_mapcount every time through, which eventually brings it to 0. Then map_buffer absolutely must be called again before the buffer is used; they must be paired. See the following code:

void unmap_buffer(struct buffer_head *bh)
{   
    if (bh) {
        ext_buffer_head *ebh = EBH(bh);

        if (ebh->b_mapcount <= 0) { // <-- this is just a debug comment if ever called with 0 mapcount
            printk("unmap_buffer: %d BAD mapcount %d\n", buf_num(bh), ebh->b_mapcount);
            ebh->b_mapcount = 0;
        } else if (--ebh->b_mapcount == 0) { // <-- this decrements the count every time, and wakes up anyone waiting!
            debug("unmap: %d\n", buf_num(bh));
            wake_up(&L1wait);
        } else
            debug("unmap_buffer: %d mapcount %d\n", buf_num(bh), ebh->b_mapcount+1);
    }
}

What you're seeing is the fact that tons of kernel code ends up using the same buffer many, many times, so it's good we have an L1 cache. (But mapping is only actually needed if the data contents themselves have to be accessed.) This is why all the MINIX Z/I maps are buffered and not released until unmount. Mounting a 65K MINIX HD uses 8 full buffers for Z-maps and at least a couple for the I-map table. The tables are bitmaps, one bit for each used or free block/inode in the filesystem.

The problem is the unused block, right? The thing is (my perspective) that this is a 'not supposed to happen' situation, in that the result (unless app-level locking is used) is unpredictable as far as content is concerned. That said, the extra block is definitely a bug, and AFAIK one to put on the issues list for a rainy day - unless it's likely to have other consequences as well.

Yes - that's the bug. I've added it to my list.

Mellvik commented 1 year ago

Sorry, I said 'floppy driver' while thinking of the HD driver, which is the one in the process of becoming interrupt driven. The floppy driver already does this.

Can you post the updated ssd driver?

I'll dive more into the unmap_buffer case. Like I said, map_buffer is not being called, so there has to be a remap, but the unmaps don't add up (yet). :-)

-M

ghaerr commented 1 year ago

Can you post the updated ssd driver?

Yes, it's posted as PR 1651, and will be committed shortly, where it'll be easier to look at. Some source diffs are getting larger as I've started the process of re-tabbing the source to 4-character tabs using spaces only, to eliminate the very aggravating problem of half-4-char and half-8-char tabs mixed in the same file throughout the entire source base!

Like I said, map_buffer is not being called, so there has to be a remap, but the unmaps don't add up (yet). :-)

Yes - if you find an unmatched map_buffer/unmap_buffer in the source, please point me to it! (There could be many cases where they don't always match up during async kernel operation, especially when unmap_brelse is called, as described above).

ghaerr commented 1 year ago

I'm still looking into the issue with the data buffers more or less randomly using L1 or L2 addresses when entering the REQ queue.

I rewrote the FAT filesystem read/write (and exec) code to use L2 buffers directly rather than copying all I/O data into and out of L1 buffers. See ELKS PR 1652. After adding that change, you shouldn't see any FAT file data going through L1 (although metadata will still).

I also found immediate stack overflow when using FAT, async I/O and /bin/sh. Increasing the kernel stack size from 640 to 700 solved that problem for the time being, with a measured stack usage of 668 bytes max.

ghaerr commented 1 year ago

Hello @Mellvik,

map_buffer is not being called, so there has to be a remap, but the unmaps don't add up (yet). :-)

I added a kind of cool display of all system buffers, which I thought you might find useful. I am using it as part of a scheme to more fully understand what is happening during normal kernel operation. This is in PR 1655. The display shows Locked, Dirty and Uptodate bits, along with whether or not a buffer is mapped into L1 (shown as L1-L12) and its mapcount and buffer use count:

#  1: buf   7 dev 0x0200 block    44   U mapped L12 0 count 0
#  2: buf  17 dev 0x0200 block     5   U mapped L10 0 count 0
#  3: buf  10 dev 0x0380 block    27   U mapped L08 0 count 0
#  4: buf  18 dev 0x0380 block    29   U mapped L09 0 count 0
#  5: buf   3 dev 0x0200 block    43   U mapped L06 0 count 0
#  6: buf  19 dev 0x0200 block    42   U mapped L05 0 count 0
#  7: buf   6 dev 0x0200 block    41 LDU mapped L03 0 count 1
#  8: buf  16 dev 0x0200 block     4   U mapped L07 0 count 0
#  9: buf  15 dev 0x0200 block    40   U mapped L01 0 count 0
# 10: buf  12 dev 0x0200 block    39 L U mapped L00 0 count 0
# 11: buf   4 dev 0x0200 block    38 L U mapped L00 0 count 0
# 12: buf  14 dev 0x0200 block    37 LDU mapped L00 0 count 1
# 13: buf   5 dev 0x0200 block    34  DU mapped L00 0 count 0
# 14: buf  13 dev 0x0200 block    32   U mapped L02 0 count 0
# 15: buf   8 dev 0x0200 block     3  DU mapped L04 0 count 1
# 16: buf  11 dev 0x0200 block     2  DU mapped L11 0 count 1
# 17: buf   9 dev 0x0200 block     1   U mapped L00 0 count 1
# 18: buf   2 dev 0x0380 block     3   U mapped L00 0 count 1
# 19: buf   1 dev 0x0380 block     2   U mapped L00 0 count 1
# 20: buf   0 dev 0x0380 block     1   U mapped L00 0 count 1
Total buffers inuse 11/20

The code for this is:

static void list_buffer_status(void)
{
    int i = 1;
    int inuse = 0;
    int isinuse, j;
    struct buffer_head *bh = bh_llru;
    ext_buffer_head *ebh;

    do {
        ebh = EBH(bh);
        isinuse = ebh->b_count || ebh->b_dirty || ebh->b_locked || ebh->b_mapcount;
        if (isinuse || bh->b_data) {
            j = 0;
            if (bh->b_data) {
                for (; j<NR_MAPBUFS; j++) {
                    if (L1map[j] == bh) {
                        j++;
                        break;
                    }
                }
            }
            printk("#%3d: buf %3d dev %D block %5ld %c%c%c mapped L%02d %d count %d\n",
                i, buf_num(bh), ebh->b_dev, ebh->b_blocknr,
                ebh->b_locked?  'L': ' ',
                ebh->b_dirty?   'D': ' ',
                ebh->b_uptodate?'U': ' ',
                j, ebh->b_mapcount, ebh->b_count);
        }
        i++;
        if (isinuse) inuse++;
    } while ((bh = ebh->b_prev_lru) != NULL);
    printk("Total buffers inuse %d/%d\n", inuse, nr_free_bh);
}

Interestingly, I found a lot fewer buffers in use than I had imagined (only 3 on an idle system).

This should help you track down the remaining L1 and other async I/O problems you're seeing. It may want to be expanded to display the buffer segment and addresses, and also match them up with the block driver request queue. This should be fairly straightforward by looping through the request queue matching addresses.
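
Something along these lines could do the matching (a sketch; apart from rq_next, the request field names are assumptions to check against the real headers):

/* For a given buffer, scan a driver's request queue and report whether its
 * data address is currently queued for I/O. */
static void show_queued_req(struct request *head, struct buffer_head *bh)
{
    struct request *req;

    for (req = head; req; req = req->rq_next)
        if (req->rq_buffer == bh->b_data)   /* same data address queued? */
            printk(" (queued, req %x)", req);
}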

Mellvik commented 1 year ago

Thank you @ghaerr, right down my alley for sure. Right now I'm looking at the buffer movements when copying a file from FAT to flpy vs. MINIX to flpy - it's very interesting (not necessarily sensible). This should help. I need to get the updated FAT code over, probably a completely new scenario.

-M
