Floppy track buffer revisited

Mellvik / TLVC

Tiny Linux for Vintage Computers

Other

9 stars 0 forks source link

Floppy track buffer revisited #81

Open Mellvik opened 6 days ago

Mellvik commented 6 days ago

Following the discussion in #80 which strayed into the usefulness of the floppy track buffer in TLVC/ELKS, I conducted a bunch of tests - all on XT class hardware using a 3.5in 720k drive and a classic (pure) 360k drive.

The results were unsurprising but useful for another round of discussions, qualified by the size of the track buffer, 9k bytes, in a system where every RAM byte counts. Here are the practical observations:

The track buffer has minimal effect on normal filesystem use. The difference between with-buffer and without-buffer is well below the margin of error.
The track buffer has significant effect when the system is starting, reducing the time to load /bin/init and other programs, typically ending with /bin/getty by 50%, sometimes more.
When, in normal file system use, a big program like vi is loaded, the track buffer advantage is noticeable, although less than 50%, more like 10-20%.

The results bring to the table the usual considerations and some new questions and ideas:

Being really useful only when used as a boot/root file system, the track buffer could be a compile time option.
The track buffer's benefits when using floppies for archiving and image transfers are gone with the arrival of the raw device driver, which - with proper use of the dd command - allows for higher efficiency than the track buffer could ever give.
Adding read-ahead to the buffer system is a low hanging fruit since the code is in the system already, and could deliver an improvement in both system load time and general use.
Keeping, but cutting the track buffer in half is an interesting compromise. 4.5k bytes freed up, and - for the (XT) setting at hand - only half the buffer is used anyway.
It may also be possible - although it would require some smart coding - to make changes so that the track buffer can be released after system load is complete. Either as part of the heap or in a far section of memory.
As an alternative to the read-ahead buffer option, it may be possible to satisfy a request and then let the driver fill the track buffer 'in the background' from that particular track, thereby eliminating the delay introduced by reading an entire track when just 2 sectors are needed - and still have the entire track cached. This could be combined with the reduction of the track buffer by 50% and make up a quite efficient - although somewhat code intensive - solution.

And finally the easiest option - leave it as it is under the motto 'if it ain't broken, don't fix it'. Admittedly I don't like that. In this day and time, making a big deal out of 9k bytes may seem borderline insane, but TLVC and ELKS address a different reality. The world as of pre-1990. When resources were scarce and low level smartness not only valuable but required.

ghaerr commented 5 days ago

Thank you for this very concise and nice analysis. The results and your suggestions indicate a couple very good possibilities, with of course the understanding that various users have differing priorities, which probably should remain available. I'll comment on each of your ideas/findings below.

The track buffer has minimal effect on normal filesystem use. The difference between with-buffer and without-buffer is well below the margin of error.

How many EXT buffers were configured? While I'm initially a bit surprised by this result, it seems to show that track buffering is not needed during normal operations of the system, and we have EXT and XMS buffers that can greatly help in the normal (non-boot) case.

The track buffer has significant effect when the system is starting, reducing the time to load /bin/init and other programs, typically ending with /bin/getty by 50%, sometimes more. When, in normal file system use, a big program like vi is loaded, the track buffer advantage is noticeable, although less than 50%, more like 10-20%.

These two combined seem to show that the benefits of track buffering are apparent when the file blocks are contiguous, which is almost always the case when reading/running a number of executables right after one another. Since the blocks have never been read before, they won't be in the EXT buffers.

Being really useful only when used as a boot/root file system, the track buffer could be a compile time option.

On ELKS, CONFIG_TRACK_CACHE is still an option, and it seems that adding back that option for TLVC would be very nice for being able to turn off the option to get the memory back when having the most main memory available is important. Max main memory is important in at least two cases currently talked about by users: running large games (e.g. Doom) and running lots of network daemons, telnet sessions etc (along with increased task=).

As a result of your testing, I am going to suggest that the ELKS Doom users configure for no track caching, which will increase main memory by 9K (we're still not able to run a special demo that requires max mem). It will be interesting to learn how much slower reading a 156k executable from disk takes without caching).

It may also be possible - although it would require some smart coding - to make changes so that the track buffer can be released after system load is complete. Either as part of the heap or in a far section of memory.

Actually, adding back the track buffer after boot may not that difficult - all that is needed is a single call to seg_add with the start end end segment numbers, and the ability to turn off caching in the block driver. Of course, this would have to be done using a special mechanism way after kernel init, perhaps with a program the user runs or a special program run in the last line of /etc/rc.sys.

The released memory can't be added into the kernel data section, but would be made available as another far section of main memory, just like the umb= sections are. This extra 9k would be possibly useful to run telnet or other servers in, but it won't be contiguous and thus less or not useful for big games.

Adding read-ahead to the buffer system

My take is that read-ahead will not be useful except for reading executables. So that functionality would be turned on, then off, in fs/exec.c.

As an alternative to the read-ahead buffer option, it may be possible to satisfy a request and then let the driver fill the track buffer 'in the background' from that particular track,

If the track cache is retained, it would seem to me that just continuing to use it normally would be better than developing a read-ahead solution, especially since system performance is likely to improve only for executable loading. I would say read-ahead would be preferable to a semi-track-cache-on/off option.

Also, when reading executables, this is very much tied to the number of EXT/XMS buffers available and whether the program has run before. If the system is actually a newer 386 with 2M RAM, then just setting a large XMS buffer count probably solves all these issues without caching or read-ahead. OTOH, if the system is ancient, things will be slow anyways. The middle ground, a system with 640k with no track caching would probably have to be configured for 8-64k EXT buffers, which somewhat throws a wrench into the idea of saving RAM for other applications. A possible solution here would be to use something akin to using one of the executable "mode" bits to indicate a priority for saving in an EXT buffer; this could get complicated to implement though.

I have another idea cooking which involves using HMA for disk buffers, but that only helps users with 1M+ RAM, and XMS handles that case.

This seems complicated. Overall, it a user doesn't mind waiting for executables to load and needs max main memory, turning off track cache and running with tuned EXT/XMS buffers is what the system is designed for, without having to add special handling at the low level of the block driver. Getting smarter about using the L2 buffer cache might be a better first alternative to adding read-ahead. We already know that L1 use is minimized now, but I'm not sure the L2 usage is well known.

Keeping, but cutting the track buffer in half is an interesting compromise. 4.5k bytes freed up,

I personally don't see how using half a track buffer buys much at all, unless the testing data is showing otherwise. The saved 4.5-9k would be IMO better allocated to EXT buffers and let the buffer system do its job, with other tuning added there if needed.

Thank you for the test results, very interesting discussion!

In conclusion, I'm wondering whether the best option might be to have track caching turned OFF by default, and let the system boot a bit slower, but reap the benefits of extra memory after boot. No other fancy development needed, until further tuning, including mkfs/mfs disk/inode layout and L1/L2 tuning determines otherwise. If read-ahead is easily added, it might be worth doing, but there's no guarantee that it will actually increase throughput, depending on exactly how fast the fs/exec.c handles the data. IIRC fs/exec.c read the code and data section as a single I/O request, then reads the relocation data w/separate read requests (buffered of course by the buffer system) 8 bytes at a time. (We also have compressed executables which decrease file size by ~33%. This is the standard distribution format now for ELKS).

making a big deal out of 9k bytes may seem borderline insane

Heck no, I worry about 1K bytes extra per executable to keep the load time(s) down. 9k bytes is a big deal.

Mellvik commented 5 days ago

Thank you @ghaerr. It occurs to me that what makes this discussion worthwhile (and interesting) is the fact that we're approaching it from entirely different angles. Somewhat simplified - mine is from the XT side (low end, 640k max, maybe less), yours is from the 386 and more resourceful end.

I agree with most of your conclusion although not necessarily how to get there (that's what makes it interesting, isn't it?). A few points:

The discussion is relevant only if a floppy is used to boot and run the system. This is unlikely for a high end system except in a debugging situation or a crisis (something broke), IOW a short term thing. On a low end XT type system it may be a permanent or semi-permanent situation (no disk or no driver available), and optimization becomes a necessity.
XT type/level systems have 360k or 720k floppies. These floppies have 9 sectors per track, so the 18 sector track buffer is always 50% unused. This was the motivation for introducing the full cylinder cache a while back, using the entire buffer, which turned out to be less than helpful. And this is the rationale behind my suggestion to possibly reduce the track buffer by 50%. 4.5K wasted all the time is bad. A possible solution is - if the track buffer is kept - to make the size a compile time/configuration option too.
That said, it may be argued that if a 9 sector track buffer is good enough for the systems that need the performance boost most, it's good enough for all.
I think the assumption that performance is primarily related to loading progams is wrong. Most programs and most of the work we do on the system, manipulate data, i.e. files, in many (most?) cases many files per invocation of a program. Copying, filtering, merging, counting, summing, splitting, archiving, scripting etc etc etc. So it's a mix and the system is file access intensive meaning there is frequent access to metadata which in turn is purging the track buffer. Thus my benchmarks were doing file copying, to/from/between floppy disks and always purging the buffer system in between. The test were confirming what I'd seen in general use of the system over weeks.

Turning attention to the system (EXT) buffers, I've been monitoring their behaviour off and on (mostly while looking for problems) since we worked on them last year (possibly earlier this year). Unsurprisingly a low number of L2 buffers have litte effect as they're being purged all the time. When able to keep metadata between transactions, the effect becomes very significant. It may be interesting to look closer at this to get some firm numbers, but again the results/effects would be very dependent on usage and thus not necessarily generally representative in any way. I would think though, and I seem to remember we've touched on this before, that adding something akin to a sticky bit to metadata buffers may have interesting effects - the root directory and its inodes in particular.

As to read ahead, I've been planning to look at what's in the sources already, to see - possibly test - if it might be useful. On slow systems (or slow subsystems like floppies and MFM disks, possibly older IDE disks) it should make a difference. There's a good reason such read ahead was always used by OSes before the drives themselves got caches.

It would be useful to discuss and agree on a (set of) benchmark(s) to use for some of these tests, in particular when we get into L1/L2 tuning.

Finally, back to the floppy driver: With the track buffer gone or optional, the 1k bounce buffer will return, this time possibly as an heap allocation at open time. That way another 1k is saved unless floppies are actually in use.

ghaerr commented 5 days ago

Very good points, I am in agreement.

These floppies have 9 sectors per track,

I missed that - of course - I now see that a 4.5k track buffer makes good sense, especially for the slowest (XT) systems. It would be interesting to learn how well a 50% track buffer works for larger floppies.

the assumption that performance is primarily related to loading progams is wrong.

there is frequent access to metadata which in turn is purging the track buffer.

Yes - this is what I was trying to say - the performance increase from the track buffer is primarily when loading executable programs, as the track cache reads consecutive sectors useful for exec loading, vs normal operation, where metadata causes buffer purge.

I've been planning to look at what's in the sources already

It's very minimal, and there's two versions: one is a simple block readahead in the buffer.c code, which IIRC just schedules the next file block to be read into a buffer, and the other is the MULTIBH stuff in ll_rw_blk.c which would be a lot more complicated to get working, as it tries to assemble an array of buffer headers for which to do I/O into. It is dubious what the benefit of that would be for us, as the DMA isn't directly across each of those blocks, and there's also the problem of finding a bunch of free buffer headers in a system that is likely tight on buffers in the first place.

The above is why I was kind of suggesting the idea of splitting thoughts between the buffer system doing its job better vs trickery below the buffer level in the block driver.

[EDIT: Now that I think of this a bit more, I can see how the MULTIBH multi-buffer I/O code would be a good solution for early 32-bit Linux, as it solves the problem of reading longer stretches of (usually contiguous) file data ahead of time, and doesn't require additional unused "track cache" for storage, which is instead allocated towards normal system buffers. Perhaps the multi-bh approach should be considered more seriously, providing that the DF driver can read consecutive sectors in SEPARATE I/O requests as fast as using a single I/O request for multi-sectors.]

With the track buffer gone or optional, the 1k bounce buffer will return, this time possibly as an heap allocation at open time. That way another 1k is saved unless floppies are actually in use.

In ELKS, DMASEG is always present, as there's no current way to allocate a kernel data segment buffer that doesn't possibly cross a 64k boundary, depending on exactly where kernel DS was loaded. (In almost all cases its OK, until someone builds a kernel that happens to create the kernel data section such that the static or allocated buffer appears in the wrong place. And IIRC its only the very early systems that had DMA chips that didn't have the address lines and wrapped I/O). The buffer allocation code does ensure that the EXT buffers, which are of course outside the kernel data segment, are always aligned to a 1K boundary, thus avoiding DMA wrap.

Mellvik commented 4 days ago

In ELKS, DMASEG is always present, as there's no current way to allocate a kernel data segment buffer that doesn't possibly cross a 64k boundary, depending on exactly where kernel DS was loaded.

OK, maybe that's not so bad. Just shrink the buffer to 1k and keep the logic.

And IIRC its only the very early systems that had DMA chips that didn't have the address lines and wrapped I/O).

I'm not getting this, the chip has always been the same ...

Perhaps the multi-bh approach should be considered more seriously, providing that the DF driver can read consecutive sectors in SEPARATE I/O requests as fast as using a single I/O request for multi-sectors.

it can not, there will always be a revolution in between (200ms). Sounds complicated, but I'll take a look and see if I understand it...

benchmarking the effect (or lack of such) of the simple variant is a low hanging fruit, I'll do that with the track buffer off and see what we get.

thanks @ghaerr, this is good.

ghaerr commented 4 days ago

I'm not getting this, the chip has always been the same ...

This issue goes waaayyy back with ELKS, and got more complicated because its not just the hardware, but different BIOS INT 13h dealt with the problem differently. Here's some more information on what I'm talking about - I was slightly incorrect - it's not the DMA chip but another motherboard chip that is/was used that stores the upper 4 bits of the physical address that causes the problem.

From my memory, there are at least these different cases that need to be specially handled with regards to this issue:

BIOS INT 13h read sector calls must ensure that I/O is not requested crossing a 64k physical barrier.
Direct Floppy driver needs to ensure the same thing
EXT buffers are aligned to block (1K) boundary to ensure block read/writes don't cross 64k barrier
XMS buffer read/write uses DMASEG since actual buffer isn't directly accessible (but still may use track cache)
Application read/writes to user mode buffer may need bounce buffer if not aligned (almost never will be aligned and may still use track cache).

The BIOS cases are handled through a two function layers and handled "below" the driver layer. IIRC this was also done for DF driver, at least in my version (copied from your original version which worked/tested for XMS and DMA overlap).

maybe that's not so bad. Just shrink the buffer to 1k and keep the logic.

Yes - IMO far better to have the 1K outside the kernel data segment, then use two function layer for all floppy I/O that automatically redirects to bounce buffer only when needed (EXT no problems they're aligned, XMS always, user buffer maybe).

there will always be a revolution in between (200ms).

In that case, a multi-buffer-header solution using multiple I/O requests won't work well and I don't even recommend considering it.

However - I am thinking of an alternative approach never coded for Linux, that might work for us quite well: add a function layer "above" the I/O request layer that tries to gather as large as possible actually contiguous EXT buffers and then use a single (specially modified) buffer header to schedule I/O into all the buffers at once. If it were XMS or had alignment issues, the special handling would be prohibited. Otherwise, the EXT buffers are already guaranteed aligned so a multi-sector read could be easily sent to the FDC or BIOS.

The possible hard part here would be writing the routine (possibly calling sync_buffers to flush all buffers if needed) to try to identify then grab, as many contiguous buffers as possible. The nice part would be that we would be using all EXT buffers for the multi-sector reads, and there would be no "purge" like we have today when metadata gets in the way. We should think more about this, and I'm definitely interested in your offer to benchmark track buffer performance as we might be able to duplicate that performance with this kind of enhancement.

Once the multi-sector I/O was complete, the buffer system would handle the buffers exactly as it does now. It would be guaranteed that at least the requested single buffer was read, and there may or may not be readahead/track cache depending on what the identification routine could do at that moment. As an add-on, aggregated multi-sector writes could also be handled this way for speed using the raw driver.

ELKS doesn't have a raw block driver, but TLVC would benefit and essentially the multi-sector issues and track caching for both the block and char drivers would all be handled through one upper level function that would find contiguous buffers, and then the driver would handle the multi-sector request. The lower two levels would handle the XMS and bounce buffering, in which case the multi-sector request would basically be ignored, and those reserved buffers would be released without I/O on them.

Mellvik commented 3 days ago

I'm not getting this, the chip has always been the same ...

This issue goes waaayyy back with ELKS, and got more complicated because its not just the hardware, but different BIOS INT 13h dealt with the problem differently. Here's some more information on what I'm talking about - I was slightly incorrect - it's not the DMA chip but another motherboard chip that is/was used that stores the upper 4 bits of the physical address that causes the problem.

From my memory, there are at least these different cases that need to be specially handled with regards to this issue:

BIOS INT 13h read sector calls must ensure that I/O is not requested crossing a 64k physical barrier.

Direct Floppy driver needs to ensure the same thing

EXT buffers are aligned to block (1K) boundary to ensure block read/writes don't cross 64k barrier

XMS buffer read/write uses DMASEG since actual buffer isn't directly accessible (but still may use track cache)

Application read/writes to user mode buffer may need bounce buffer if not aligned (almost never will be aligned and may still use track cache).

The BIOS cases are handled through a two function layers and handled "below" the driver layer. IIRC this was also done for DF driver, at least in my version (copied from your original version which worked/tested for XMS and DMA overlap).

I thought so - had to check. There is really only one thing to keep in mind with DMA - as I pointed out in a the long thread we had when working on the buffer system update a while back (https://github.com/Mellvik/TLVC/pull/19#issuecomment-1687769468) The DMA chip is an 8bit family chip. It has no clue about addresses beyond 64k regardless. The page register - each bit pulling an address line, A16-A20 (in the XT case) that the DMA controller knows nothing about. Smart then, clumsy now ...

However - I am thinking of an alternative approach never coded for Linux, that might work for us quite well: add a function layer "above" the I/O request layer that tries to gather as large as possible actually contiguous EXT buffers and then use a single (specially modified) buffer header to schedule I/O into all the buffers at once. If it were XMS or had alignment issues, the special handling would be prohibited. Otherwise, the EXT buffers are already guaranteed aligned so a multi-sector read could be easily sent to the FDC or BIOS.

The possible hard part here would be writing the routine (possibly calling sync_buffers to flush all buffers if needed) to try to identify then grab, as many contiguous buffers as possible. The nice part would be that we would be using all EXT buffers for the multi-sector reads, and there would be no "purge" like we have today when metadata gets in the way.

Interesting. This could be done. And I have the same concern - how hard would it be to collect such a collection of buffers and not the least, how big would the buffer cache need to be in order to make it practical/useful. Like, the metadata will be as much of a disturbance as before unless the cache is big enough to hold both (I've been running with 24k L2 cache for quite some time - the blessings of an XT level system :-) ).

We should think more about this, and I'm definitely interested in your offer to benchmark track buffer performance as we might be able to duplicate that performance with this kind of enhancement.

My plan is to have system load timing data for both XT (9 sector) & AT (18 sector) systems with a) no track buffer, b) classic track buffer, c) half size track buffer (AT systems only of course) and simple buffer read-ahead (one block). All the time remembering that as far as the boot/system load time goes, it's only borderline useful since most systems will be running off of HD anyway.

Now, here's the thing about buffer read-ahead: Unless we implement a simplified version of what you suggest above, so the 'ahead' block can be included in the same read operation, there is hardly going to be any advantage because we'll be missing a rotation anyway. This in turn means that the block driver needs to read more than 2 sectors per operation, which the direct driver is capable of as is. OR - I just thought of this, the track buffer just turned into a 2k buffer and problem solved: The next block is already there for readahead to get and gets read before something else purges it (unless the read-ahead request ends up behind something else in the request-queue of course). Maybe that's the easy way to find if there is any real meaning to this exercise.

Yes, we need to think about it so we don't waste (too much) time.

Mellvik commented 3 days ago

These floppies have 9 sectors per track, I missed that - of course - I now see that a 4.5k track buffer makes good sense, especially for the slowest (XT) systems. It would be interesting to learn how well a 50% track buffer works for larger floppies.

Forgetting everything else experimental and benchmarking going on which may change the track buffer to something else, the easy way to optimize this is to let CONFIG_HW_PCXT force the size of the track buffer to 9 sectors. 4.5k bytes ...

ghaerr commented 3 days ago

Great discussion. I'm reminded that systems designers can't guess each user's needs all the time, and keeping compile or dynamic options for system operation is sounding like a good idea in this case. In summary, it seems that there are a couple important cases that need special option consideration and benchmarking:

(A) floppy only systems
(B) specially configured higher performance for systems with only 360k/720k floppies
(C) floppy boot then HD operation systems

Possible (combinations of) options include:

No track caching
9 sector track caching
18 sector track caching
Single block read-ahead (file offset vs next block number)
Multi-block read-ahead (consecutive sectors only, no cylinder increments)
Wait for complete vs early I/O return for each of the above needs consideration

It would seem that testing on as many scenarios as possible before heavy development would be quite beneficial. If CONFIG_TRACK_CACHE took a sector count parameter rather than y/n that would be quite easy and useful before testing.

Single block readahead needs discussion as to whether we're reading the next file block or next disk logical, but I would assume we're talking a separate I/O request, so this is not the same as a 4-sector (2k) track cache.

Multi-block reads require significant R&D but it might be interesting to write a routine in sync_buffers that tries to just count buffers that might be contiguous, to see whether this even makes sense without having to purge most the buffers in order to do such a thing (thus invalidating most of its usefulness, I would think).

In general, we want as little wait-for-complete as possible, for both track caching as well as read-ahead. This makes the track cache and driver code more complicated to do it right, but IMO would definitely help speed up operations.

half size track buffer (AT systems only of course)

Why AT only for half-track? I thought half-track cache applies just to any 360/720k system?

ghaerr commented 3 days ago

(A) floppy only systems (B) specially configured higher performance for systems with only 360k/720k floppies (C) floppy boot then HD operation systems

It seems to me that (B) should be most easily handled with a CONFIG_TRACK_CACHE=n and let the user configure for how much low memory to spend on caching. That leaves (C), for which you've said most of the performance enhancement comes only at boot, and (A), which wants the same boot speed increase but also would benefit primarily only from reading large executables.

So are we especially worried about early XT systems here? Or floppy performance in general? Can't early XT systems (B) performance be most easily solved by track caching, or maybe single block readahead? This needs to be measured somehow.

There's a lot of work adding multi-block readahead vs just using a track cache. Perhaps early-release track cache, where the lower level releases the I/O before the whole track is read would help? Probably not worth the effort, as one sector or the whole track is read in 200ms (although early release could halve that).

I'm trying to see whether a divide-and-conquer approach makes sense here, dividing off early XT systems from the normal TLVC/ELKS operations. It would seem to me we're not really trying to build a super-fast XT, or are we? TLVC maybe, but probably not ELKS.

Of course then there's the issue of how much RAM should/can be used to effect fast operation, which can really be pumped up with more EXT buffers and certainly XMS if the system has it. And the real task load on the system: games vs networking, and when does the kernel run out of space? If more EXT buffers (24 in your case) seems to run a reasonable system, perhaps trying to crunch down the sizes of other networking or sysutil programs makes sense. I need more input on this business of which type systems A-C or more we are considering. It probably needs expanding to include amount of RAM available in order to classify systems better for our proposed analysis.

ghaerr commented 3 days ago

here's the thing about buffer read-ahead: Unless we implement a simplified version of what you suggest above, so the 'ahead' block can be included in the same read operation, there is hardly going to be any advantage because we'll be missing a rotation anyway. This in turn means that the block driver needs to read more than 2 sectors per operation, which the direct driver is capable of as is. OR - I just thought of this, the track buffer just turned into a 2k buffer and problem solved: The next block is already there for readahead to get and gets read before something else purges it

Your whole train of thought here spells out the theory and reasoning for track caching (which is distinct from file read-ahead): instead of thinking of files, blocks, etc we concern ourselves with the physical media only: when a read request occurs on super-slow media, read that sector and everything else in front of it on the same track, as a way of radically improving performance in a single disk rotation. Reading just 2 sectors when there are 5 more on the same track won't likely improve performance for larger executables. And there is no way to "early-return" from a multi-sector I/O request. All of this is why track caching was born.

If you're thinking that extended track reading is in fact slowing things down, then I would suggest we analyze the average program load size in blocks not bytes - this may give input as to the average number of blocks actually needing reading in an average track read. Adding some printk's in fs/exec.c and keeping a running total might help for that.

Of course, all that then gets into the need to look at the floppy layout from mfs - I'm sure it is extremely important as to which executables are where, and full knowledge of all files opened (use DEBUG_FILE to see that) and where their inodes and data are on floppy (use fsck -lvv for that).

Mellvik commented 3 days ago

here's the thing about buffer read-ahead: Unless we implement a simplified version of what you suggest above, so the 'ahead' block can be included in the same read operation, there is hardly going to be any advantage because we'll be missing a rotation anyway. This in turn means that the block driver needs to read more than 2 sectors per operation, which the direct driver is capable of as is. OR - I just thought of this, the track buffer just turned into a 2k buffer and problem solved: The next block is already there for readahead to get and gets read before something else purges it

Your whole train of thought here spells out the theory and reasoning for track caching (which is distinct from file read-ahead): instead of thinking of files, blocks, etc we concern ourselves with the physical media only: when a read request occurs on super-slow media, read that sector and everything else in front of it on the same track, as a way of radically improving performance in a single disk rotation. Reading just 2 sectors when there are 5 more on the same track won't likely improve performance for larger executables. And there is no way to "early-return" from a multi-sector I/O request. All of this is why track caching was born.

Yes, this brings us back to where we started. And neither the arguments nor the facts have changed, but we are a lot wiser as to what options are available. It's a fact that for sequential reading/writing, track buffering is great. What I pointed out when we started was that with the exception of system startup, that is, on a running file system, such sequential reading is rarely the case (and we've discussed why).

This is an observation of fact from practical use of a system booted and running off of floppy for weeks. We're often waiting 200+ms for something that could/should take 50ms, that's a factor 4 slowdown. For a faster systems (AT and up), the delay is even higher percentagewise. Real (although very simple) measurements indicate there is minimal difference between track buffer and no track buffer on a running XT, which sent our discussion in a slightly different direction: How to spend limited resources (RAM this time, instead of time).

File read-ahead was never really on the table AFAIK, I'm assuming it's too complicated but admittedly haven't looked into it.

Given the fact that this will always be a very usage dependent issue, it may make sense to a) keep it very simple and b) make things configurable - something we've also discussed. I like your idea about making the size of the track buffer menuconfig-urable (in fact I'm already using a config that takes the track buffer down to 4.5k if CONFIG_HW_XT is defined). Good for testing and good for flexibility. Like testing the idea above about readahead.

Also, a less-than-track-size track buffer would/should enforce an interesting change - that the filling of the buffer starts at the requested sector, not the start of the track. I'm thinking this will reduce the amount of time spent reading sectors never used. Like, if on a 1.44M drive the requested sector is 10, the likely caching benefit is in sectors 12 to 18 while we're spending 100ms reading sectors 1-9). This fits well with the idea of making the buffer compile time configurable, effectively turning it into a sector cache instead of a track buffer. Using LBA addressing would make the driver logic even simpler than today.

With this setup, experimentation with optimal sizes for whatever the application would be easy, including testing whether a buffer read-ahead has any merit. Even enabling a variant of today's full track buffer would be easy using the 'new' logick and sufficient buffer size. Off the bat, the simplicity is appealing - easy to test, easy to discard if it turns out to be useless - like the full cylinder buffer.

ghaerr commented 3 days ago

The filling of the buffer starts at the requested sector, not the start of the track.

Sorry for the confusion - ELKS has always started at the requested sector to end of track. I didn't realize that the DF driver did not do that. Yes, I would recommend that change to the DF driver to speed things up.

Real (although very simple) measurements indicate there is minimal difference between track buffer and no track buffer on a running XT

That is great information - but didn't you also say that loading/running large(r) executables was, in fact, quite a bit slower when track cache was not enabled?

My summaries two posts above were written with the above two thoughts in mind.

experimentation with optimal sizes for whatever the application would be easy

I added a new system control facility that allows for changing kernel settings on the fly. This mechanism could be used to enble, disable or change the size of the track buffer during runtime for testing without rebooting, FYI.

it may make sense to a) keep it very simple and b) make things configurable

Agreed. I'm leaning towards adding a sector-size configurable track cache, which solves (B) systems above, but by default keeps track caching enabled for floppy based systems for (A) and (C). This track cache always reads from the requested sector to end of track when enabled; the DF driver will need modifications. It seems we need track cache in order to speed up reading executables, since in non-XMS operation or smaller (24k) EXT buffers, its unlikely those bits are buffered.

Mellvik commented 2 days ago

experimentation with optimal sizes for whatever the application would be easy

I added a new system control facility that allows for changing kernel settings on the fly. This mechanism could be used to enble, disable or change the size of the track buffer during runtime for testing without rebooting, FYI.

thanks, I didn't know that. On my wishlist for a long time, but very far down my list. I'll pick that up immediately.

it may make sense to a) keep it very simple and b) make things configurable

Agreed. I'm leaning towards adding a sector-size configurable track cache, which solves (B) systems above, but by default keeps track caching enabled for floppy based systems for (A) and (C). This track cache always reads from the requested sector to end of track when enabled; the DF driver will need modifications. It seems we need track cache in order to speed up reading executables, since in non-XMS operation or smaller (24k) EXT buffers, its unlikely those bits are buffered.

Mellvik commented 2 days ago

Hmmmm, safari on my ipad is acting up this evening, apologies for the noise.

it may make sense to a) keep it very simple and b) make things configurable

Agreed. I'm leaning towards adding a sector-size configurable track cache, which solves (B) systems above, but by default keeps track caching enabled for floppy based systems for (A) and (C). This track cache always reads from the requested sector to end of track when enabled; the DF driver will need modifications. It seems we need track cache in order to speed up reading executables, since in non-XMS operation or smaller (24k) EXT buffers, its unlikely those bits are buffered.

I'm implementing part of the 'sector cache' now - for testing and measurements. Then we can take it from there. Sysctl will speed that up, thanks.

BTW - congrats with the elks release, an impressive list of improvements. I need to take a closer look at many of those!

ghaerr commented 2 days ago

all that then gets into the need to look at the floppy layout from mfs - I'm sure it is extremely important as to which executables are where

Here's a quick analysis of the 360k MINIX boot floppy on ELKS, to give you a feel for what's happening. More coming after I've had a chance to look at inode block reads as well. This will allow us to understand more exactly what the drive is doing with seeks and possible track or half-track caching at boot. (*)=block reread, not counting inode buffering yet.

        Files opened at boot (32 inodes per block):
        Name ---------------------- Inode   Block       Comments ----------
            / (superblock)          -       1
            /                       1       8           All dirs need inode read first
            /dev                    64      343         Big seek, create first
            /dev/console            105     -
            /bin                    3       10
            /bin/init               5       12-17 (6)
                /etc                47      228
                /etc/inittab        57      239
                /dev/console        *105    -
            /bin/sh                 25      91-99 (7)
                /etc/rc.sys         50      231
                /etc/profile        52      234
                    /bin/clock      45      225-226 (2)
                    /etc/mount.cfg  48      229
                    /bin/date       19      76-77 (2)
            /etc/inittab            57      239
                /dev/tty1           96      -
                /bin/getty          33      174-178 (5)
                    /etc/issue      53      235
                    /dev/ttyS0      106     -
                /bin/getty          *33     174-178 (5)
                    /etc/issue      *53     235
                    /etc/hostname   -                       (not present)
                    /etc/hostname   *-
                /bin/login          44      219-223 (5)
                    /etc/passwd     59      241
                    /bin/sh         *25     91-99 (7)
                        /etc/profile   *52  234
                        /root          62   341
                        /root/.profile -                    (not present)
                /etc/hostname        -                      (on logout)

First look seems to show not a lot of consecutive sectors being read, and that reserving particular blocks or grouping programs together might help. Certainly large seeks can be trimmed down. With rotations at 200ms, how long does an average seek take?

ghaerr commented 2 days ago

More analysis on the business of filesystem file allocations into blocks, floppy disk sectors and track caches, for the purpose of understanding how these interact at a detailed level on a 360k floppy. It turns out this gets into a pretty deep rabbit hole, fast.

TL;DR: Running a 1K block filesystem on floppies with an odd number (360k = 9) sectors pretty much runs worse case, with blocks split across tracks after every 4th block - this ends up invalidating the track cache with the next half block read on the subsequent track, and issues a full track read for the last sector. "Consecutive" blocks doesn't mean speed unless the read requirement is four blocks or less and doesn't split across tracks.

Long version and backstory: Here's the layout of the first 9 blocks on a typically imaged 360k MINIX floppy:

MINIX layout
        Inode   Block 
                0               boot block
                1               super block
                2               imap used block bitvector (127 max)
                3               zmap used block bitvector (360 max)
        0-31    4               first inode block, 32 inodes/block
        32-63   5
        64-95   6
        96-127  7
                8       root directory (first data block)

A more detailed view of the same showing blocks and their CHS for the first 100k disk:

360k (9 sectors/track = 4.5 blocks/track):
                Block   CHS
0K              0       0,0,1-2         Boot block
                1       0,0,3-4         Super block
                2       0,0,5-6         Imap
                3       0,0,7-8         Zmap
4K              4       0,0,9-0,1,1     **Block 4 (1st inode) crosses track, same cyl
5K              5       0,1,2-3
                6       0,1,4-5
                7       0,1,6-7
                8       0,1,8-9         Root directory
9K              9-12    1,0,x
13K             13      1,0/1,x         Block 13 crosses track
14K             14-17   1,1,x
18K             18-26   2,x,x           Block 22 crosses track
27K             27-35   3,x,x           " 31
36K             36-44   4,x,x           " 40
45K             45-53   5,x,x           " 49
54K             54-62   6,x,x           " 58
63K             63-71   7,x,x           " 67
72K             72-80   8,x,x           " 76
81K             81-89   9,x,x           " 85
90K             90-98   10,x,x          " 94
99K             99-107  11,x,x          " 103

Notice the first inode block is split across tracks. More on that later. Every ninth block is split according to the formula (block - 4) % 9 == 0

Slow-reading blocks, and fast-reading track caches:

Slow blocks (1K):   4,13,22,31,40,49,58,67,76,85,94,103
                    Second half of slow block will invalidate cache and read another 4.5K
Fast caches (4K):   0-3, 5-8, 9-12, 14-17, 18-21, 23-26, 27-30, 32-35, 36-39

Lets now look at the blocks actually being used on a 360k floppy, using fsck -lvvf (these are similar to the list above, except this is a non-compressed floppy so the block numbers are different):

Forcing filesystem check on /dev/fd0.
    1 040755 10 /                          192 0/0 Z 8
    2 040755  2 /home                       32 0/0 Z 9
    3 040755  2 /bin                       608 0/0 Z 10
    4 100755  1 /bin/cat                   928 0/0 Z 11
    5 100755  1 /bin/init                 7504 0/0 Z 12 13 14 15 16 17 18 20
    6 100755  1 /bin/fsck                11872 0/0 Z 21 22 23 24 25 26 27 29
    7 100755  1 /bin/fdisk                7952 0/0 Z 34 35 36 37 38 39 40 42
    8 100755  1 /bin/net                  1926 0/0 Z 43 44
    9 100755  1 /bin/shutdown             1408 0/0 Z 45 46
   10 100755  1 /bin/df                   5408 0/0 Z 47 48 49 50 51 52
   11 100755  1 /bin/pwd                  1936 0/0 Z 53 54
   12 100755  1 /bin/mkfs                 4848 0/0 Z 55 56 57 58 59
   13 100755  1 /bin/printenv              400 0/0 Z 60
   14 100755  1 /bin/uname                 480 0/0 Z 61
   15 100755  1 /bin/makeboot             6976 0/0 Z 62 63 64 65 66 67 68
   16 100755  1 /bin/date                 2768 0/0 Z 69 70 71
   17 100755  1 /bin/setup                 992 0/0 Z 72
   18 100755  1 /bin/grep                 9552 0/0 Z 73 74 75 76 77 78 79 81
   19 100755  1 /bin/more                 3072 0/0 Z 84 85 86
   20 100755  1 /bin/umount                928 0/0 Z 87
   21 100755  1 /bin/sh                  48608 0/0 Z 88 89 90 91 92 93 94 96
   22 100755  1 /bin/edit                21376 0/0 Z 137 138 139 140 141 142 143 145
   23 100755  1 /bin/sys                  3072 0/0 Z 159 160 161
   24 100755  1 /bin/ps                   5408 0/0 Z 162 163 164 165 166 167
   25 100755  1 /bin/mknod                 656 0/0 Z 168
   26 100755  1 /bin/mkdir                 656 0/0 Z 169
   27 100755  1 /bin/mount                3824 0/0 Z 170 171 172 173
   28 100755  1 /bin/getty                5872 0/0 Z 174 175 176 177 178 179
   29 100755  1 /bin/rmdir                 592 0/0 Z 180
   30 100755  1 /bin/mv                   3488 0/0 Z 181 182 183 184
   31 100755  1 /bin/mkfat                4288 0/0 Z 185 186 187 188 189
   32 100755  1 /bin/ls                  10160 0/0 Z 190 191 192 193 194 195 196 198
   33 100755  1 /bin/cp                   6208 0/0 Z 201 202 203 204 205 206 207
   34 100755  1 /bin/sync                  144 0/0 Z 208
   35 100755  1 /bin/meminfo              4112 0/0 Z 209 210 211 212 213
   36 100755  1 /bin/chmod                 960 0/0 Z 214
   37 100755  1 /bin/rm                   4128 0/0 Z 215 216 217 218 219
   38 100755  1 /bin/login                6304 0/0 Z 220 221 222 223 224 225 226
   39 100755  1 /bin/clock                3792 0/0 Z 227 228 229 230
   40 100644  1 /bootopts                  472 0/0 Z 231
   41 040755  2 /etc                       224 0/0 Z 232
   42 100644  1 /etc/mount.cfg            1004 0/0 Z 233
   43 100644  1 /etc/group                 355 0/0 Z 234
   44 100644  1 /etc/rc.sys                666 0/0 Z 235
   45 100644  1 /etc/perror               1352 0/0 Z 236 237
   46 100644  1 /etc/profile               263 0/0 Z 238
   47 100644  1 /etc/issue                  13 0/0 Z 239
   48 100644  1 /etc/termcap               942 0/0 Z 240
   49 100644  1 /etc/net.cfg               992 0/0 Z 241
   50 100644  1 /etc/hosts                  89 0/0 Z 242
   51 100644  1 /etc/inittab               533 0/0 Z 243
...

What happens is that in order to traverse a path to open a file, like /dev/console is as follows: the root inode is already open, so the root directory is read, looking for 'dev', then its inode read to find the dev directory, then that inode read to find console, etc:

Name ---------------------- Inode   Block           Comments ----------
    / (superblock)          -       1
    /                       1       8               Read root dir
    /dev                    58      5,351           Dev inode > 31, dir block > 9
    /dev/console            99      7               Console inode > 96, should be < 32

    /bin                    3       8,4,10          Good inode and dir block
    /bin/init               5       4,12-18,20,19   Block 19 is indirect block
        /etc                41      8,5,232         Etc inode > 31, dir block > 11
        /etc/inittab        51      243

Tons of blocks being read and they're not close to each other. This invalidates the cache a lot and also seeks needlessly. We might be better off allocating 9-18k to extra EXT buffers.

Notice in this boot sequence of the kernel opening /dev/console, and running /bin/init, which opens /etc/inittab, all sorts of inefficiencies are found: The /dev and /etc inodes are > 31, which means that another whole block (and track invalidate/cache read) has to be done just to traverse the directory. The block allocated for each directory is also high, when it ought to be low, causing disk seeks. Execing /bin/init is larger than 7K, so a MINIX indirect block has to be allocated and read, just to read one more (final) block. Its a mess.

Here's what it could be, should mfs or the image builder be enhanced:

Should be:
    /                       1       8
    /dev                    2       9
    /etc                    3       10
    /bin                    4       11-12
    /root                   5       13

Summary All this is way too complicated to follow, I know. I have a kernel with debug statements to show exactly which blocks and inodes are accessed through system startup, along with track cache fillups, cache hits, buffer hits, etc. I'll report that later.

I can see how with only 4K bytes before having to split a block across a track, both invalidating the previous track and causing a full track read, the likely average real cache savings may be half of 4.5K, or around 2K. Pretty bad.

Regardless of track caching, the following would need to be addressed for a much faster boot:

The first (low numbered) inode block should not be split across tracks. This is a big problem not easily changed with the MINIX filesystem format. mfs could initially allocate inode numbers above 31 to solve it.
/dev, /etc, /bin and /root should be allocated first and be in the same inode block, with their directory blocks on the same track if possible.
mfs should allocate file data by skipping split block numbers until the 40k of disk is used.
Fast loading executables need to be 7k or less.

Handling split blocks begs another big question: Can I/O be scheduled for the subsequent track immediately after a track read without waiting another 200ms, that is, without another rotation?

Most floppies are formatted with a logical sector number that is not the physical order, right? This might allow a subsequent I/O to be scheduled and not miss sector 1. But does that mean that a "track read" doesn't actually read the sectors in physical 1-9 order? More research needed on the physical media side of things.

Mellvik commented 2 days ago

Wow, that's shipload of stuff to assimilate, isn't it? Very interesting indeed, even after having read it only once. Let me do that again to get some more. In the meanwhile:

Handling split blocks begs another big question: Can I/O be scheduled for the subsequent track immediately after a track read without waiting another 200ms, that is, without another rotation?

No, that won't work. OTOH, when the autoseek bit is set (it always is, and it doesn't mean autoseek, but 'continue reading/writing with the next head), you can read sect/head 8/0 and 0/1 in one operation, which the suggested 'sector cache' will do.

Most floppies are formatted with a logical sector number that is not the physical order, right?

No, the sectors are numbered sequentially - always, unless we're talking special purpose systems. Simply for compatibility. Sector skew may be created at format time, but it's messy and (obviously) requires a special formatter. OTOH, such skew can easily (well ...) be added into the lower levels of the file system, which would enable subsequent reads to reach fs-sequential blocks before they pass by the head. This is sort of messy too, but very flexible and the actuall skew could be saved in the superblock to ensure compatibility. UCSD pascal did this back in the day, I remember my facial expression when the setup went through the (8") disk performance test to get the optimal skew. Wow, what a difference! Which touches an important point: Drives are different and the optimal skew will differ from one drive to the next - not within a track but as soon as step/seek is involved. IOW - this one is too messy to.

BTW - I was just running tests on startup time off of a 1.44M floppy using BIOS IO on the 286. I haven't run BIOS IO for a long time, and first impression - it's very slow compared to the directed driver. That aside, the numbers are (kernel start to Getty running): No track cache: 3186 jiffies, 4.5k track cache: 1890 jiffies, 9k track cache: 1773 jiffies. More numbers are coming, it will be interesting to see the same number (from exactly the same setup) using the direct driver, and to vary the cache size more.

BTWII: What time zone are you on? This is an unusual time to get a message from you?

Mellvik commented 1 day ago

@ghaerr, your numbers and analysis are intriguing - this is a great piece of work. It seems to me - as you've already suggested- that mfs trickery can make a big difference in startup speed.

while awaiting your next installment, I've turned the directfd track buffer into a working sector cache, numbers coming tomorrow. Using this cache, track-spanning blocks may be 'hidden' by multitrack reads: if sector 9, hd 0 is requested (9 SPT) and the cache flushes, the read starts a sector 9 and continues on the other side until the cache is full. By ensuring that the cache is an even number of bytes, sec9/hd0 will never be read alone.

Whether this has any bearing on performance we'll know tomorrow.

the speed difference I mentioned before between bios fd and direct fd is likey my mind playing tricks on me. Again, firm numbers tomorrow.

ghaerr commented 1 day ago

What time zone are you on?

Yes, I am in the eurozone traveling! A bit jet lagged at the moment, so the ongoing analysis may get a bit delayed lol.

you can read sect/head 8/0 and 0/1 in one operation, which the suggested 'sector cache' will do.

Very interesting - is that a NEC 765 (early FDC) capability, or only on later chipsets?

What command bit specifically enables/disables this?

when the autoseek bit is set (it always is, and it doesn't mean autoseek, but 'continue reading/writing with the next head)

I see. Is there a separate bit for actual seeking, or is the general capability only that of continuing the multi sector read/write operation onto the next head (specifically, from 0 to 1 only I would imagine)?

This brings up some big differences between the BIOS driver (on ELKS at least) and the DF driver: long ago, it was determined that ELKS required using a DDPT change in order for the BIOS not to inadvertently "seek" past the end sector during an I/O request, in the case the BIOS thought differently about the current floppy format than ELKS. IIRC it was also decided that it was too risky to assume that the BIOS or FDC hardware supported this capability, so, although the DDPT is always used to stop head/cylinder auto-advancement, the end sector is always the last on the track, same head, guarantee a read to end of track when BIOS/ELKS differ on floppy type.

With Direct Floppy driver, I think the driver on both ELKS and TLVC may use autoseek, as none of your cache code was changed for ELKS. This changes the results of my analysis heavily, as split blocks may not be a problem.

By ensuring that the cache is an even number of bytes, sec9/hd0 will never be read alone.

That's a really great point, so we need to ensure that the track cache has an even number of sectors (which is not currently the case if set to 9 from 18 sectors)!

When looking through the BIOS driver code, I noticed that the capability of "reading a full track" vs "reading requested sector to end of track" is optionable via a compile-time FULL_TRACK flag (default OFF). I'm thinking as a result of this discussion that there may want to be a AUTO_SEEK flag as well which would allow for head-advance autoseek to be performed for those with BIOS/FDCs that support it. This would likely have to be default OFF due to the inability to boot otherwise. This option would only be for the BIOS driver, unless there are FDCs that don't support auto-seek, where it might be needed for DF.

ghaerr commented 1 day ago

No, the sectors are numbered sequentially - always

I see. I was thinking the DOS format software might have had capabilities to number them differently, which of course would change our analysis by quite a bit.

such skew can easily (well ...) be added into the lower levels of the file system

That's a very interesting idea. Probably overkill and complicated, given that mfs can probably solve most of the biggest problems. (And no, I don't want to write an ELKS/TVLC defrag utility!).

No track cache: 3186 jiffies, 4.5k track cache: 1890 jiffies, 9k track cache: 1773 jiffies.

Wow! That seems to show a big difference in boot time between track cache and none, while very little difference between cache sizes. I will work to better formalize my test kernel that records cache reads/hits/misses etc and look forward to your results. It would be interesting to learn whether the autoseek mechanism changes timing much when on vs off as well.

I haven't run BIOS IO for a long time, and first impression - it's very slow compared to the directed driver. T

Are you talking boot times, or overall? During boot, there's not much room to allow other processes to run async, except for net start related stuff. I'd be interested in hearing more about this. We still default to the BIOS driver in ELKS, probably for compatibility reasons. Perhaps now that v0.8.0 is out that should be changed, especially if we get more fine-grained control over track caching, no DDPTs and auto seek behavior, all resulting in higher speed.

BTW, I plan on producing an mfs enhancement as a result of this. I don't have a cycle-correct emulator like MartyPC running on macOS (due to them not having a binary download and I don't have Rust installed, etc) so there isn't an easy way for me to measure floppy delays. I was thinking of adding artificial floppy delays as an option for QEMU, what do you think of that approach? It would be nice to somehow measure differing floppy image formats without having to use real hardware.

Mellvik commented 1 day ago

you can read sect/head 8/0 and 0/1 in one operation, which the suggested 'sector cache' will do.

Very interesting - is that a NEC 765 (early FDC) capability, or only on later chipsets?

This has been in the 765 from the beginning.

What command bit specifically enables/disables this?

I see. Is there a separate bit for actual seeking, or is the general capability only that of continuing the multi sector read/write operation onto the next head (specifically, from 0 to 1 only I would imagine)?

There is no autoseek as in moving the head unless we go to the later gen chips (8207x). It's being turned on in the directed driver, but there were some hiccups with QEMU - off hand I don't recall how we handled that.

This brings up some big differences between the BIOS driver (on ELKS at least) and the DF driver: long ago, it was determined that ELKS required using a DDPT change in order for the BIOS not to inadvertently "seek" past the end sector during an I/O request, in the case the BIOS thought differently about the current floppy format than ELKS. IIRC it was also decided that it was too risky to assume that the BIOS or FDC hardware supported this capability, so, although the DDPT is always used to stop head/cylinder auto-advancement, the end sector is always the last on the track, same head, guarantee a read to end of track when BIOS/ELKS differ on floppy type.

Yes, this does sound vaguely familiar. I would have to go back to the thread to actually recall the conclusion. But given that the FDC has always had the ability to read a full cylinder in one, the logical (!) conclusion is that that capability is also in the BIOS.

With Direct Floppy driver, I think the driver on both ELKS and TLVC may use autoseek, as none of your cache code was changed for ELKS. This changes the results of my analysis heavily, as split blocks may not be a problem.

I believe this is correct. I remember changing the track cache code in the driver to always read an extra sector if the sector count (per track) was odd and head = 0. That's probably in the version you ported across. The new version is a lot simpler and has a lot more flexibility. [It's amazing how thinking differently sometimes changes (simplifies) the entire scenario.] Reading now always starts at the requested sector and fills the cache from that point to the end of the cylinder or the end of the buffer. Not good for small floppies and a big (9k) buffer, but good for testing. I'm adding a flexible (sysctl) setting for the usable cache size. It's going to be really interesting to see how that works out, incrementing by 1k from 1 and up - in different drive types, measuring boot time (actually system startup time to be exact).

Mellvik commented 1 day ago

No track cache: 3186 jiffies, 4.5k track cache: 1890 jiffies, 9k track cache: 1773 jiffies.

Wow! That seems to show a big difference in boot time between track cache and none, while very little difference between cache sizes. I will work to better formalize my test kernel that records cache reads/hits/misses etc and look forward to your results. It would be interesting to learn whether the autoseek mechanism changes timing much when on vs off as well.

The timing (that is, the minimal difference between 4.5 and 9k) surprised me too! Makes me even more curious about the upcoming testing. I suspect (I know I shouldn't) that the optimal cache size (boot and general use) may be 4k... I'm not sure it makes sense to spend time testing without the 'autoseek' unless you need some numbers to compare with the BIOS variant.

I haven't run BIOS IO for a long time, and first impression - it's very slow compared to the directed driver. T

Are you talking boot times, or overall? During boot, there's not much room to allow other processes to run async, except for net start related stuff. I'd be interested in hearing more about this. We still default to the BIOS driver in ELKS, probably for compatibility reasons. Perhaps now that v0.8.0 is out that should be changed, especially if we get more fine-grained control over track caching, no DDPTs and auto seek behavior, all resulting in higher speed.

I'm retracting that - pending more testing. The first test with the new sector cache showed only marginal difference (system load time) compared to the BIOS fd number. As to ELKS and the direct driver, it sounds like a good idea - but then again, I'm not neutral when having that opinion. There is no question that using the BIOS driver saves RAM. OTOH I tell you, getting rid of theDDPT stuff is a real blessing.

BTW, I plan on producing an mfs enhancement as a result of this.

This is GREAT! I'm very much looking forward to it and it sounds like you have a perfect debug setup for it.

I don't have a cycle-correct emulator like MartyPC running on macOS (due to them not having a binary download and I don't have Rust installed, etc) so there isn't an easy way for me to measure floppy delays. I was thinking of adding artificial floppy delays as an option for QEMU, what do you think of that approach? It would be nice to somehow measure differing floppy image formats without having to use real hardware.

The experience with QEMU and timing indicates (to me) that it's never going to be very dependable. I'd suggest running 86Box instead, which seems to me to be maybe not cycle-correct, I don't know about that, ut speed correct - which may imply the first. And there is a 'manual' for it in ELKS already IIRC.

Mellvik commented 2 hours ago

@ghaerr, some interesting numbers - not much screen time these last few days plus a couple of bug hunts in the new sector cache code and some hardware issues. Anyway, the numbers are 'startup times' - from just after mount_root() in init/main.c:

    mount_root();       

#ifdef BOOT_TIMER       /* temporary, works with similar printout in getty */
    printk("[%lu]", jiffies);   /* for measuring startup time */
#endif

... to the startup of the first getty. As I was reminded when the numbers didn't add up, the environment is 'fragile' in a timing sense. Editing (i.e. moving) or replacing a file that is part of startup on the floppy, say fsck or mount.cfg, may cause big changing in timing. Just demonstrating the importance of what you've been looking at recently.

Anyway, more testing is needed - this is from the 286/12.5MHz compaq, booting off of a 1.44M floppy. The buffer available (DMASEGSZ) is 9k, I'm changing the size of the sector cache via bootopts not having implemented sysctlyet, always using full blocks - 1k thru 9k.

The most interesting discovery is that 6k and 9k delivers about the same performance (repeated testing shows 6k ahead more often than not)! 8k comes in 5% behind these two. 6K (and 8k and 9k) cache is more than double the speed of no cache. More numbers are coming, except for 1 and 9 I haven tested odd block sizes yet. [EDIT: odd counts added - and we have a new winner :-) -- 7 blocks! And some interesting variations, such as 3 blocks is faster than 4. Time will show how 'normal use' behaves with varying block sizes.]

cache  |   startup time (jiffies) 
size   |  [includes new startup, fsck (no check) on root, getty ]
-------------------------------------------------------------------
1      |    3559   (-83%)
2      |    2763   (-42%)
3      |    2246   (-16%)
4      |    2284   (-17%)
5      |    2108   (-8%)
6      |    2011   (-3%)
7      |    1944
8      |    2126   (-9%)
9      |    2048   (-5%)
NC     |    3559   (-83%)
------------------------------------------------------------------
NC -> No CONFIG_TRACK_CACHE

It will be really interesting to see how this compares with the 360 and 720 drives - and the 1.2M of course.