SvarDOS / edrdos

Enhanced DR-DOS kernel and command interpreter ported to JWasm and OpenWatcom C
http://svardos.org/

Performance issue when reading small blocks of data #116

Closed: mateuszviste closed this 6 days ago

mateuszviste commented 1 week ago

This has been discovered during https://github.com/SvarDOS/bugz/issues/127 - here is a bootable floppy with a set of programs named tim2.exe, tim4.exe, tim8.exe, ... tim128.exe. Each of these programs does the same thing: it loads a data block from the FDISK.LNG file and measures how long it took. The difference is that each uses a differently sized buffer to perform the DOS reads: tim2 uses a 2-byte buffer, tim4 a 4-byte buffer, etc.
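The shape of such a benchmark can be sketched in Python (a stand-in only: `read_in_chunks` and the unbuffered `open` are illustrative, while the actual tim* tools issue raw DOS INT 21h/3Fh reads):

```python
import time

def read_in_chunks(path, bufsize):
    """Read the whole file bufsize bytes at a time; return (total_bytes, seconds)."""
    total = 0
    t0 = time.perf_counter()
    # buffering=0 disables user-space buffering, so every read() hits the OS,
    # the way each DOS int 21h/3Fh call hits the kernel in the tim* tools
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - t0

if __name__ == "__main__":
    try:
        for bufsize in (2, 4, 8, 16, 32, 64, 128):
            total, secs = read_in_chunks("FDISK.LNG", bufsize)
            print(f"tim{bufsize}: {total} bytes in {secs:.3f} s")
    except FileNotFoundError:
        pass  # FDISK.LNG only exists on the test floppy
```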

All tests have been performed under 86Box with a PC XT running at 4.77 MHz and 256K RAM.

Bootable 360K test floppy: test.zip

The test floppy has the FreeDOS kernel on it. If it is replaced with the EDR kernel, read performance becomes about 2x worse (more so for small buffers, less for bigger buffers).

To perform the test you need to run "timX pl" where X is the buffer size (2,4,8,16,32,64,128).

Example:

tim4 pl on FreeDOS takes 23 seconds:

[screenshot]

Same program on EDR-DOS takes 88 seconds:

[screenshot]

boeckmann commented 1 week ago

This should be the relevant call stack involved:

func3f
  fdos_handle
    fdos_crit
      fdos_nocrit
        fdos_entry
          fdos_read
            fdosrw
              deblock_rw
                locate_buffer

Quite some room for something to go wrong. On reads smaller than a sector the EDR kernel reads into one of its buffers and does a "deblocking", meaning that it copies the relevant part of the read sector from the buffer to user space. The first thing we should check is whether locate_buffer does indeed return already-read sector data or whether this fails for some reason.
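The deblocking path described above can be modeled in a few lines of Python (a toy sketch, not the kernel's actual code; `DeblockingReader`, `_locate_buffer` and the counters are illustrative):

```python
SECTOR_SIZE = 512

class DeblockingReader:
    """Toy model of sub-sector ("deblocked") reads: a sector is read into a
    kernel buffer once, and later sub-sector reads are served by copying the
    relevant slice out of that buffer (what locate_buffer should enable)."""

    def __init__(self, disk: bytes):
        self.disk = disk
        self.buf = None          # cached sector contents
        self.buf_sector = None   # which sector is cached
        self.physical_reads = 0  # how often we actually hit the "disk"

    def _locate_buffer(self, sector):
        if self.buf_sector != sector:  # cache miss: do a physical read
            start = sector * SECTOR_SIZE
            self.buf = self.disk[start:start + SECTOR_SIZE]
            self.buf_sector = sector
            self.physical_reads += 1
        return self.buf

    def read(self, pos, size):
        out = bytearray()
        while size > 0:
            sector, off = divmod(pos, SECTOR_SIZE)
            n = min(size, SECTOR_SIZE - off)
            out += self._locate_buffer(sector)[off:off + n]  # deblock: copy slice
            pos += n
            size -= n
        return bytes(out)
```

If the cache lookup works, many small reads of the same sector should cost only one physical read; if it fails, every 2-byte read triggers a sector read, which would match the observed slowdown.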

boeckmann commented 1 week ago

@mateuszviste can you please do a 512 byte read comparison? EDR should then use a different code path, doing direct reads to user space buffer.

mateuszviste commented 1 week ago

tim512 says:

EDR = 1.8s

[screenshot]

FreeDOS = 1.6s

[screenshot]

FreeDOS is still better, but only by some 10% now.

attached, just in case: tim512.zip

boeckmann commented 1 week ago

Thanks, I would not mind the 10% difference for the time being. So this strongly indicates that it is indeed some buffering / deblocking issue. I will try to find out what's going on with the deblocking / buffering code...

boeckmann commented 1 week ago

The block cache does not seem to cause this slowdown. I have not compared the sector numbers of the sectors read, but when loading the PL translation, fewer than ~25 sector reads are performed, and between the reads several seconds pass. 25 sectors looks like a reasonable number taking the size of the translation into account...

boeckmann commented 1 week ago

I have to think about how to profile the kernel in general. I'd love to have something like Watcom wsample / wprof or Borland tprof for profiling the kernel. But these tools do not work for the kernel.

I am considering making a special build of the kernel utilizing the RDTSC x86 instruction (Pentium+), and using it to incrementally find out how many cycles the different kernel routines take, perhaps aided by some macros. But I have never used RDTSC, so I have to make a few test runs...
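The intended measurement pattern can be sketched like this (a Python stand-in: `time.perf_counter_ns()` plays the role of RDTSC here, since the real thing would be a pair of assembly routines reading the cycle counter into EDX:EAX):

```python
import time

class TscTimer:
    """Stand-in for a read_tsc/diff_tsc macro pair: start a measurement,
    then stop it and print the elapsed count as a hex value."""

    def read_tsc(self):
        # on a Pentium+ this would be RDTSC -> EDX:EAX
        self.start = time.perf_counter_ns()

    def diff_tsc(self):
        diff = time.perf_counter_ns() - self.start
        print(f"{diff:X}")  # print the difference in hex, like the kernel routine
        return diff
```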

mateuszviste commented 1 week ago

Maybe it's too primitive, but what about adding a one-byte variable in memory (seg 50h), and making every routine of the kernel change this variable when the routine is entered? Perhaps this could be automated through some macro that overloads the call instruction. Then a TSR would run the PIT at 1000 Hz and frequently poll this shared variable, identifying what part of the kernel is being executed at any given moment. This information would be written into some array of counters, so it would be easy to see where the CPU statistically spends most of its time.
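The bookkeeping of this scheme can be modeled deterministically (a simulation only; the marker values, routine names and tick counts below are made up for illustration):

```python
from collections import Counter

# marker value each kernel routine would store at seg 50h on entry (illustrative)
MARKERS = {"fdosrw": 1, "fdrw_prepare": 2, "div64": 3, "fdrw_seek": 4}

class SamplingProfiler:
    """Model of the proposed scheme: routines store a one-byte ID in a shared
    variable; a 1000 Hz PIT handler polls it and counts what it sees."""

    def __init__(self):
        self.marker = 0
        self.histogram = Counter()

    def enter(self, routine):  # what the call-overloading macro would do
        self.marker = MARKERS[routine]

    def pit_tick(self):        # what the TSR's timer handler would do
        self.histogram[self.marker] += 1

# simulate: div64 is running during 8 of every 10 timer ticks
prof = SamplingProfiler()
schedule = ["fdrw_prepare"] + ["div64"] * 8 + ["fdosrw"]
for routine in schedule:
    prof.enter(routine)
    prof.pit_tick()
```

The histogram then shows, statistically, which routine the CPU was in when the timer fired, without any per-routine timing code.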

boeckmann commented 1 week ago

Not a bad idea, the variable at 50h. But if we start thinking about a TSR, we could also record CS:IP values and match them against a memory map of the kernel, so we know exactly which functions were called without having to manually insert code that sets the variable at 50h to different values.

And surprisingly I found TSC functionality already built into the kernel; I have to test if this works :) :

https://github.com/SvarDOS/edrdos/blob/7d3cf5610508352513c581962ca63ee0bf67a7ad/drdos/bdevio.asm#L2185-L2259

boeckmann commented 1 week ago

I figured out that on an emulated Pentium, when calling kernel function 39h (read), the routines fdrw_prepare and fdrw_seek each account for 40% of the runtime, adding up to 80%. These are called by fdosrw. This is for buffered reads with the sector already in the deblocking cache! The next step is to analyze these functions for performance bottlenecks.

boeckmann commented 1 week ago

Here are the significant TSC differences for the func39 call stack (I left out the unimportant ones):

func39:           202f
  fdosrw:         1A59
    fdrw_prepare: 0C2A
      div64:      0A4D
    fdrw_size:    0075
    deblock_rw:   0079
    fdrw_seek:    0D15
      div64:      0A4D

As you may notice, the div64 routine is, at least on the 86Box Pentium machine, slow as hell. As the 8086 does not have a 64-bit divide instruction, this is implemented in assembly. About 80% of the runtime of fdrw_seek and fdrw_prepare is due to div64. So if we want to speed things up, we should first try to improve this, if it's possible:

https://github.com/SvarDOS/edrdos/blob/7d3cf5610508352513c581962ca63ee0bf67a7ad/drdos/bdevio.asm#L1965-L2007

ecm-pushbx commented 1 week ago

I did a very naive optimisation, as I noticed that the stacks for div64 and div32 are the same: https://hg.pushbx.org/ecm/edrdos/rev/374911957af5 Please test!

It could be further optimised if the divisor is known to fit in 16 bits, with a special code path that does chained 32-by-16-bit division often enough to divide an entire 64-bit dividend.
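The suggested special case can be sketched in Python. It mirrors how a chain of x86 `DIV r/m16` instructions works: each step divides a 32-bit value (previous remainder in DX, next dividend word in AX), and the remainder carries into the next step (`div64_by_16` is an illustrative name, not the kernel routine):

```python
def div64_by_16(dividend, divisor):
    """Divide a 64-bit dividend by a 16-bit divisor using four chained
    32-by-16-bit division steps, as a DIV r/m16 chain would."""
    assert 0 < divisor < 1 << 16
    words = [(dividend >> s) & 0xFFFF for s in (48, 32, 16, 0)]  # high to low
    rem = 0
    quotient = 0
    for w in words:
        cur = (rem << 16) | w          # 32-bit value, "DX:AX"
        # rem < divisor guarantees cur // divisor fits in 16 bits (no #DE)
        quotient = (quotient << 16) | (cur // divisor)
        rem = cur % divisor
    return quotient, rem
```

Because the remainder is always smaller than the divisor, each partial quotient fits in 16 bits, so the chain never overflows the way a naive 32-bit DIV with a large quotient would.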

ecm-pushbx commented 1 week ago

Can you tell me how to use the TSC handling?

boeckmann commented 1 week ago

> I did a very naive optimisation, as I noticed that the stacks for div64 and div32 are the same: https://hg.pushbx.org/ecm/edrdos/rev/374911957af5 Please test!

SUCCESS! Down to 017A "TSC". Let's make a unit out of it :)

> Can you tell me how to use the TSC handling?

I call read_tsc to start the measurement and call diff_tsc to stop and print the diff as a hex value. The routines are in bdevio.asm. It would be awesome to have this in lDebug, in the sense that lDebug would print the TSC diff after a "step into and return".

https://github.com/SvarDOS/edrdos/blob/7d3cf5610508352513c581962ca63ee0bf67a7ad/drdos/bdevio.asm#L2185-L2259

boeckmann commented 1 week ago

Does everyone see the big smile on my face? :D

[screenshot]

ecm-pushbx commented 1 week ago

> Can you tell me how to use the TSC handling?

> I call read_tsc to start the measurement and call diff_tsc to stop and print the diff as a hex value. The routines are in bdevio.asm. Would be awesome to have this in lDebug in the sense that lDebug would print the TSC diff after a "step into and returning".

I'm still confused, which is no good for adding it to the debugger (either internal or in an extension).

The numbers in that reply, how did you obtain them? Run one function at a time, with calls to the TSC handlers before and after? So you had to edit the source 8 times?

> Here are the significant TSC differences for the func39 call stack (I left out the unimportant ones):
>
> func39:           202f
>   fdosrw:         1A59
>     fdrw_prepare: 0C2A
>       div64:      0A4D
>     fdrw_size:    0075
>     deblock_rw:   0079
>     fdrw_seek:    0D15
>       div64:      0A4D

boeckmann commented 1 week ago

> So you had to edit the source 8 times?

That is correct.

mateuszviste commented 1 week ago

> Does everyone see the big smile on my face? :D

So down from 146s to 43s. That's fantastic! Even slightly better than FreeDOS. :-D

ecm-pushbx commented 1 week ago

> Does everyone see the big smile on my face? :D

> So down from 146s to 43s. That's fantastic! Even slightly better than FreeDOS. :-D

Where did you get the 146s from? Was that the tim2 result? What is the number for FreeDOS?

boeckmann commented 1 week ago

On the machine I tested, FreeDOS takes 48.2s and EDR 43s. I think there is still room for improvement, but currently it's a pain in the ass profiling it. I have to think of a better way of testing this before making further improvements.

boeckmann commented 1 week ago

The 146s was from Mateusz's initial testing with the EDR kernel, see https://github.com/SvarDOS/bugz/issues/127#issuecomment-2388606676

boeckmann commented 6 days ago

So closing this for now, as we are not slower than FreeDOS anymore.

ecm-pushbx commented 6 days ago

Did you use tscsel at all?

boeckmann commented 6 days ago

No.

ecm-pushbx commented 5 days ago

I added a very simple tsc.eld today. You can build it with ldebug/source/eld$ ./makone.sh tsc (NASM + bash) or wait for the daily build before midnight +0200 today.

This only supports one command in two modes, and one 32-bit variable. The ELD may be installed residently using ext tsc.eld install (and uninstalled using tsc uninstall). The tsc command displays the current, 64-bit TSC in the format XXXXYYYY_ZZZZAAAA. If it is resident it also writes the lower 32 bits into the lasttsc variable. A tsc quiet command only writes into the variable, without display, and thus is only useful when the ELD is residently installed.

It should detect at install / ELD load time when no TSC is available on the machine.

It can be used in a command chain like re.replace @tsc quiet; @h lasttsc - v0; @r v0 = lasttsc; @r and then p 10 for instance will show the differences after the first step. (Not for the first step itself.)

However, this is barely useful as is because it can only run as a debugger command. That means the great majority of the time spent between two TSC commands goes into running the debugger. On our amd64 server in dosemu2 (no KVM, 86 mode or native DPMI mode) running a single instruction (like mov or nop) takes more than 3 million TSC ticks. An int 21h call takes less than 8 million ticks. So it is clearly not useful.

If you're interested I may add new ELD hooks into the debugger so as to run an ELD close to the point that the debugger runs debuggee code (in the run function) and shortly after debuggee code breaks back into the debugger (intrtn family). This may allow the TSC ELD to gather more useful data. Although, it will still run dozens of instructions in the debugger, necessarily. I'd have to patch the debugger itself to run rdtsc to lower the overhead even more.

I will also add more variables, probably by multiplexing the single variable as if it was an "array".

boeckmann commented 5 days ago

> That means the great majority of the time spent between two TSC commands goes into running the debugger. On our amd64 server in dosemu2 (no KVM, 86 mode or native DPMI mode) running a single instruction (like mov or nop) takes more than 3 million TSC ticks. An int 21h call takes less than 8 million ticks. So it is clearly not useful.

Thanks for looking at this! Sadly, as it is now, this contradicts my intended use case of measuring the TSC for single function calls or even a few instructions. It could be circumvented by altering the debuggee to run the code to be tested inside a loop multiple times for the measurement. But my hope is to use this measurement facility without having to alter the binary.

I understand that would have to be implemented deep inside the debugger. So I am fine if this is not going to happen...

ecm-pushbx commented 5 days ago

I'm looking at the source and this is where run transfers control to the debuggee and this is where control is back in the debugger. Depends on if you want to have DDebug/CDebug run the hook with their handlers (like the debuggee, closer to debuggee time) or with the outer debugger's handler (more overhead but allows debugging the hook).

But yeah, there's certainly some overhead left. The lowest we could do would run very close to the iret and early in the int1 or the int3 handlers. Forget about the ELD then, because all conventional ELD interfaces want to run on our stack and have the PM/86M dual mode model set up. (Or add half a dozen new ELD hooks with very particular environmental requirements?)

ecm-pushbx commented 5 days ago

I'm seriously considering just throwing more interfaces at it. Worst case we can disable them at build time. There is a precedent: The AMIS ELD interface always runs in Real or Virtual 86 Mode and cannot make use of most debugger interfaces.

boeckmann commented 5 days ago

Perhaps the debugger could do a calibration by "dry running" a TSC measurement of a single instruction and using this TSC as an overhead value?
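The calibration idea amounts to measuring an empty measurement and subtracting that fixed cost; a Python sketch with `perf_counter_ns` standing in for the TSC (`calibrate` and `measure` are illustrative names, not debugger commands):

```python
import time

def calibrate(samples=1000):
    """Estimate the fixed cost of the measurement itself by "dry running"
    an empty measurement many times and taking the minimum."""
    overhead = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()  # nothing measured: pure overhead
        overhead.append(t1 - t0)
    return min(overhead)

def measure(fn, overhead):
    """Time fn and subtract the calibrated overhead, clamping at zero."""
    t0 = time.perf_counter_ns()
    fn()
    return max(0, time.perf_counter_ns() - t0 - overhead)
```

Taking the minimum rather than the mean keeps occasional interrupts or host scheduling from inflating the overhead estimate, though it cannot remove the variance from the measurement itself.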

ecm-pushbx commented 5 days ago

That's a fair idea but I'd have to study / test this to see how well it does. It probably could be combined with any of these choices though.

boeckmann commented 5 days ago

For me it does not have to be exact to the number. But the order of magnitude should be correct :)

ecm-pushbx commented 5 days ago

Yeah. I am considering another path: we could insert a patch area, e.g. 32 nop instructions, into some of the handlers. Then an ELD could write the code to run rdtsc right there, needing only the (writable) ds segment/selector. E.g.:

  push eax               ; 2
  push edx               ; 2
  rdtsc                  ; 2
  mov dword [var], eax   ; 4 (low half)
  mov dword [var+4], edx ; 5 (high half)
  pop edx                ; 2
  pop eax                ; 2
  ; 19 bytes

boeckmann commented 5 days ago

> Then an ELD could write the code to run rdtsc right there, needing only the (writable) ds segment/selector.

Probably more accurate than doing the calibration thing.

ecm-pushbx commented 5 days ago

> Then an ELD could write the code to run rdtsc right there, needing only the (writable) ds segment/selector.

> Probably more accurate than doing the calibration thing.

Might still look into it.

I did implement two patch areas now. For now, one of them is only used in 86 Mode; I might add a corresponding PM one as another patch area later. After installing ext tsc.eld install, which needs today's new build of the debugger, it will write the before-run TSC into lasttsc3:lasttsc2 and the after-run TSC into lasttsc5:lasttsc4. lasttsc1:lasttsc0 continues to be used for explicit tsc command results.

To make use of the variables here's a small scriptlet I improvised. It doesn't correctly subtract the 64-bit TSC variable pairs but usually the result will be correct.

re.replace @r; @ext printf.eld \"diff=%02X_%08X after=%X_%08X\\r\\n\" lasttsc5 - lasttsc3, lasttsc4 - lasttsc2, lasttsc5, lasttsc4

This gives us results below 64Ki (10000h) ticks for single instructions being traced. So it's much better than the prior approach, but I am not sure how close it gets to being useful.
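A correct subtraction of the 32-bit variable pairs needs a borrow from the low halves, which the improvised scriptlet omits; sketched in Python (`sub64` is an illustrative helper, not a debugger command):

```python
MASK32 = 0xFFFFFFFF

def sub64(hi_a, lo_a, hi_b, lo_b):
    """Subtract two 64-bit values stored as 32-bit (hi, lo) pairs, propagating
    the borrow from the low halves; mirrors lasttsc5:lasttsc4 minus
    lasttsc3:lasttsc2."""
    lo = (lo_a - lo_b) & MASK32
    borrow = 1 if lo_b > lo_a else 0  # this is the step the scriptlet skips
    hi = (hi_a - hi_b - borrow) & MASK32
    return hi, lo
```

The scriptlet is usually right anyway because the high halves rarely change between two nearby samples, so the missing borrow only matters when the low half wraps.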

ecm-pushbx commented 5 days ago

Varies between 1800h and 6000h TSC ticks to trace a NOP.

boeckmann commented 5 days ago

> Varies between 1800h and 6000h TSC ticks to trace a NOP.

That's quite a bit. It makes benchmarking a few hundred instructions fall way below the variance or "noise floor".

boeckmann commented 5 days ago

Any idea where this variance comes from? Enabled interrupts or something?

ecm-pushbx commented 5 days ago

No idea. IRQs could play a role. I see the same TSC stamp in 86 Mode as in native DPMI so the non-KVM 86 Mode emulation probably implements rdtsc by running the instruction on the host. So thread scheduling on the host could also play a role in this.

boeckmann commented 5 days ago

Do these variances also occur if using RDTSC directly in some code in QEMU? The read_tsc / diff_tsc combination results of the kernel were quite stable when running under 86Box.

ecm-pushbx commented 5 days ago

> Do these variances also occur if using RDTSC directly in some code in QEMU? The read_tsc / diff_tsc combination results of the kernel were quite stable when running under 86Box.

The single instruction trace takes below 2_0000h ticks there. A DOS call takes about between 4_0000h and 5_0000h. This is also without KVM.