This should be the relevant call stack involved in this:
func3f
fdos_handle
fdos_crit
fdos_nocrit
fdos_entry
fdos_read
fdosrw
deblock_rw
locate_buffer
Quite some room for something to go wrong. On reads smaller than a sector the EDR kernel reads into one of its buffers and does a "deblocking", meaning that it copies the relevant part of the read sector from the buffer to user space. The first thing we should check is whether locate_buffer does indeed return the already-read sector data, or whether this fails for some reason.
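As a rough illustration of the deblocking idea (a hedged sketch with invented register assignments, not the actual EDR kernel code): the sector already sits in a kernel buffer, and only the requested byte range is copied out to the caller.

; Hypothetical deblocking copy (not EDR source; register use invented)
; ds:si -> kernel sector buffer plus the offset within the sector
; es:di -> user (caller) buffer
; cx    =  requested transfer size, smaller than one sector
cld                 ; copy forward
rep movsb           ; move the partial sector to user space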
@mateuszviste can you please do a 512-byte read comparison? EDR should then use a different code path, doing direct reads into the user space buffer.
tim512 says:
EDR = 1.8s
FreeDOS = 1.6s
FreeDOS is still better, but only by some 10% now.
attached, just in case: tim512.zip
Thanks, I would not mind the 10% difference for the time being. So this strongly indicates that it is indeed some buffering / deblocking issue. I will try to find out what's going on with the deblocking / buffering code...
The block cache does not seem to cause this slowdown. I have not compared the sector numbers of the sectors read, but when loading the PL translation fewer than ~25 sector reads are performed, and several seconds pass between the reads. 25 sectors looks like a reasonable number taking the size of the translation into account...
I have to think about how to profile the kernel in general. I'd love to have something like Watcom wsample / wprof or Borland tprof, but these tools do not work for the kernel.
I am considering making a special build of the kernel utilizing the RDTSC x86 instruction (Pentium+), and using it to incrementally find out how many cycles the different kernel routines take, perhaps aided by some macros. But I have never used RDTSC, so I have to make a few test runs first...
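For reference, a minimal RDTSC measurement sketch (hedged; start_lo and start_hi are invented variables, this is not the kernel's code): RDTSC returns the 64-bit time stamp counter in EDX:EAX, so a measurement is two reads plus a 64-bit subtraction.

rdtsc                   ; EDX:EAX = TSC before the code under test
mov [start_lo], eax
mov [start_hi], edx
; ... code under test ...
rdtsc                   ; EDX:EAX = TSC after the code under test
sub eax, [start_lo]
sbb edx, [start_hi]     ; EDX:EAX = elapsed TSC ticks

start_lo dd 0           ; invented storage for the starting TSC
start_hi dd 0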
Maybe it's too primitive, but what about adding some one-byte variable in memory (seg 50h) and making every routine of the kernel set this variable when the routine is entered? Perhaps this could be automated through some macro that overloads the call instruction. Then a TSR would run the PIT at 1000 Hz and frequently poll this shared variable, identifying what part of the kernel is being executed at any given moment. This information would have to be written into some array of counters, so it would be easy to see where the CPU statistically spends most of its time.
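A hedged sketch of that sampling idea (nothing here exists yet; MARKER_OFS, counters and old_int8 are invented names): the 1000 Hz timer handler reads the marker byte from segment 50h and bumps a per-routine counter, building a statistical profile of where the kernel spends its time.

; Hypothetical timer (int 8) handler for the sampling idea
MARKER_OFS equ 0C0h       ; invented offset of the marker byte in seg 50h
timer_hook:
push ax
push bx
push ds
mov ax, 50h
mov ds, ax
mov bl, [MARKER_OFS]      ; which kernel routine is running right now?
xor bh, bh
push cs
pop ds                    ; switch to the TSR's own data
shl bx, 1
inc word [counters + bx]  ; one counter per marker value
pop ds
pop bx
pop ax
jmp far [cs:old_int8]     ; chain to the previous timer handler

old_int8 dd 0             ; previous int 8 vector, saved at install time
counters times 256 dw 0   ; sample counts indexed by marker value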
Not a bad idea with the variable at 50h. But if we start thinking about a TSR we could also record CS:IP values and match them against a memory map of the kernel, so we know exactly which functions were called without having to manually insert code for setting the variable at 50h to different values.
And surprisingly I found TSC functionality already built into the kernel; I have to test if this works :)
I figured out that on an emulated Pentium, when calling kernel function 39h (read), the routines fdrw_prepare and fdrw_seek each account for about 40% of the runtime, adding up to 80%. These are called by fdosrw. This is for buffered reads with the sector in the deblocking cache! The next step would be to analyze these functions for performance bottlenecks.
Here are the significant TSC differences for the func39 call stack (I left out the unimportant ones):
func39: 202f
fdosrw: 1A59
fdrw_prepare: 0C2A
div64: 0A4D
fdrw_size: 0075
deblock_rw: 0079
fdrw_seek: 0D15
div64: 0A4D
As you may notice, the div64 routine is slow as hell, at least on the 86Box Pentium machine. As the 8086 does not have a 64-bit divide instruction, this is implemented in assembly. About 80% of the runtime of fdrw_seek and fdrw_prepare is due to div64. So if we want to speed things up, we should first try to improve this, if it's possible.
I did a very naive optimisation, as I noticed that the stacks for div64 and div32 are the same: https://hg.pushbx.org/ecm/edrdos/rev/374911957af5 Please test!
It could be further optimised for the case where the divisor is known to fit in 16 bits, with a special code path that does chained 32/16-bit division enough times to divide the entire 64-bit dividend.
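A hedged sketch of that special case (register and label choices are mine, not taken from the EDR source): with a 16-bit divisor, classic long division word by word keeps the running remainder in dx, so four chained 32/16-bit DIV instructions divide the whole 64-bit dividend.

; Hypothetical 64-bit by 16-bit division, divisor in cx (nonzero),
; dividend in words w3 (most significant) .. w0 (least significant).
; The quotient replaces the dividend in place, the remainder ends up in dx.
xor dx, dx          ; running remainder starts at zero
mov ax, [w3]
div cx              ; dx:ax / cx -> quotient word in ax, remainder in dx
mov [w3], ax
mov ax, [w2]
div cx
mov [w2], ax
mov ax, [w1]
div cx
mov [w1], ax
mov ax, [w0]
div cx
mov [w0], ax        ; dx now holds the final 16-bit remainder

Each DIV cannot overflow because the incoming remainder in dx is always smaller than the divisor in cx.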
Can you tell me how to use the TSC handling?
I did a very naive optimisation, as I noticed that the stacks for div64 and div32 are the same: https://hg.pushbx.org/ecm/edrdos/rev/374911957af5 Please test!
SUCCESS! Down to 017A "TSC". Let's make a unit out of it :)
Can you tell me how to use the TSC handling?
I call read_tsc to start the measurement and call diff_tsc to stop and print the diff as a hex value. The routines are in bdevio.asm. Would be awesome to have this in lDebug, in the sense that lDebug would print the TSC diff after a "step into and returning".
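In other words the usage looks roughly like this (a hedged sketch; fdrw_seek is just an example of a routine one might wrap):

call read_tsc       ; start the measurement (routine in bdevio.asm)
call fdrw_seek      ; the code being measured
call diff_tsc       ; stop and print the TSC difference as hex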
Does everyone see the big smile on my face? :D
Can you tell me how to use the TSC handling?
I call read_tsc to start the measurement and call diff_tsc to stop and print the diff as a hex value. The routines are in bdevio.asm. Would be awesome to have this in lDebug, in the sense that lDebug would print the TSC diff after a "step into and returning".
I'm still confused, which is no good for adding it to the debugger (either internal or in an extension).
The numbers in that reply, how did you obtain them? Did you run one function at a time, with calls to the TSC handlers before and after? So you had to edit the source 8 times?
Here are the significant TSC differences for the func39 call stack (I left out the unimportant ones):
func39: 202f
fdosrw: 1A59
fdrw_prepare: 0C2A
div64: 0A4D
fdrw_size: 0075
deblock_rw: 0079
fdrw_seek: 0D15
div64: 0A4D
So you had to edit the source 8 times?
That is correct.
Does everyone see the big smile on my face? :D
So down from 146s to 43s. That's fantastic! Even slightly better than FreeDOS. :-D
Does everyone see the big smile on my face? :D
So down from 146s to 43s. That's fantastic! Even slightly better than FreeDOS. :-D
Where did you get the 146s from? Was that the tim2 result? What is the number for FreeDOS?
On the machine I tested, FreeDOS takes 48.2s and EDR 43s. I think there is still room for improvement, but currently it's a pain in the ass to profile. I have to think of a better way of testing this before making further improvements.
The 146s was from Mateusz's initial testing with the EDR kernel, see https://github.com/SvarDOS/bugz/issues/127#issuecomment-2388606676
So closing this for now, as we are not slower than FreeDOS anymore.
Did you use tscsel at all?
No.
I added a very simple tsc.eld today. You can build it with ldebug/source/eld$ ./makone.sh tsc (NASM + bash) or wait for the daily build before midnight +0200 today.
This only supports one command in two modes, and one 32-bit variable. The ELD may be installed residently using ext tsc.eld install (and uninstalled using tsc uninstall). The tsc command displays the current, 64-bit TSC in the format XXXXYYYY_ZZZZAAAA. If it is resident it also writes the lower 32 bits into the lasttsc variable. A tsc quiet command only writes into the variable, without display, and thus is only useful when the ELD is residently installed.
It should detect at install / ELD load time when no TSC is available on the machine.
It can be used in a command chain like re.replace @tsc quiet; @h lasttsc - v0; @r v0 = lasttsc; @r and then p 10, for instance, will show the differences after the first step. (Not for the first step itself.)
However, this is barely useful as is because it can only run as a debugger command. That means the great majority of the time spent between two TSC commands goes into running the debugger. On our amd64 server in dosemu2 (no KVM, 86 mode or native DPMI mode) running a single instruction (like mov or nop) takes more than 3 million TSC ticks. An int 21h call takes less than 8 million ticks. So it is clearly not useful.
If you're interested I may add new ELD hooks into the debugger so as to run an ELD close to the point where the debugger runs debuggee code (in the run function) and shortly after debuggee code breaks back into the debugger (the intrtn family). This may allow the TSC ELD to gather more useful data. Although it will still run dozens of instructions in the debugger, necessarily. I'd have to patch the debugger itself to run rdtsc to lower the overhead even more.
I will also add more variables, probably by multiplexing the single variable as if it was an "array".
That means the great majority of the time spent between two TSC commands goes into running the debugger. On our amd64 server in dosemu2 (no KVM, 86 mode or native DPMI mode) running a single instruction (like mov or nop) takes more than 3 million TSC ticks. An int 21h call takes less than 8 million ticks. So it is clearly not useful.
Thanks for looking at this! Sadly, as it is now this contradicts my intended use case of measuring the TSC for single function calls or even a few instructions. It could be circumvented by altering the debuggee to run the code to be tested inside a loop multiple times for measurement. But my hope is to use this measurement facility without having to alter the binary.
I understand that would have to be implemented deep inside the debugger. So I am fine if this is not going to happen...
I'm looking at the source: this is where run transfers control to the debuggee and this is where control is back in the debugger. It depends on whether you want to have DDebug/CDebug run the hook with their handlers (like the debuggee, closer to debuggee time) or with the outer debugger's handler (more overhead, but allows debugging the hook).
But yeah, there's certainly some overhead left. The lowest-overhead option would be to run very close to the iret and early in the int1 or int3 handlers. Forget about the ELD then, because all conventional ELD interfaces want to run on our stack and have the PM/86M dual mode model set up. (Or add half a dozen new ELD hooks with very particular environmental requirements?)
I'm seriously considering just throwing more interfaces at it. Worst case we can disable them at build time. There is a precedent: The AMIS ELD interface always runs in Real or Virtual 86 Mode and cannot make use of most debugger interfaces.
Perhaps the debugger could do a calibration by "dry running" a TSC measurement of a single instruction and using this TSC as an overhead value?
That's a fair idea but I'd have to study / test this to see how well it does. It probably could be combined with any of these choices though.
For me it does not have to be exact to the number, but the order of magnitude should be correct :)
Yeah. I am considering another path: we could insert a patch area, e.g. 32 nop instructions, into some of the handlers. Then an ELD could write the code to run rdtsc right there, needing only the (writable) ds segment/selector. E.g.:
push eax ; 2
push edx ; 2
rdtsc ; 2
mov dword [var], eax ; 4
mov dword [var+4], edx ; 5
pop edx ; 2
pop eax ; 2
; 19 bytes
Then an ELD could write the code to run rdtsc right there, needing only the (writable) ds segment/selector.
Probably more accurate than doing the calibration thing.
Then an ELD could write the code to run rdtsc right there, needing only the (writable) ds segment/selector. Probably more accurate than doing the calibration thing.
Might still look into it.
I did implement two patch areas now. For now one of them is only used in 86 Mode; I might add a corresponding PM one as another patch area later. After installing ext tsc.eld install, which needs today's new build of the debugger, it will write the before-run TSC into lasttsc3:lasttsc2 and the after-run TSC into lasttsc5:lasttsc4. lasttsc1:lasttsc0 continues to be used for explicit tsc command results.
To make use of the variables here's a small scriptlet I improvised. It doesn't correctly subtract the 64-bit TSC variable pairs but usually the result will be correct.
re.replace @r; @ext printf.eld \"diff=%02X_%08X after=%X_%08X\\r\\n\" lasttsc5 - lasttsc3, lasttsc4 - lasttsc2, lasttsc5, lasttsc4
This gives us results below 64 Ki (10000h) TSC ticks for single instructions being traced. So it's much better than the prior approach, but I am not sure how close it gets to being useful.
Varies between 1800h and 6000h TSC ticks to trace a NOP.
Varies between 1800h and 6000h TSC ticks to trace a NOP.
That's quite a bit. It makes benchmarking a few hundred instructions fall way below the variance or "noise floor".
Any idea where this variance comes from? Enabled interrupts or something?
No idea. IRQs could play a role. I see the same TSC stamp in 86 Mode as in native DPMI, so the non-KVM 86 Mode emulation probably implements rdtsc by running the instruction on the host. So thread scheduling on the host could also play a role in this.
Do these variances also occur if using RDTSC directly in some code in QEMU? The read_tsc / diff_tsc combination results of the kernel were quite stable when running under 86Box.
Do these variances also occur if using RDTSC directly in some code in QEMU? The read_tsc / diff_tsc combination results of the kernel were quite stable when running under 86Box.
The single instruction trace takes below 2_0000h ticks there. A DOS call takes between 4_0000h and 5_0000h. This is also without KVM.
This has been discovered during https://github.com/SvarDOS/bugz/issues/127 - here is a bootable floppy with a set of programs named tim2.exe, tim4.exe, tim8.exe, ... tim128.exe. Each of these programs does the same thing: it loads a data block from the FDISK.LNG file and measures how long it took. The difference is that each uses a differently sized buffer to perform the DOS reads: tim2 uses a 2-byte buffer, tim4 4 bytes, etc.
All tests have been performed under 86Box with a PC XT running at 4.77 MHz and 256K RAM.
Bootable 360K test floppy: test.zip
The test floppy has the FreeDOS kernel on it. If it is replaced with the EDR kernel, read performance becomes about 2x worse (more so for small buffers, less for bigger buffers).
To perform the test you need to run "timX pl" where X is the buffer size (2,4,8,16,32,64,128).
Example: tim4 pl on FreeDOS takes 23 seconds; the same program on EDR-DOS takes 88 seconds.
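For reference, a hedged sketch of what such a timX program presumably does (not the actual timX source; fname, buf and BUFSIZE are invented names, and the timing code is omitted): open FDISK.LNG and read it in BUFSIZE-byte chunks through int 21h function 3Fh, the DOS read path discussed above.

; Hypothetical timX-style read loop (invented labels, timing omitted)
mov ax, 3D00h       ; int 21h: open existing file, read-only
mov dx, fname       ; ds:dx -> "FDISK.LNG",0
int 21h
jc error
mov bx, ax          ; file handle
readloop:
mov ah, 3Fh         ; int 21h: read from handle
mov cx, BUFSIZE     ; 2, 4, 8, ... or 128 bytes per call
mov dx, buf         ; ds:dx -> read buffer
int 21h
jc error
test ax, ax         ; ax = bytes actually read; 0 means end of file
jnz readloop
mov ah, 3Eh         ; int 21h: close handle
int 21h
error: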