Incorrect timing of queued register writes in ymf262

M-HT commented 5 months ago

When you queue some register writes (aymo_ymf262_enqueue_write) before reading the output (aymo_ymf262_generate_i16x2), there is a difference in timings between nuked opl and simd versions.

In nuked opl the register writes occur at the end of 3rd, 5th, 7th, 9th, ... tick. In simd versions the register writes occur at the end of 1st, 4th, 7th, 10th, ... tick.

This results in different output with simd versions vs nuked opl.

Possible fix is to change the line chip->rq_delay = AYMO_(REG_QUEUE_LATENCY); in function aymo_(rq_update) to chip->rq_delay = AYMO_(REG_QUEUE_LATENCY) - 1; and add the line chip->rq_delay = AYMO_(REG_QUEUE_LATENCY); to the end of function aymo_(ctor).

A minor unrelated issue is that in file aymo_ym7128_common.h you're declaring variable aymo_ym7128_kernel_minèhase instead of aymo_ym7128_kernel_minphase.

TexZK commented 5 months ago

Hi! Thanks for spotting some misalignments! I tried your proposed solution, but I still get some differences by replacing aymo_(write)()/OPL3_WriteReg() with aymo_(enqueue_write)()/OPL3_WriteRegBuffered() within app_run() of test_ymf262_compare_epilogue_inline.h.

I think that's because my implementation pre-calculates as much as possible within aymo_(write), while Nuked OPL3 performs some updates within the DSP routine, delaying the effect of some register write. I don't know if it's mandatory to compute those "steady-state" values within the DSP routine, or if it's just some arbitrary legacy code.

Up to now this should be the only known difference between AYMO and Nuked OPL3. I didn't give it much importance because that just delays the effect of a register write, which would be affected by CPU/interrupt/HW latencies in an actual OPL3 system.

M-HT commented 5 months ago

I checked the failing tests and the register writes occur at the same tick in nuked opl and the simd versions (with my solution).

Just for testing I moved the the code block // Fix tremolo updates delayed w.r.t. AYMO from test_ymf262_compare_epilogue_inline.h to the end of function OPL3_WriteReg in opl3.c. With this change all tests were ok, so my solution is correct.

Because of this code block, your implementation sometimes generates different output than the original nuked opl. Wouldn't it be better to change your implementation to match the original nuked opl ?

Regardless of this, here is an example file where the tests fail (with aymo_(write)()/OPL3_WriteReg()): example-difference.zip

I haven't checked why, maybe I should open a new issue ?

TexZK commented 5 months ago

Hopefully I fixed tremolo (as per Nuked OPL3) with commit 17913f24ec4c179adc9436d1945f14a131a1db16.

I tested the score file you provided, comparing the output of the two implementations. There must still be some rare 1-sample delay somewhere on the right channel, perhaps linked to OPL_QUIRK_CHANNELSAMPLEDELAY (to be verified), It looks like DooM E1M1 is affected by the same tiny issue as well.

M-HT commented 5 months ago

You forgot to add eg_tremoloreq to file aymo_ymf262_arm_neon.h and to remove the code block // Fix tremolo updates delayed w.r.t. AYMO from file aymo_ymf262_none.c.

TexZK commented 5 months ago

Ah! I'm rather tired today 😅

Patched with commit 4374fe5974dab519355e865f5eb54b96296734f8, tested on RPi5

TexZK commented 5 months ago

Ok, commit f99285b9cfac01aebad91c8b92ca533de2d0636d corrected a few wrong slot delays related to OPL_QUIRK_CHANNELSAMPLEDELAY.

As per the open point about queue delays, I'd have to change a few scheduling policies to make them match Nuked's. I'm not really too much into it, because I think what Nuked does is a little overkill, and the code a little obfuscated. I mean, AFAIK Nuked schedules each writing w.r.t. the latest scheduled event (+2), as if it were a MIDI score. I prefer to just enqueue the register into a FIFO, which is then dequeued as soon as possible (+2). IMO this is what an actual PC would do, or at least what an emulation would; you don't schedule port outputs, you just make the CPU execute the out instruction.

M-HT commented 5 months ago

I see three problems/differences with your queue implementation: 1) You're not dequeuing as soon as possible +2, but as soon as possible +3. I fixed it by changing the line chip->rq_delay = AYMO_(REG_QUEUE_LATENCY); in function aymo_(rq_update) to chip->rq_delay = AYMO_(REG_QUEUE_LATENCY) - 1;. 2) In nuked the earliest dequeue occurs in the third tick (first tick in your implementation). I fixed it by adding the line chip->rq_delay = AYMO_(REG_QUEUE_LATENCY); to the end of function aymo_(ctor).

In my understanding with these changes your implementation matches nuked (no test I did shows a difference) except the third problem.

3) In your implementation you're not handling buffer overflow in rq_buffer. In nuked this is handled by writing the oldest queued value immediately (which frees a place in the queue). I don't have a real world example where buffer overflow occurs, but it should probably be handled, just in case.

I found another example where tests fail: example-difference-2.zip

TexZK / aymo

Incorrect timing of queued register writes in ymf262 #1