AidanHockey5 / STM32_VGM_Player_YM2612_SN76489

A Sega Genesis music player based off of the STM32 BluePill board and real YM2612+SN76489 sound chips.
GNU Affero General Public License v3.0

Some tunes with heavy (drum) samples usage slow down the replay speed #2

Closed rylecqfd closed 3 years ago

rylecqfd commented 6 years ago

Hi

It seems that some tunes that make heavy use of (drum) samples have their speed slowed down each time a sample is played.

Here is one example: Matt Furniss - The Terminator.

AidanHockey5 commented 6 years ago

Yep, that is a known limitation of the hardware. The tracks you have picked are by an artist who is known for pushing the Genesis' musical hardware to its limits, so my little player is probably not going to be able to keep up with the heavy usage of PCM samples.

Most VGM files will play back at full speed regardless of their PCM sampling. Just a few particular PCM-heavy tracks saturate the data bus and prevent full-speed playback.

One particular torture test that I like to use is actually Green Hill Zone from Sonic the Hedgehog. That track's PCM drums often brought my previous prototypes to their knees. The MegaBlaster can keep up for the most part, but the tempo still lags behind just a bit. I'd be interested to see if anybody could modify the firmware to write PCM samples faster than what I have here!

natarii commented 5 years ago

This is an old issue, but I briefly looked at your code, and while I don't have any STM32 hardware to test this on, I suspect you are running into this problem due to the overhead of setting up and tearing down the SPI transfer from the SRAM for every sample.

I am working on a similar project and it is able to play these files without any speed issues. My solution: instead of using a large chunk of RAM, use relatively small buffers (a few K) and load the VGM file using separate streams into two FIFOs, one for raw VGM commands and the other for the attached PCM data. When reading an 0x8n command from the file, calculate the offset of that sample and read it into the PCM FIFO before inserting the 0x8n command into the VGM FIFO. This way, during playback, when you hit an 0x8n you can just pull the first sample out of the PCM FIFO.

Possible complications:

That being said, it's probably worth dealing with this to get the ability to play "busy" VGMs, and also VGMs that have more PCM data than would fit in the SRAM.
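The two-FIFO loading scheme described above could be sketched roughly as follows. This is a minimal host-side illustration, not code from either project: the `Loader` class, `operandBytes`, and the use of `std::deque` are assumptions made for the example (real firmware would more likely use fixed-size ring buffers), and only a small subset of VGM commands is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Operand bytes that follow a command byte (small subset of the VGM format).
// Commands not listed here would desynchronise this sketch on real files.
inline std::size_t operandBytes(uint8_t cmd) {
    switch (cmd) {
        case 0x50: return 1;             // SN76489 write: dd
        case 0x52: case 0x53: return 2;  // YM2612 port 0/1 write: aa dd
        case 0x61: return 2;             // wait nnnn samples
        case 0xE0: return 4;             // seek to an offset in the PCM data bank
        default:   return 0;             // 0x62, 0x63, 0x66, 0x7n, 0x8n
    }
}

// Hypothetical loader: splits the file into a command FIFO and a PCM FIFO,
// resolving every 0x8n sample at load time so the player never has to seek.
class Loader {
public:
    Loader(const std::vector<uint8_t>& cmds, const std::vector<uint8_t>& bank)
        : cmds_(cmds), bank_(bank) {}

    std::deque<uint8_t> vgmFifo;  // commands, consumed by the player
    std::deque<uint8_t> pcmFifo;  // samples, in the order they will be played

    // Loads one command. Returns false after the end-of-data command (0x66)
    // or when the command stream is exhausted.
    bool loadOne() {
        if (pos_ >= cmds_.size()) return false;
        const uint8_t cmd = cmds_[pos_++];
        const std::size_t n = operandBytes(cmd);
        assert(pos_ + n <= cmds_.size());

        if (cmd == 0xE0) {
            // The loader consumes the seek itself (32-bit little-endian offset).
            pcmPos_ = static_cast<std::size_t>(cmds_[pos_])
                    | (static_cast<std::size_t>(cmds_[pos_ + 1]) << 8)
                    | (static_cast<std::size_t>(cmds_[pos_ + 2]) << 16)
                    | (static_cast<std::size_t>(cmds_[pos_ + 3]) << 24);
            pos_ += 4;
            return true;
        }
        if ((cmd & 0xF0) == 0x80) {
            // Queue the sample first, then the command that will play it.
            assert(pcmPos_ < bank_.size());
            pcmFifo.push_back(bank_[pcmPos_++]);
        }
        vgmFifo.push_back(cmd);
        for (std::size_t i = 0; i < n; ++i) vgmFifo.push_back(cmds_[pos_++]);
        return cmd != 0x66;
    }

private:
    const std::vector<uint8_t>& cmds_;  // raw VGM command bytes
    const std::vector<uint8_t>& bank_;  // contents of the PCM data block
    std::size_t pos_ = 0;               // read position in the command stream
    std::size_t pcmPos_ = 0;            // current offset into the PCM bank
};
```

With this split, the player side only ever pops from the front of two queues, so the cost of locating a sample is paid by the loader rather than in the time-critical playback path.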

Here are some demo videos of playing a few of the tracks mentioned in this issue. Just recorded real quickly with my phone camera, but you get the idea. This technique is being used. https://youtu.be/QawYF8nzBoE https://youtu.be/OeX5HKd4tyY

Links to my project source. The relevant parts will be in loader.c and driver.c: https://git.agiri.ninja/natalie/megagrrl/blob/master/main/loader.c https://git.agiri.ninja/natalie/megagrrl/blob/master/main/driver.c

If you contact me off GitHub I can also provide further pointers on implementing this, or VGM DAC stream control support!

AidanHockey5 commented 5 years ago

Wow, first off, let me just say that your player is absolutely remarkable. I'm jealous! Your playback speed is flawless and it's amazing to see this setup running on an ESP32! Great to see others in this space.

So you mention FIFO buffering - Long ago, this project did actually have a buffer system similar to what you're talking about (I think there are still artifacts of the buffer system commented out). Somewhere along the line, I discovered that removing my buffers resulted in more consistent playback speeds overall, but a few tracks seemed to suffer whenever they contained busy PCM sequences. I probably need to rethink how I handled buffering before and implement a system closer to what you've explained above.

Changes to how the SD card works would require a hardware revision, which I'm not against, but I think I'm just about out of I/O pins on the STM32 board I'm using, so that is going to require a bit of engineering. I'm pretty certain that the current read-speeds that I'm getting are plenty fast enough for FM and PSG data, but for reading PCM samples straight off of the card? No way (hence the RAM). Though, seeing how flawlessly fast your project seems to play back, I might need to reconsider my stance!

Most of this project's code is Frankensteined together from previous iterations of the project. Back when I first started, there was very little info on anything like this, so there is a lot of messy trial-and-error R&D code still left over. I think I should refactor the entire thing one day. I think that this project went through three different architectures (ESP8266->Teensy 3.5->STM32), so my code is all over the place!

Your insight is fascinating to me - these are problems that I suspected, but could never fully "realize" because I wasn't really sure what to look for exactly. Thank you for the advice :)

piklz commented 5 years ago

Is this something that can be updated in your software, or does it mean a hardware swap? I'm def making one anyway, but would like to have one that functions well with all songs if possible.

AidanHockey5 commented 5 years ago

I think there is a chance that performance could be significantly improved via software, but it would require a pretty extensive rewrite. Implementing a buffering system like @natarii would be pretty tricky, but it would likely drastically improve PCM playback speeds. I'm still of the opinion though that I'm a bit hardware limited, so I may need to rethink some of my board layout too. Either way though, I think rewriting the software for this project is a definite goal of mine, so I'll see what I can do in that regard.

piklz commented 5 years ago

That would be awesome! Both of you working together would be neat! I'm going to order your BOM/pcs etc. anyway and make the current version, so hopefully any improvements will be greatly appreciated!

natarii commented 5 years ago

Hi, sorry, I didn't check GitHub for a few days.

I ordered an STM32 Blue Pill board to play with. I didn't realize it had "only" 20K of RAM, but I have a feeling this is still doable on it. My earlier experiments in this project were on an ATmega2560, a Teensy 3.6, and finally an ESP32, so yeah I've gone through several revisions as well. Everything I've done was also doable on the Teensy 3.6, just a bit hackier because it doesn't have the luxury of the ESP32's second core and all the FreeRTOS queues and stuff to pass data around.

Unsure about whether 1-bit SD mode will be fast enough to do this in realtime. A lot of VGMs seem to have tons of redundant register writes when playing PCM, and you may be looking at a requirement of 150KB/s+ read speed to be able to handle it. I think this is well within the theoretical read speed, but it may work out differently in practice, and/or be dependent on the specific card. Getting rid of the SPI RAM may free up enough IOs to use the card in 4-bit mode, assuming there are no restrictions on which pins can be used for that. (Lack of IOs pushed me to use shift registers for the sound chip bus on the ESP32, but I'm also hitting them pretty fast at 20+ MHz)

You'll probably want to move the actual chip output stuff into a timer interrupt, and either get a library or write from scratch some ISR-safe FIFOs. You can also just disable timer interrupts when sticking new data into the FIFOs although that isn't quite as clean.
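An ISR-safe FIFO of the kind mentioned above is often built as a single-producer/single-consumer ring buffer. The sketch below is one common way to do it, not code from either player: the `SpscFifo` name and the choice of `std::atomic` indices are assumptions for illustration. It relies on the main loop being the only caller of `push` and the timer interrupt being the only caller of `pop`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Single-producer / single-consumer byte FIFO.
// One slot is always left empty, so the usable capacity is N - 1 bytes.
template <std::size_t N>
class SpscFifo {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer side (for example, the main loop refilling from the SD card).
    bool push(uint8_t value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buf_[head] = value;
        head_.store(next, std::memory_order_release);  // publish after the data
        return true;
    }

    // Consumer side (for example, the timer interrupt feeding the chips).
    bool pop(uint8_t& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }

    std::size_t size() const {
        return (head_.load(std::memory_order_acquire)
              - tail_.load(std::memory_order_acquire)) & (N - 1);
    }

private:
    uint8_t buf_[N];
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};

// Usage example: what the two sides would each do with a shared FIFO.
inline void fifoUsageExample() {
    SpscFifo<8> fifo;
    const bool pushed = fifo.push(0x52);
    uint8_t byte = 0;
    const bool popped = fifo.pop(byte);
    assert(pushed && popped && byte == 0x52);
    (void)pushed; (void)popped;
}
```

Because each index has exactly one writer, neither side needs to disable interrupts. The alternative mentioned above, masking the timer interrupt around every insert, is simpler but adds jitter to the output timing.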

If you want to collab on anything, let me know, I'd enjoy that. I can also get you a schematic of my hardware, only haven't put it in git yet because there are a few minor errors that need to be fixed, although they're related to power supply stuff rather than anything VGM-related

AidanHockey5 commented 5 years ago

I'd love to collaborate! I'm really interested in seeing how you'd approach this on a blue-pill board. The RAM requirements really make buffering systems tough to pull off. I think the ESP32 has something along the lines of 520KB of RAM (wow) which would be an absolute luxury for me. Buffering out FM+PSG data shouldn't really be an issue, but as always, once PCM samples come into play, the game changes a bit.

Another thing that's a bit jank hardware-wise with my player is how the main 8-bit data bus is set up. As much as I would have liked to have directly used a port from the MCU, there isn't an "uninterrupted" row of GPIO on the same port that doesn't have other crucial data bus types on them (SPI, UART, I2C, etc.). So I'm forced to span over several ports and have to write to them using a really ugly block of code.

((data >> 0)&1) == HIGH ? GPIOB->regs->ODR |= 1 << 8 : GPIOB->regs->ODR &= ~(1 << 8); //PB8
((data >> 1)&1) == HIGH ? GPIOB->regs->ODR |= 1 << 9 : GPIOB->regs->ODR &= ~(1 << 9); //PB9
((data >> 2)&1) == HIGH ? GPIOC->regs->ODR |= 1 << 13 : GPIOC->regs->ODR &= ~(1 << 13); //PC13
((data >> 3)&1) == HIGH ? GPIOC->regs->ODR |= 1 << 14 : GPIOC->regs->ODR &= ~(1 << 14); //PC14
((data >> 4)&1) == HIGH ? GPIOC->regs->ODR |= 1 << 15 : GPIOC->regs->ODR &= ~(1 << 15); //PC15
((data >> 5)&1) == HIGH ? GPIOA->regs->ODR |= 1 << 0 : GPIOA->regs->ODR &= ~(1 << 0); //PA0
((data >> 6)&1) == HIGH ? GPIOA->regs->ODR |= 1 << 1 : GPIOA->regs->ODR &= ~(1 << 1); //PA1
((data >> 7)&1) == HIGH ? GPIOA->regs->ODR |= 1 << 2 : GPIOA->regs->ODR &= ~(1 << 2); //PA2

I'm not sure how much this slows things down since a boolean operation has to be performed for every bit, but it's certainly faster than digitalWrite()
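One alternative to the per-bit read-modify-write block above is the STM32's BSRR register, where the low 16 bits set pins and the high 16 bits reset them, so each port needs only a single store. The sketch below just computes those three words for the pin mapping in the snippet (D0-D1 on PB8-PB9, D2-D4 on PC13-PC15, D5-D7 on PA0-PA2). The `BusWrite`, `bsrrFor`, and `encodeBus` names are made up for the example, and whether this is measurably faster on the real board has not been tested here.

```cpp
#include <cassert>
#include <cstdint>

// One BSRR word per port: bits 0-15 set pins, bits 16-31 reset pins.
struct BusWrite {
    uint32_t portA;
    uint32_t portB;
    uint32_t portC;
};

// Places the low `pinCount` bits of `bits` on consecutive pins starting at
// `firstPin`: 1 bits go into the set half, 0 bits into the reset half.
inline uint32_t bsrrFor(uint32_t bits, unsigned firstPin, unsigned pinCount) {
    assert(firstPin + pinCount <= 16);
    const uint32_t mask = (1u << pinCount) - 1u;
    const uint32_t set = (bits & mask) << firstPin;
    const uint32_t reset = (~bits & mask) << (firstPin + 16);
    return set | reset;
}

// Splits one data-bus byte across the three ports used in the thread.
inline BusWrite encodeBus(uint8_t data) {
    BusWrite w;
    w.portA = bsrrFor(static_cast<uint32_t>(data) >> 5, 0, 3);   // D5..D7 -> PA0..PA2
    w.portB = bsrrFor(static_cast<uint32_t>(data) >> 0, 8, 2);   // D0..D1 -> PB8..PB9
    w.portC = bsrrFor(static_cast<uint32_t>(data) >> 2, 13, 3);  // D2..D4 -> PC13..PC15
    return w;
}

// On the board this would become three stores, assuming the core exposes the
// register as BSRR alongside ODR:
//   GPIOA->regs->BSRR = w.portA;
//   GPIOB->regs->BSRR = w.portB;
//   GPIOC->regs->BSRR = w.portC;
```

A further option along the same lines would be a 256-entry lookup table of precomputed `BusWrite` values, trading about 3 KB of flash for the shifts and masks.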

One thing that I'm fascinated by is your playback speed over shift registers. Long ago, my players used 74HC595's on the data buses, but they were abysmally slow. Perhaps it was because my software back then was very primitive, but I've always steered clear of a shift register-based approach since. I'm wondering if I should explore using them again to free up some I/O and maybe add 4-bit SD support instead.

@natarii if you could send me an email here https://www.aidanlawrence.com/contact/ I'd like to connect with you and collaborate!

natarii commented 5 years ago

It does have 500some KB of RAM, but I'm only using 20K total of that for VGM/PCM data. I'm pretty sure you could get away with less, I just wanted to be on the safe side since the CPU core that's filling the buffers is also busy handling the display and other tasks.

Yup, finding a block of GPIO on a single register is a pain. Writing all those different registers is basically all you can do, but I would hope that writing them is relatively fast (it certainly is on the Teensy, I had to do the same there). Guessing your bottleneck isn't there (setting each bit probably only takes a handful of cycles)

For shift registers, I'm using 74HCT595 parts. Specifically the HCT family, because you can really hammer them (the parts I chose are rated for something like a 30MHz max clock at room temperature) and you get "free" 3.3->5v level conversion. Using the SPI peripheral to drive them with a hacked library that avoids some of the setup time (even very fast setup times get expensive when you're only writing two bytes at a time). Once you get the clock that high, you have the ability to write to the YM2612 faster than it can accept, so the 2612 itself becomes the bottleneck.

I'll contact you~

jareklupinski commented 5 years ago

When I ran into this issue, I experimented with preloading just the PCM table into RAM and looking it up later using modified offsets from the sound file. Worked for every song except Sonic Spinball Lava Powerhouse :)

Eventually I switched to adding a high-speed SPI flash chip and dumped the PCM table there for lookup. hmu and I'll tell you all the other ways I've failed ;) https://github.com/jareklupinski/sega-genesis-forever/blob/master/sega-genesis-forever.ino

AidanHockey5 commented 5 years ago

Hey Jarek! Your SPI Flash chip solution is actually pretty similar to what I'm doing with my external SPI RAM chip, but if I'm reading your code right, it looks like you also have a little PCM buffer on the MCU as well.

Every time I begin a track, I'll look for PCM data and shove it into the SPI RAM chip. Once it's in RAM, I can simply send over the address that the VGM data is calling for and the RAM chip will spit back the sample I need. For the most part, this works fine - but some of the busier tracks seem to kill my playback speed. I've also tried making a small local PCM cache system for tracks that have PCM data that can fit within about 12K of RAM, and while it helps a little, there is still quite a bit of slowdown on a couple of tracks.

Besides, now that I look at it, I think my code is a big mess and needs to be totally refactored. Here's hoping for more performance with minimal hardware changes!

AidanHockey5 commented 5 years ago

Hey everyone, I have just rewritten the firmware from scratch. Version 2 is up and ready to go. Here are the notes on my changes: https://github.com/AidanHockey5/STM32_VGM_Player_YM2612_SN76489/pull/3

Playback speed has been significantly improved, though, there are still limitations with the hardware that are keeping me from playing songs like The Terminator tracks listed above. I simply don't have enough RAM to keep everything cached away in buffers and since my SD card is in 1-bit mode, I can't replenish the buffers fast enough to keep in time. Therefore, unless there is someone out there that can perform some serious magic, I think this is about as good as it's going to get for this iteration of the hardware.

For all the other tracks with more reasonable PCM sessions, this new version is night-and-day better!

jnftech commented 5 years ago

Hi Aidan! Random curiosity, as I have been playing with the MegaBlaster and have experienced this slowdown: has anyone tinkered with the STM32 clock speed? I understand overclocking is possible, and through messing around with the code (warning: I'm not a coder!), I was able to get the board and code compiled for 96 MHz. The issues were that the playback speed was faster than normal, and the timing of data to the chips seemed off as well (the sound was very glitchy). I tried going faster than 96 MHz (I read that folks are able to get these as high as 128 MHz on the stock crystal), but at that point the sound wouldn't work at all. The changes I made were very few, so I imagine there are other timing things in the code I'm not skilled enough to tweak.

To get it to work at all, I set "board_build.f_cpu = 96000000L" in the platformio.ini file, manually edited a PlatformIO source file (.platformio\packages\framework-arduinoststm32\STM32F1\variants\generic_stm32f103c\wirish\boards_setup.cpp) to include 96000000 as an option (code below), and then had to manually set the SD card to 36 MHz (instead of F_CPU/2 on line 141).

GitHub won't let me upload audio files so here are links to some samples (The Ecco track has slowdown when the samples hit). http://temp.jnftech.net/2019082701 Megablaster Overclock Test Ecco.m4a http://temp.jnftech.net/2019082701 Megablaster Overclock Test ShinobiIII.m4a http://temp.jnftech.net/2019082701 Megablaster Overclock Test SoR2.m4a

@ Line 50 of boards_setup.cpp

#ifndef BOARD_RCC_PLLMUL
  #if F_CPU==72000000
    #define BOARD_RCC_PLLMUL RCC_PLLMUL_9
  #elif F_CPU==96000000
    #define BOARD_RCC_PLLMUL RCC_PLLMUL_12
  #elif F_CPU==48000000
    #define BOARD_RCC_PLLMUL RCC_PLLMUL_6
  #endif
#endif

What are your thoughts? (or am I crazy for trying this? :) )
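The arithmetic behind the overclock can be worked through as below. With the stock 8 MHz crystal, PLL multipliers of 9 and 12 give 72 MHz and 96 MHz. One possible (unverified) explanation for the faster playback is that some timing constant was still derived from 72 MHz. The sketch assumes a hypothetical timer running at the CPU clock that fires once per 44.1 kHz VGM sample; the firmware's real timer setup is not shown in this thread and may differ.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kHseHz = 8000000;        // assumption: stock 8 MHz crystal
constexpr uint32_t kVgmSampleRate = 44100;  // VGM delays count 44.1 kHz samples

// PLL output when fed from the external crystal: f_cpu = HSE * PLLMUL.
inline uint32_t cpuHz(uint32_t pllMul) {
    assert(pllMul >= 2 && pllMul <= 16);  // multiplier range on the STM32F103
    return kHseHz * pllMul;
}

// Timer ticks per VGM sample for a timer clocked at timerHz (nearest integer).
inline uint32_t ticksPerSample(uint32_t timerHz) {
    return (timerHz + kVgmSampleRate / 2) / kVgmSampleRate;
}

// Sample rate that actually results when a tick count is used at timerHz.
inline double effectiveRateHz(uint32_t timerHz, uint32_t ticks) {
    assert(ticks > 0);
    return static_cast<double>(timerHz) / ticks;
}
```

Under these assumptions a 72 MHz timer needs 1633 ticks per sample and a 96 MHz timer needs 2177. Reusing 1633 at 96 MHz would tick at roughly 58.8 kHz, about 1.33 times too fast, and fixed busy-wait delays between chip writes would shrink by the same factor.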

krsshtx commented 4 years ago

Hi Aidan! Can you check playback of this ost? https://project2612.org/details.php?id=124

AidanHockey5 commented 4 years ago

Hello krsshtx,

Battletoads is infamously difficult to play back, so it's unlikely that my player would be able to render it without any slowdown. A project that can handle that OST would be Natalie's MegaGRRL player! https://kunoichilabs.dev/

krsshtx commented 4 years ago

> Battletoads is infamously difficult to play back, so it's unlikely that my player would be able to render it without any slowdown.

Looks like it's more complex: beeping on some tracks and tempo drift on others.

I have another strange issue. A clicking sound instead of drums in High Seas Havoc o_O No drums at all.

https://project2612.org/details.php?id=343

krsshtx commented 4 years ago

case 0x4F: case 0x50: case 0x52: case 0x53:


return 1

1 = 1/44100 s = 22675.736961451 ns

YM2203 datasheet: address setup time 10 ns min, write data setup time 100 ns min, write pulse width 200 ns min

krsshtx commented 4 years ago

re-mix https://github.com/krsshtx/YM

AidanHockey5 commented 3 years ago


At long last with much trial and error, this issue has been fixed. The reason behind it was mostly due to a misunderstanding I had with how VGM data was supposed to be parsed. Instead of literally taking one sample every 1/44100th of a second, you are supposed to write entire blocks of data to the chips as soon as possible and ONLY tick delay commands at 44100Hz.
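The parsing rule described above can be sketched as a tick handler that burns one 44.1 kHz tick only while a delay is pending, and otherwise drains commands until the next delay appears. This is an illustrative host-side model, not the firmware's actual code: the `Player` class and its `chipWrites` counter (a stand-in for real bus writes) are assumptions, and only a subset of VGM commands is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class Player {
public:
    explicit Player(const std::vector<uint8_t>& vgm) : vgm_(vgm) {}

    uint32_t waitSamples = 0;    // samples still to wait before the next command
    std::size_t chipWrites = 0;  // stand-in for writes sent to the sound chips
    bool done = false;

    // Call once per 1/44100 s. Writes cost no ticks; only delays do.
    void tick() {
        if (waitSamples > 0) --waitSamples;
        while (!done && waitSamples == 0) waitSamples += parseCommand();
    }

private:
    // Executes one command and returns the number of samples to wait after it.
    uint32_t parseCommand() {
        assert(pos_ < vgm_.size());
        const uint8_t cmd = vgm_[pos_++];
        switch (cmd) {
            case 0x50:  // SN76489 write: dd
                pos_ += 1; ++chipWrites; return 0;
            case 0x52: case 0x53:  // YM2612 port 0/1 write: aa dd
                pos_ += 2; ++chipWrites; return 0;
            case 0x61: {  // wait nnnn samples (16-bit little-endian)
                const uint32_t n = static_cast<uint32_t>(vgm_[pos_])
                                 | (static_cast<uint32_t>(vgm_[pos_ + 1]) << 8);
                pos_ += 2;
                return n;
            }
            case 0x62: return 735;  // wait 1/60 s
            case 0x63: return 882;  // wait 1/50 s
            case 0x66: done = true; return 0;  // end of sound data
            default:
                if ((cmd & 0xF0) == 0x70) return (cmd & 0x0Fu) + 1u;  // wait n+1
                if ((cmd & 0xF0) == 0x80) {  // PCM write, then wait n
                    ++chipWrites;
                    return cmd & 0x0Fu;
                }
                done = true;  // unknown command: stop, in this sketch
                return 0;
        }
    }

    const std::vector<uint8_t>& vgm_;
    std::size_t pos_ = 0;
};
```

Under the earlier one-command-per-tick interpretation, a burst of three register writes would consume three sample periods before any delay was counted, which is the kind of drift that write-heavy PCM tracks expose.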

Early tests seem stable enough. Even write-heavy tracks like ones from The Terminator OST seem to work admirably.

Sorry for the delay on fixing this! I thought it couldn't be fixed.

Hey, at least @natarii was the best thing to come out of this issue, haha.

krsshtx commented 3 years ago

> I thought it couldn't be fixed.

Good thing you didn't give up. How about the YM2608?

krsshtx commented 3 years ago

From the other side...

Since I thought you had abandoned your project, and I had a device I really loved that wasn't working properly, I had no choice but to re-mix your code.

Your words "A project that can handle that OST would be Natalie's MegaGRRL player!" confused me a lot. Disappointed me a lot. The M68K/Z80 was able to do it, but an STM32 can't? Let's see...

I'm not a programmer, and I had no knowledge of the Arduino IDE or the STM32 at all. After a lot of building/uploading, and making an LPT YM2612 device to test with, I made a working build which played High Seas Havoc, Rocket Knight Adventures, The Terminator, and even Battletoads, which you'd mentioned as unplayable. It took a lot of time. Your code was:

if(waitSamples == 0 && !samplePlaying) { samplePlaying = true; waitSamples += parseVGM();

Mine:

while(waitSamples == 0 ) { waitSamples += parseVGM(); waitstate=false;

Yours:

case 0x52:
{
  uint8_t addr = readBuffer();
  uint8_t data = readBuffer();
  ym2612.Send(addr, data, 0);
}
return 1;
case 0x53:
{
  uint8_t addr = readBuffer();
  uint8_t data = readBuffer();
  ym2612.Send(addr, data, 1);
}
return 1;

Mine:

case 0x52:
{
  addr = readBuffer();
  data = readBuffer();
  ym2612.Read();
  ym2612.Send(addr, data, 0);
}
return 0;
case 0x53:
{
  addr = readBuffer();
  data = readBuffer();
  ym2612.Read();
  ym2612.Send(addr, data, 1);
}
return 0;

It was even necessary to write a ym_read function, because it became too fast.
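What `ym2612.Read()` does in that build is not shown here. One plausible reading is that it paces writes using the YM2612's status register, where bit 7 is the busy flag. A minimal sketch of that idea follows, using a made-up `MockYm2612` in place of real hardware; the `sendWhenReady` helper and its `maxPolls` timeout are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t kBusyFlag = 0x80;  // YM2612 status register, bit 7

// Hypothetical stand-in for the chip: reports busy for the next three status
// reads after every write, and asserts if it is written to while busy.
struct MockYm2612 {
    int busyReads = 0;
    int writes = 0;

    uint8_t readStatus() {
        if (busyReads > 0) { --busyReads; return kBusyFlag; }
        return 0x00;
    }
    void write(uint8_t /*addr*/, uint8_t /*data*/, bool /*port1*/) {
        assert(busyReads == 0);
        ++writes;
        busyReads = 3;
    }
};

// Polls the status register until the busy flag clears, then writes.
// Returns false if the chip still reports busy after maxPolls reads.
template <typename Chip>
bool sendWhenReady(Chip& chip, uint8_t addr, uint8_t data, bool port1,
                   int maxPolls = 1000) {
    for (int i = 0; i < maxPolls; ++i) {
        if ((chip.readStatus() & kBusyFlag) == 0) {
            chip.write(addr, data, port1);
            return true;
        }
    }
    return false;
}
```

Polling the flag lets back-to-back writes run as fast as the chip accepts them without hard-coding a worst-case delay, which matters once commands are no longer spaced one per sample.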

And now you've changed your mind, saying "@natarii was the best thing", and started to sell it.

I am very impressed not to say more.