aduros / wasm4

Build retro games using WebAssembly for a fantasy console.
https://wasm4.org
ISC License
1.14k stars 166 forks

Music sequencer #15

Open aduros opened 3 years ago

aduros commented 3 years ago

Music can be implemented right now by games calling tone() at the right times, but that involves writing a bunch of code to handle the time sequencing; it's a lot of work.

Let's build a music sequencer and a music authoring format into the runtime. The goal is a new music() method which takes something as a parameter to play music using the 4 channel audio system.

There are some approaches here for that something:

  1. A text based format similar to MML or Alda. Would be quickest to get up and running and it's really neat to program music in text. Might be too hard to use for anything meaningful though. We would also need to create an authoring environment.
  2. A binary format, exported from a graphical tool like BeepBox or FamiStudio. BeepBox looks great and is really easy to use, though it does require the right presets. We would need to write a new w4 command similar to png2src that generates the blob in source code.
  3. Use straight up MIDI?? I don't know enough about MIDI to know if this is feasible, but it seems like overkill. Tools for beginners seem lacking, but it might be ideal for pro musicians.

The music format should be usable for short sound effects too and not only background music. Maybe the function should be called something like track() or sequence() instead of music() to reflect that.

Right now I'm leaning towards BeepBox, and a w4 beepbox2src command that takes a beepbox.co URL or json export and spits out binary data in source code.

Packbat commented 3 years ago

Is it possible to make some kind of binary custom MML?

Like, for example, if pairs of hex digits represent individual MML instructions, that might let someone who wants to type everything into the source code directly do so, but keep data relatively compact for people exporting from external tools. Maybe 00-7f indicate pitches expressed as MIDI note numbers (3c = 60 = middle C) and 80-ff are control signals that let you modify the envelope and tone mode and indicate upcoming frequency slides.

aduros commented 3 years ago

I love that idea! We could provide macros/inline functions to make it easier to type into source code.

static unsigned char music[] = {
    VOLUME(50) | SUSTAIN(30) | DECAY(10) | BPM(160),
    C, D, E, F, G,
};
Packbat commented 3 years ago

For sure, yeah!

That said, I think a single byte isn't going to be able to record all the control modes? PICO-8 makes me think eight volume levels is near the minimum I'd want (3 bits), there are four WASM-4 channels (2 bits), the first two each have 4 tone modes (2 bits), BPMs ranging from under 40 (90 frames per quarter note) to over 200 (18 frames per quarter note) are all common (say, 5+ bits if we use intervals of 4 frames), note lengths from sixteenth notes to whole notes with dotted versions are all common (3+ bits) ... and we're already at 17 bits worth of parameters when we only have room for 7, and we haven't touched envelope parameters yet. Or dedicated control codes for rests and frequency glides.

I'm not sure what the most logical way to break it up is (probably a mix of 'this byte range specifies these' and 'this control code specifies what the next byte or bytes mean'), but thinking musically:

Packbat commented 3 years ago

I guess what I'm imagining right now is something along the lines of:

...plus room for expansion.

aduros commented 3 years ago

Thanks for the insightful details! That bit layout looks perfect!

If I understand correctly, this would be fully compatible with MML text mode? So the byte stream could contain either the single-byte control code to set the volume, or the equivalent string, e.g. "v75".

Maybe instead of storing sound channel in the MML, sequence() or whatever it's called would take both MML data and sound channel as inputs?

That sounds good to me. I researched a bit into how MML handles multiple channels but couldn't find anything even close to standard :shrug:

Are you aware of any MML editors? Being able to use existing tooling around this sort of thing would be nice.

Packbat commented 3 years ago

Thanks for the insightful details! That bit layout looks perfect!

Sure thing!

If I understand correctly, this would be fully compatible with MML text mode? So the byte stream could contain either the single-byte control code to set the volume, or the equivalent string, e.g. "v75".

I ... would assume not? I think any ASCII characters would be in the 0-127 range that would be interpreted as note pitches. I don't actually know enough about data structures to understand how this works - I thought static unsigned char music[] = { &c. meant that each entry represented one byte of data in the final binary, so I don't know how that'd mix with strings.

Maybe there could be a string mode? I don't know.

Are you aware of any MML editors? Being able to use existing tooling around this sort of thing would be nice.

Not personally, but when I was poking around looking for information on historical use of MML-formatted data, I found the VGMPF wiki page on it with a big list of examples. Maybe some of those would be good inspiration?

aduros commented 3 years ago

I ... would assume not? I think any ASCII characters would be in the 0-127 range that would be interpreted as note pitches.

I think this would be closer to your original idea of using ASCII for 0-127 and reserving 128-255 for control bytes. So the string can either contain the 3-byte sequence "v50" to set half volume, or a single byte 0b10vvv000 using the above bit layout. One is nicer to write in source code, the other is better for tool output.
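A parser for that dual encoding could look something like this (a rough sketch; the function name, percentage scaling, and treatment of the digits are illustrative, not a settled format):

```c
#include <ctype.h>
#include <stdint.h>

/* Accepts both encodings discussed above:
 * - ASCII "vNN" sets volume as a percentage (handy in hand-written source)
 * - a packed control byte 0b10vvv000 carries a 3-bit volume (handy for tools)
 * Returns the number of bytes consumed, or 0 if this isn't a volume command. */
static int parse_volume(const uint8_t *data, int len, int *out_volume) {
    if (len >= 1 && (data[0] & 0xC0) == 0x80) {
        /* packed form: map the 3-bit field 0-7 onto 0-100% */
        *out_volume = ((data[0] >> 3) & 0x07) * 100 / 7;
        return 1;
    }
    if (len >= 2 && data[0] == 'v') {
        int v = 0, i = 1;
        while (i < len && isdigit(data[i]))
            v = v * 10 + (data[i++] - '0');
        *out_volume = v;
        return i; /* 'v' plus the digits */
    }
    return 0;
}
```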

Anywho, just an idea, and probably premature optimization :smile: It would be nice to get an implementation going, MML or otherwise, and iterate from there.

Do you have any thoughts on BeepBox/FamiStudio?

Packbat commented 3 years ago

I think this would be closer to your original idea of using ASCII for 0-127 and reserving 128-255 for control bytes.

Oh crap - that wasn't my idea at all, actually. My idea was to use 0-127 as pitch numbers and not use ASCII at all. Like, v50 would not mean "set volume to 50", it would mean "note 118 (A#8, 7459 Hz), note 53 (F3, 175 Hz), note 48 (C3, 131 Hz)", because those three ASCII characters all represent numbers under 128.

That said:

Anywho, just an idea, and probably premature optimization :smile: It would be nice to get an implementation going, MML or otherwise, and iterate from there.

Definitely agree that this is premature optimization. Maybe - and I'm not the one doing any of the programming, so I'm probably the wrong person to ask - but maybe the best way to go after this is to start by making a version with string-based MML, experimenting with the parameters, and only turning it into full binary data later? That might let whoever is doing the coding make changes more quickly than if they were working strictly in binary, while still using specifications (like how many bits of precision are provided for each parameter) based on the future binary version.

Do you have any thoughts on BeepBox/FamiStudio?

I haven't tried either? Just looking at the cutesy nonsense names in the scales dropdown in BeepBox, I'm already profoundly uninterested in giving that a try, but I'll see what I think of FamiStudio and get back to you.

Packbat commented 3 years ago

Oh, question: would it make sense to have "start loop" and/or "end loop" control codes in the sequencer? I think there should probably be some way to implement looping background music; that might be a way to do it.

aduros commented 3 years ago

Looping sounds great to me.

In any case we'll need a way to stop a running music sequence from code, probably with another function.

maybe the best way to go after this is to start by making a version with string-based MML, experimenting with the parameters, and only turning it into full binary data later?

Agreed :smile:

Packbat commented 3 years ago

Another thought I had: how hard would it be to provide the source code for the music sequencer in a way that's straightforward for programmers to copy into their own code and modify? I feel like there'd be less pressure to add new features to the built-in sequencer functions if you make it easy for people to mod in tremolo or vibrato or whatever else themselves.

(Edit: Unrelatedly, I've started looking through the documentation for FamiStudio - I haven't written anything in it yet, but it definitely feels like a good choice to make an importer from. It even defaults to the same four channels - square, square, triangle, noise - which I imagine would make conversion a lot easier because there's no ambiguity about which channel means what.)

aduros commented 3 years ago

(Edit: Unrelatedly, I've started looking through the documentation for FamiStudio - I haven't written anything in it yet, but it definitely feels like a good choice to make an importer from. It even defaults to the same four channels - square, square, triangle, noise - which I imagine would make conversion a lot easier because there's no ambiguity about which channel means what.)

Yeah, WASM-4's sound system is basically the same as the NES minus the DMC channel, so it should map pretty well.

I'd urge you to give BeepBox a closer look... behind the cutesy names is a powerful but accessible tool.

Packbat commented 3 years ago

I'm sorry, I have to set a boundary here: don't urge me to do things that I've told you I'm not going to do.

In any case, my dealbreakers don't have to be your dealbreakers; I mostly brought the scales thing up as a reason not to standardize on BeepBox as the primary or only external tracker to import music from. I'm sure I'm not the only person who would have an issue with that - in fact, I know I wouldn't be, because I talked about it with some friends and they said it was "yipes" and "incredibly frustrating" and "patronizing as [expletive removed]" - and it's better if there are options that don't provoke those reactions.

Accessibility is a good thing, though - I don't have any particular suggestions on that front. FamiStudio is still pretty intimidating and I don't have a comfortable workflow in it the way I do with PICO-8's built-in tracker. (So much clicking...)

Packbat commented 3 years ago

Another thought, looking at FamiStudio: the idea of making patterns and stringing them together into sequences is probably transferable and would save space. Probably not something for the first test version of the sequencer, but maybe to add later?

aduros commented 3 years ago

Yeah, the concept of a bank of patterns is interesting and seems to be shared across different tools' export formats. Both FamiStudio and BeepBox's export formats look more or less like this:

  1. Instrument definitions: properties of a note (ADSR, duty cycle, volume)
  2. Pattern definitions: timing of individual notes
  3. A list of patterns: simple list of pattern indexes to string together to play a song.

That is to say, it shouldn't be hard to support multiple authoring tools.
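As a rough sketch, those three layers might map onto data structures like these (all field names and sizes are illustrative, not a proposed format):

```c
#include <stdint.h>

/* Layer 1: instrument definitions - properties of a note. */
typedef struct {
    uint8_t attack, decay, sustain, release; /* envelope */
    uint8_t duty_cycle;                      /* pulse channels only */
    uint8_t volume;
} Instrument;

/* Layer 2: pattern definitions - timing of individual notes. */
typedef struct {
    uint8_t instrument;   /* index into the instrument bank */
    const uint8_t *notes; /* e.g. (note, duration) pairs */
    uint16_t note_count;
} Pattern;

/* Layer 3: a song is just a list of pattern indexes to play in order. */
typedef struct {
    const Pattern *patterns; /* the pattern bank */
    const uint8_t *order;    /* pattern indexes, strung together */
    uint16_t order_len;
} Song;
```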

JerwuQu commented 2 years ago

Regarding implementation, are you thinking that this sequencer would be implemented in the "hardware", or as a small optional library bundled into carts which would use the existing tone(...)?

I'm asking mostly because I think the former would add unneeded complexity to the console itself.

It also raises the question of whether music(...) would be able to do things not currently possible with tone(...) (e.g. play notes not aligned to 60hz boundaries, or pan notes left/right). If not, there's no real reason it has to be implemented in the hardware/runtime, and leaving it out would keep the current simplicity.

For a more precise BPM when using tone(...), rather than making a separate system for music, there could be a more generic Timer that the developer could use to call a function after 10-1000 ms (for example), which would in turn also be useful for what the author of #24 mentions.

What are your thoughts? I find all of this very interesting.

aduros commented 2 years ago

My current plans are leaning toward designing some kind of simple music bytecode similar to tracker formats. Multiple source languages/tools could then compile to that bytecode.

This would be in the runtime itself as a convenience, and be implemented entirely using tone. This is similar to how we provide blit convenience methods, even though users can always implement their own rendering using the FRAMEBUFFER register. Advanced users can forgo music and write their own sequencer directly if they choose.

Normally I'm not keen on adding stuff to the runtime, but music is so common, and I think we can come up with a design that's minimal.

joshgoebel commented 2 years ago

Lots of thoughts above on the "how"; I just wanted to mention that this is [obviously] not a new space... there's lots of prior art we can look to for inspiration. Just a few things I've personally worked with:

Having a higher-level music API would indeed be a hugely welcome addition to the platform. At that point, would we be free of notes needing to be stuck to 60hz boundaries?

aduros commented 2 years ago

Thanks, those are useful links!

About 60hz boundaries, maybe, but I'm not thinking of this as adding any new functionality on top of the lower level tone.

sergeypdev commented 1 year ago

Hello, I recently started using wasm4 and got stuck implementing music. I want to share my thoughts about this.

I wish there was a memory mapped buffer where I could put instructions for the audio system ahead of time. This would bring 2 benefits:

JerwuQu commented 1 year ago

This is a pretty good idea! I had a similar one in #472 some time ago (the tone queue thing), but didn't go further than timing then.

The API command could be as easy as music(buffer, size) or music(buffer, count) depending on which makes more sense for the format.

As I see it this leaves us with multiple options:

1. Keep all the limitations of tone, with time and frequency aliasing

This could be implemented as simply as 4x 32-bit ints (16 bytes) repeating in memory, being the 4 parameters of tone. This doesn't really add anything of value though, since the developer could easily implement this themselves instead with no downside.

2. Improve upon the parameters currently given to tone

If going with precise timing, one option would be to make the current parameters in the tone buffer take millisecond durations rather than frame durations, for sub-frame timing. That leaves something to be desired in terms of range, though: the current tone uses only 8 bits for each duration, and 255 ms is not a lot of time, so this would lead to a lot of repeated commands for a single tone. That makes a case for larger values, and if we're creating a new interface, we might as well improve other parts as well.

One suggestion would be to split up each packed parameter into its own full-width parameter. The flags parameter is a bit unclear, since we're currently only using 6 bits of it, but I also don't know what could be added to it, so I'm sizing it at 1 byte here.

Format suggestions:

[start-frequency:f32][end-frequency:f32]
[attack-ms:u16][duration-ms:u16][decay-ms:u16][release-ms:u16]
[sustain:u8][volume:u8]
[flags:u8]

I also took the liberty of fixing some of the ADSR terms here. Note that this suggestion would increase the size from the 16-byte tone compatible version to 4*2+2*4+1*2+1 = 19 bytes. I see this as a fair trade-off for getting float frequencies (very desired, see #333 for another proposal), and millisecond level precision up to 65.5 seconds (also very desired as expressed in #472).

It could also be expressed in a simpler form without ADSR, since we're expecting the developer to have filled the buffer with some tool anyway.

[start-frequency:f32][end-frequency:f32]
[start-volume:u8][end-volume:u8]
[duration-ms:u16]
[flags:u8]

This is 4*2+1*2+2+1 = 13 bytes, while still being able to do everything the other one can, and it's the easiest to implement in the runtime; but it might end up taking more space in memory, since the user has to implement any ADSR-like features themselves. On the other hand, it leaves open the possibility of doing crazy things like sweeping frequencies for each step of an ADSR envelope.

I personally like this one the most because it's very straightforward what each parameter means, and it's the smallest.
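For illustration, the 13-byte layout above could be written as a packed C struct (sketch only; a real runtime would likely read fields straight out of the byte stream to sidestep alignment and endianness concerns):

```c
#include <assert.h>
#include <stdint.h>

/* The 13-byte command layout above, packed to remove padding. */
#pragma pack(push, 1)
typedef struct {
    float start_frequency;  /* f32 */
    float end_frequency;    /* f32 */
    uint8_t start_volume;
    uint8_t end_volume;
    uint16_t duration_ms;
    uint8_t flags;
} ToneCommand;
#pragma pack(pop)

/* 4*2 + 1*2 + 2 + 1 = 13 bytes, matching the layout above */
static_assert(sizeof(ToneCommand) == 13, "unexpected command size");
```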

3. Create a variable-sized stream format

Arguably more complex, but not completely crazy perhaps...? Suggestions welcome :)

I created one for w4on that was used for Journey to Entoris in the last jam, but it never got completely finished. That one uses a more musical approach, using notes rather than plain frequencies, and has an arpeggio command.

One example, following what was said above, would be to use the remaining two bits of flags to mean "no end-frequency" and "no end-volume" (or "has end-frequency" and "has end-volume").

[flags:u8]
[start-frequency:f32][end-frequency:?f32]
[start-volume:u8][end-volume:?u8]
[duration-ms:u16]

This would save some space and make the command size vary between 8 bytes and 13 bytes, while still remaining somewhat simple. The downside is simply the fact that it's a variable-sized command and adds a bit more complexity.
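A decoder for such a variable-sized command might look like this (a sketch; the flag bit positions are made up for illustration, since the proposal only reserves two spare bits for "has end-frequency" / "has end-volume"):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits - positions chosen arbitrarily here. */
#define FLAG_HAS_END_FREQ 0x40
#define FLAG_HAS_END_VOL  0x80

/* Decode one variable-sized command; returns bytes consumed (8 to 13). */
static int decode_command(const uint8_t *p) {
    const uint8_t *start = p;
    uint8_t flags = *p++;
    float start_freq, end_freq;
    memcpy(&start_freq, p, 4); p += 4;        /* [start-frequency:f32] */
    if (flags & FLAG_HAS_END_FREQ) {
        memcpy(&end_freq, p, 4); p += 4;      /* optional [end-frequency:f32] */
    } else {
        end_freq = start_freq;                /* constant pitch */
    }
    uint8_t start_vol = *p++;                 /* [start-volume:u8] */
    uint8_t end_vol = (flags & FLAG_HAS_END_VOL) ? *p++ : start_vol;
    uint16_t duration_ms = (uint16_t)(p[0] | (p[1] << 8)); p += 2;
    (void)end_freq; (void)start_vol; (void)end_vol; (void)duration_ms;
    return (int)(p - start);
}
```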

sergeypdev commented 1 year ago

@JerwuQu thank you.

I think 1 ms timing granularity might not be enough for some complex compositions; maybe timing could be a number of samples at a 44100 Hz rate, so 44100 = 1 second. Or maybe just floats where 1.0 = 1 second.

I didn't consider something when writing my initial comment: usually a memory-mapped buffer interface assumes you put commands in it each frame and it's cleared the next frame (similar to the framebuffer), but that would limit audio commands to being at most one frame long, which means tone can't be implemented easily without some "magic".

JerwuQu commented 1 year ago

True, 1 millisecond isn't perfect, but it'd give us over 16x as many usable BPMs (without any time aliasing) compared to the current frame-based limitations. I don't believe it would be a limitation in practice. A real example would be making your 140 BPM 4/4 song be 140.187 BPM instead, or your 149 BPM song be 148.515 or 150.0 BPM instead. Here's how this was calculated in w4on for the current timing system.

I suppose increasing the duration field to an f32, at 2 extra bytes, isn't too bad a trade-off to avoid these roundings, though. I'm personally against tying it to samples and feel that seconds/milliseconds would be better units.