Keytoyze / Mug-Diffusion

High-quality and controllable charting AI for rhythm games, modified from Stable Diffusion
MIT License

Intent to showcase more results and/or stats? #14

Open · mat100payette opened this issue 8 months ago

mat100payette commented 8 months ago

Hello there!

I was made aware of this cool project, good stuff. Quick question: I know you said it's mostly a project to satisfy your curiosity, but do you plan on showcasing more output from the tool? As a software engineer I'd be curious to examine the output you get before diving into understanding the codebase. Having like 10+ varied chart outputs that you consider representative of this tool's capabilities, and maybe also some stats on its performance along with the metrics it's evaluated on, would be super appreciated.

I understand if it's not something you planned though :)

Good luck with continuing this neat project!

Keytoyze commented 8 months ago

There are some videos on Bilibili demonstrating the outputs; you can take a look:

https://www.bilibili.com/video/BV1Gk4y1J7xN
https://www.bilibili.com/video/BV1r84y127Tu
https://www.bilibili.com/video/BV16s4y1A7kc
https://www.bilibili.com/video/BV1Ra4y1G7qR
https://www.bilibili.com/video/BV1sh411j7aa
https://www.bilibili.com/video/BV1es4y197wa
https://www.bilibili.com/video/BV13h41177wA

As for the metrics, in the image generation field we can use FID or IS, but it's difficult to apply them to charting. Feel free to share your ideas with me.

Thank you for your support for this project!

mat100payette commented 8 months ago

Hey thanks for the quick reply!

Allow me some time to study FD in general so I don't say nonsense. My background is in object detection, so I don't have all the GAN knowledge required right now. At first glance though, it seems like a promising metric if we were to figure out a representative way to encode a chart as some curve/distribution (uneducated guess lol).

In terms of results, I'm not entirely sure if you're simply limited by the state of the art in audio feature extraction. A GAN (or any type of ML model for that matter) can only do so much if the available audio features aren't accurate enough. One can easily see that the tool currently misses a lot of clear sounds that don't have a strong/loud attack, such as piano notes during a section where other instruments kinda share the same frequency space. There are other issues, such as consistency in what is layered, but I think that only matters once the problem of accurately pinpointing relevant notes is solved.

If you don't mind sparing me the search through the repo, what audio analysis tools does this project use to extract audio features?

Looking forward to the evolution of this ^^

EDIT: I see that you train a VAE first, which relates to what I said regarding the encoding of charts. A good first step toward exposing a more "defined" performance value (even if qualitative) would be to write an explanation of how you determined that your VAE reached a sufficient performance threshold (along with any metrics you've used).

Keytoyze commented 8 months ago

@mat100payette Thanks for your valuable suggestions!

> At first glance though, it seems like a promising metric if we were to figure out a representative way to encode a chart as some curve/distribution (uneducated guess lol). A good first step toward exposing a more "defined" performance value (even if qualitative) would be to write an explanation of how you determined that your VAE reached a sufficient performance threshold (along with any metrics you've used).

The current training loss is the denoising loss, i.e. the ability of the model to recover the real VAE representation from noise. Maybe it can serve as a metric?
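
For reference, this is roughly what I mean by the denoising loss, assuming the standard epsilon-prediction diffusion objective on the VAE latents (the names below are illustrative, not the actual code in this repo):

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, z0, cond, alphas_cumprod):
    """Standard epsilon-prediction diffusion loss on VAE chart latents.

    z0:             clean chart latents from the VAE encoder, shape (B, C, T)
    cond:           conditioning (audio features + prompt embedding)
    alphas_cumprod: cumulative product of the noise schedule, shape (num_steps,)
    """
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    noise = torch.randn_like(z0)

    # mix the clean latent with Gaussian noise according to the schedule
    a = alphas_cumprod[t].view(b, *([1] * (z0.dim() - 1)))
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise

    # the model tries to recover the injected noise from the noisy latent
    pred = model(z_t, t, cond)
    return F.mse_loss(pred, noise)
```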

> what audio analysis tools does this project use to extract audio features?

I used a mel spectrogram (https://github.com/Keytoyze/Mug-Diffusion/blob/master/mug/util.py#L133) with n_mels=128 to extract audio features. It appears to be a common technique in speech recognition and audio processing.
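
In essence it's something like the following (a simplified sketch using librosa; the sample rate and hop length here are just librosa defaults, not necessarily what the repo uses):

```python
import librosa
import numpy as np

def extract_audio_features(path, n_mels=128, sr=22050, hop_length=512):
    """Compute a log-scaled mel spectrogram from an audio file."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, hop_length=hop_length
    )
    # convert power to dB so quieter content isn't completely crushed
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)
```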

Currently I'm training a second-generation model designed to address these problems. I think there may be two possible reasons:

I'd love to hear your comments or suggestions for improvements XD

mat100payette commented 8 months ago

I'll throw ideas/concerns your way; feel free to consider only the ones you think make sense, if any.

1- First of all, I personally think that a simple mel spectrogram, although popular in the state of the art, is still far from accurate enough for all the musical variance encountered in rhythm gaming; it's solid for simpler audio but that's it. Have you read this paper about BSRNN? From what I gathered, it significantly outperforms Demucs on the same datasets. However, if I understood correctly, it requires expert fine-tuning of the frequency bands on a per-song basis. Maybe there's an opportunity to automate an optimization of that band selection per input. Also, here's a snippet from the paper which seems to be in line with your plans:

> Data Preprocessing: Similar to existing works, we use the MUSDB18-HQ dataset [42] for all experiments. During the preparation of the training data, we apply a source activity detector (SAD) to remove the silent regions in the sound tracks and only keep the salient ones for data mixing. Although any existing SAD systems can be directly applied, here we introduce a simple unsupervised energy-based thresholding method to select salient segments from a full track.

While I personally think it might be detrimental in a charting domain (because a lot of the time some pretty quiet sections are still relevant to chart), maybe this is an indicator that your approach could be right. Not sure. (I've put a rough sketch of such an energy-based SAD at the end of this comment.)

2- I've seen some generated charts (with this tool) from people in the community I'm in, and while the results are probably the best of any chart generation we've seen so far, none of what we generated came close to being remotely usable/salvageable. My analysis of the generated charts is that there is a clear lack of consistency in how a given instrument/sound is charted, which results in a chart that everyone perceives as "mostly random". Unfortunately, I'm not knowledgeable enough to say with confidence whether this is an unavoidable result of the denoising approach or whether something can be done about it. My guess is that there probably needs to be some step in the pipeline that selects/groups what sounds are layered, and charts those in a consistent manner.

3- As for the metrics, because of the nature of VAEs you might just have to go with whatever loss you use, yeah. I don't know what that is in your denoising, but showing the numbers and explaining your interpretation of them given the charting domain would be helpful for further studies ^^
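
Going back to point 1, here's a rough sketch of what I imagine the unsupervised energy-based SAD from that quote looks like (my own guess, not the paper's actual implementation; the threshold is arbitrary):

```python
import numpy as np

def salient_frame_mask(y, frame_len=2048, hop=512, db_threshold=-40.0):
    """Mark frames whose RMS energy, relative to the loudest frame,
    is above `db_threshold`; everything else is treated as silence."""
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = y[i * hop:i * hop + frame_len]
        rms[i] = np.sqrt(np.mean(frame ** 2) + 1e-12)

    # energy in dB relative to the loudest frame
    db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    return db > db_threshold
```

For charting you'd probably want a much more permissive threshold (or to skip the SAD entirely), since quiet sections still matter.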

Keytoyze commented 8 months ago

Wow, thank you very much for your comment!! I will read the BSRNN paper.

As for the consistency, I suspect that the model does not grasp the overall song well: mappers may use different patterns for the same instrument/sound in different charts, and the model may mix those different patterns within a single chart. I'll try to make the model learn to chart better from an overall perspective. I think your idea about layering the sounds makes sense, but I wonder how to layer the charts in the dataset in an unsupervised way to generate the corresponding training data? Or do you just want to use this pipeline during the inference stage?

For the metrics, maybe some snapping correctness could be used, e.g. the ratio of missing/overmapped notes, or the global offset of the chart. What do you think?
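
Something along these lines, maybe (a rough sketch; the 20 ms tolerance is just a placeholder):

```python
import numpy as np

def snapping_metrics(pred_times, true_times, tol=0.020):
    """Compare predicted note times with ground truth (both in seconds).

    Returns the missing-note ratio, the overmapped-note ratio, and the
    mean signed offset of the matched notes. `tol` is the matching window.
    """
    pred = np.sort(np.asarray(pred_times, dtype=float))
    true = np.sort(np.asarray(true_times, dtype=float))
    used = np.zeros(len(pred), dtype=bool)
    offsets = []

    for t in true:
        if len(pred) == 0:
            break
        i = int(np.argmin(np.abs(pred - t)))   # nearest predicted note
        if not used[i] and abs(pred[i] - t) <= tol:
            used[i] = True
            offsets.append(pred[i] - t)

    missing_ratio = 1.0 - len(offsets) / max(len(true), 1)
    overmap_ratio = (len(pred) - used.sum()) / max(len(pred), 1)
    global_offset = float(np.mean(offsets)) if offsets else 0.0
    return missing_ratio, overmap_ratio, global_offset
```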

mat100payette commented 8 months ago

I'm really unsure about how to approach same-sound consistency on a per-chart basis. I think it's heavily dependent on what input you're able to feed it, and that's limited to the trained VAE's output. I have some ideas, but I don't yet fully understand how you integrate the conditions (audio + prompt + noise level) into the denoising training. If you don't mind, could you please try to explain that process to me? Without that, I can't really provide any insight on the loss function.


Now, for the metrics (specifically the VAE's performance), I think what you suggest could work. Keep in mind that if you don't analyze your VAE's output independently (i.e. not combined with the denoising), you're most likely pushing too much unpredictable randomness into the denoising training.

A popular metric in object detection is mAP, a single value that basically tells you "on average, did I detect relevant things, and how close to the real objects were the detections?". It might be possible to apply that same logic to the VAE if you consider the ground-truth notes as objects to detect (in this case, to decode), with the quality of the decoding being "how close to the real note was I". In object detection you consider a "good detection" to be one that has a big enough IoU (given a manually chosen threshold). In a chart's space, you'd simply ensure that the decoded note is within some time tolerance t of the real note. You'd also have to factor in the note type though, which introduces classification too. To evaluate both localization and classification, a tool called TIDE is generally used nowadays. Here are some images to give you a good idea of what it looks for:

Error types: (image from the article linked below)

Plot of the error type distribution: (image from the article linked below)

Source: https://towardsdatascience.com/a-better-map-for-object-detection-32662767d424

As you can see, all of these error types would be applicable to your VAE, except they'd be in a 1D space instead of 2D, which makes them even easier to define.

I think that if you manage to get this kind of information for your VAE, you'll have a much deeper understanding of the type and amplitude of noise your encoding pushes into your denoising training. For example, if the VAE has a high amount of dupe errors, you'll know it decodes duplicate notes close to each other instead of the single real note; if it has a lot of bkgd errors (decoding a note that doesn't exist), you'll know it creates random ghost notes during decoding, etc.
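
To make that more concrete, here's a rough sketch of how those error buckets could translate to decoded notes in 1D (everything here is hypothetical: the names, the tolerances, and the (time, column, type) representation):

```python
from collections import Counter

def classify_decode_errors(pred, truth, t_fg=0.010, t_bg=0.050):
    """TIDE-inspired error buckets for decoded notes on the time axis.

    pred / truth: lists of (time_in_seconds, column, note_type) tuples.
    Within t_fg of a real note in the same column -> correct / cls / dupe;
    between t_fg and t_bg -> localization error; beyond t_bg -> ghost note.
    """
    errors = Counter()
    matched = set()  # indices of ground-truth notes already matched

    for p_time, p_col, p_type in pred:
        candidates = [(abs(t_time - p_time), j, t_type)
                      for j, (t_time, t_col, t_type) in enumerate(truth)
                      if t_col == p_col]
        if not candidates:
            errors["bkgd"] += 1               # nothing real in this column at all
            continue
        dist, j, t_type = min(candidates)     # closest real note in the column
        if dist <= t_fg and j in matched:
            errors["dupe"] += 1               # duplicate decode of a matched note
        elif dist <= t_fg and p_type != t_type:
            errors["cls"] += 1                # right place, wrong note type
            matched.add(j)
        elif dist <= t_fg:
            errors["correct"] += 1
            matched.add(j)
        elif dist <= t_bg:
            errors["loc"] += 1                # near a real note, outside tolerance
        else:
            errors["bkgd"] += 1               # ghost note far from anything real
    errors["missed"] = len(truth) - len(matched)
    return errors
```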


Hopefully this is helpful to you! I don't know if this project is in the context of your PhD or if it's a side project, so maybe you have time constraints I'm unaware of. Regardless, do let me know if/when some things are out of scope for you ^^

mat100payette commented 8 months ago

Sure thing, although it's nice for other people to be able to follow the discussion :) I don't mind chatting on Discord if you want, though.

Keytoyze commented 7 months ago

@mat100payette I have read the BSRNN paper. I found the core idea is similar to the mel spectrogram: split the spectrum into more bands at low frequencies and fewer bands at high frequencies. BSRNN uses a separate model (Norm + MLP) to extract features in each band. As for transferring it to AI charting, I have two concerns:

  1. The feature extractor for each band in BSRNN outputs N=128 features, so the total feature count for each timestamp is N*K, where K is the number of bands. Such a large feature is difficult for charting, since BSRNN only processes about 6 seconds of audio while we have to process several minutes. I plan to reduce N to around 4, which gives a computational complexity similar to the mel spectrogram (see the sketch below), but I don't know whether this leads to reduced performance.

  2. BSRNN proposes 7 versions of band-splitting strategies, as well as special strategies for Bass, Drum and Other. Which strategy do you think fits charting best?
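
For concern 1, this is roughly the kind of band-split front end I have in mind, with N as a parameter (a simplified sketch on a magnitude spectrogram, not the original BSRNN code; the band edges and K below are made up):

```python
import torch
import torch.nn as nn

class BandSplitFeatures(nn.Module):
    """Simplified BSRNN-style band-split front end.

    Splits each spectrogram frame into K frequency bands and maps each band
    to N features with its own LayerNorm + Linear, giving N*K features per
    frame. With N=128 and K=30 that is 3840 features per frame; with N=4 it
    is only 120, comparable to a 128-bin mel spectrogram.
    """

    def __init__(self, band_edges, n_features=4):
        super().__init__()
        self.bands = list(zip(band_edges[:-1], band_edges[1:]))
        self.norms = nn.ModuleList([nn.LayerNorm(hi - lo) for lo, hi in self.bands])
        self.mlps = nn.ModuleList([nn.Linear(hi - lo, n_features) for lo, hi in self.bands])

    def forward(self, spec):
        # spec: (batch, frames, freq_bins) magnitude spectrogram
        feats = [mlp(norm(spec[..., lo:hi]))
                 for (lo, hi), norm, mlp in zip(self.bands, self.norms, self.mlps)]
        return torch.cat(feats, dim=-1)  # (batch, frames, N * K)

# e.g. a made-up 4-band split of 513 STFT bins with N=4 -> 16 features per frame
features = BandSplitFeatures(band_edges=[0, 64, 160, 320, 513], n_features=4)
```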