GrandaddyShmax / audiocraft_plus

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License
561 stars 63 forks

Documentation. #45

Open jasonsprouse opened 1 year ago

jasonsprouse commented 1 year ago

Could we please get a wiki or some documentation for the functionality of audiocraft_plus? This has just been integrated into @RunDiffusion, and they have some documentation in a tab of your UI, but I would like a reference on your GitHub page for the options and tools in the app.

MusicGen Tab

[Generate (button)]:
Generates the music with the given settings and prompts.
[Interrupt (button)]:
Stops the music generation as soon as it can, providing an incomplete output.
Generation Tab:
Structure Prompts:
This feature helps reduce repetitive prompts by allowing you to set global prompts
that will be used for all prompt segments.

[Structure Prompts (checkbox)]:
Enable/Disable the structure prompts feature.
[BPM (number)]:
Beats per minute of the generated music.
[Key (dropdown)]:
The key of the generated music.
[Scale (dropdown)]:
The scale of the generated music.
[Global Prompt (text)]:
Here write the prompt that you wish to be used for all prompt segments.
Multi-Prompt:
This feature allows you to control the music, adding variation to different time segments.
You have up to 10 prompt segments. The first prompt will always be 30s long;
the other prompts will be [30s - overlap].
For example, if the overlap is 10s, each prompt segment after the first will be 20s.
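The timing rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the app's actual code: the function name `segment_timings` and its parameters are made up for this example, and the 30s window is taken from the description above.

```python
def segment_timings(num_segments, overlap, window=30):
    """Return (start, end) in seconds for each prompt segment.

    Hypothetical helper: the first segment fills the whole model window,
    every later segment adds only (window - overlap) seconds.
    """
    if num_segments < 1:
        raise ValueError("need at least one prompt segment")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than the window")

    timings = []
    start = 0
    for index in range(num_segments):
        length = window if index == 0 else window - overlap
        timings.append((start, start + length))
        start += length
    return timings


print(segment_timings(3, overlap=10))  # [(0, 30), (30, 50), (50, 70)]
```

With a 10s overlap, three prompt segments cover 70s in total: 30s for the first and 20s for each of the other two.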

[Prompt Segments (number)]:
Number of unique prompts to use throughout the music generation.
[Prompt/Input Text (prompt)]:
Here describe the music you wish the model to generate.
[Repeat (number)]:
Write how many times this prompt will repeat (instead of wasting another prompt segment on the same prompt).
[Time (text)]:
The time of the prompt segment.
[Calculate Timings (button)]:
Calculates the timings of the prompt segments.
[Duration (number)]:
How long you want the generated music to be (in seconds).
[Overlap (number)]:
How much each new segment will reference the previous segment (in seconds).
For example, if you choose 20s: each new segment after the first one will reference the last 20s of the previous segment
and will generate only 10s of new music. The model can only process 30s of music at a time.
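One way to picture the overlap is as a sliding 30s window: each pass after the first reuses the tail of what was already generated as context and appends new audio after it. The sketch below only plans those passes. `plan_passes` is a hypothetical name, and the real app may schedule its passes differently.

```python
def plan_passes(duration, overlap, window=30):
    """Plan generation passes as (context_start, new_start, end) in seconds.

    Hypothetical helper: audio in [context_start, new_start) is reused from
    the previous pass, audio in [new_start, end) is newly generated.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than the window")

    end = min(window, duration)
    passes = [(0, 0, end)]        # first pass has no previous audio to reuse
    step = window - overlap       # seconds of new audio per later pass
    while end < duration:
        new_start = end
        end = min(end + step, duration)
        passes.append((new_start - overlap, new_start, end))
    return passes


for context_start, new_start, end in plan_passes(60, overlap=20):
    print(f"context {context_start}-{new_start}s, new audio {new_start}-{end}s")
```

For a 60s duration with a 20s overlap this gives four passes: one full 30s pass, then three passes that each add 10s of new audio.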
[Seed (number)]:
The ID of your generated music. If you wish to generate the exact same music again,
use the exact seed with the exact prompts
(this way you can also extend a specific song that was generated too short).
[Random Seed (button)]:
Gives "-1" as a seed, which counts as a random seed.
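The seed behaviour described above can be illustrated with Python's standard `random` module. This is a sketch, not the app's code: `resolve_seed`, `fake_generate` and the 32-bit upper bound are assumptions made for the example, and the "tokens" are just random integers standing in for model output.

```python
import random

MAX_SEED = 2**32 - 1  # assumed upper bound, chosen only for this sketch


def resolve_seed(seed):
    """Return the seed to use: a fresh random one for -1, else the given one."""
    if seed == -1:
        return random.randint(0, MAX_SEED)
    return seed


def fake_generate(seed, steps=5):
    """Stand-in for generation: the same seed always yields the same values."""
    rng = random.Random(resolve_seed(seed))
    return [rng.randrange(2048) for _ in range(steps)]


# Same seed, same output; -1 picks a new seed on every call.
print(fake_generate(1234) == fake_generate(1234))  # True
```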
[Copy Previous Seed (button)]:
Copies the seed from the output seed (if you don't feel like doing it manually).
Audio Tab:
[Input Type (selection)]:
File mode allows you to upload an audio file to use as input
Mic mode allows you to use your microphone as input
[Input Audio Mode (selection)]:
Melody mode only works with the melody model: it conditions the music generation to reference the melody
Sample mode works with any model: it gives a music sample to the model to generate its continuation.
[Trim Start and Trim End (numbers)]:
Trim Start sets how much you'd like to trim the input audio from the start
Trim End does the same as the above, but from the end
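Trimming amounts to slicing the sample array. The sketch below assumes the trim values are given in seconds (the text above does not state the unit) and treats audio as a plain list of samples. `trim_audio` is a hypothetical helper, not a function from the app.

```python
def trim_audio(samples, sample_rate, trim_start=0.0, trim_end=0.0):
    """Drop trim_start seconds from the front and trim_end seconds from the back.

    Assumes trim values are in seconds; returns an empty list if the two
    trims together remove everything.
    """
    if trim_start < 0 or trim_end < 0:
        raise ValueError("trim values must not be negative")
    start = int(round(trim_start * sample_rate))
    stop = len(samples) - int(round(trim_end * sample_rate))
    return samples[start:max(start, stop)]


# Ten samples at 2 samples per second: cut 1s from the start, 2s from the end.
print(trim_audio(list(range(10)), sample_rate=2, trim_start=1, trim_end=2))  # [2, 3, 4, 5]
```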
[Input Audio (audio file)]:
Input here the audio you wish to use with "melody" or "sample" mode.
Customization Tab:
[Background Color (color)]:
Works only if you don't upload an image. Color of the background of the waveform.
[Bar Color Start (color)]:
First color of the waveform bars.
[Bar Color End (color)]:
Second color of the waveform bars.
[Background Image (image)]:
Background image that you wish to be attached to the generated video along with the waveform.
[Height and Width (numbers)]:
Output video resolution; only works with an image
(the minimum height and width are 256).
Settings Tab:
[Output Audio Channels (selection)]:
With this you can select the number of channels that you wish for your output audio.
mono is straightforward single-channel audio
stereo is dual-channel audio, but it will sound more or less like mono
stereo effect is also dual channel, but uses tricks to simulate stereo audio.
[Output Audio Sample Rate (dropdown)]:
The output audio sample rate, the model default is 32000.
[Model (selection)]:
Here you can choose which model you wish to use:
melody model is based on the medium model with a unique feature that lets you use melody conditioning
small model has 300M parameters
medium model has 1.5B parameters
large model has 3.3B parameters
custom model runs the custom model that you provided.
[Custom Model (selection)]:
This dropdown will show you models that are placed in the models folder;
you must select custom in the model options in order to use one.
[Refresh (button)]:
Refreshes the dropdown list for custom model.
[Decoder (selection)]:
Choose here the decoder that you wish to use:
Default is the default decoder
MultiBand_Diffusion is a decoder that uses diffusion to generate the audio.
[Top-k (number)]:
is a parameter used in text generation models, including music generation models. It determines the number of most likely next tokens to consider at each step of the generation process. The model ranks all possible tokens based on their predicted probabilities, and then selects the top-k tokens from the ranked list. The model then samples from this reduced set of tokens to determine the next token in the generated sequence. A smaller value of k results in a more focused and deterministic output, while a larger value of k allows for more diversity in the generated music.
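A minimal pure-Python sketch of top-k filtering over a toy probability list is shown here. It is not the model's implementation; the function names and the four-token distribution are invented for illustration.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = ranked[:k]
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


def sample_token(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_probs = [0.1, 0.5, 0.3, 0.1]        # made-up distribution over 4 tokens
rng = random.Random(0)
narrow = top_k_filter(toy_probs, k=1)   # only the most likely token survives
wide = top_k_filter(toy_probs, k=3)     # three candidates remain
print(sample_token(narrow, rng))        # always 1, the most likely token
print(sample_token(wide, rng))          # one of 0, 1 or 2
```

With k=1 the choice is fully deterministic; raising k lets less likely tokens through, which is where the extra diversity comes from.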
[Top-p (number)]:
also known as nucleus sampling or probabilistic sampling, is another method used for token selection during text generation. Instead of specifying a fixed number like top-k, top-p considers the cumulative probability distribution of the ranked tokens. It selects the smallest possible set of tokens whose cumulative probability exceeds a certain threshold (usually denoted as p). The model then samples from this set to choose the next token. This approach ensures that the generated output maintains a balance between diversity and coherence, as it allows for a varying number of tokens to be considered based on their probabilities.
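Top-p filtering can be sketched the same way. In this illustration tokens are added in order of probability until the running total reaches p; the function name and the toy distribution are assumptions for the example, not the model's code.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = []
    cumulative = 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / total
    return filtered


toy_probs = [0.1, 0.5, 0.3, 0.1]  # made-up distribution over 4 tokens
for p in (0.4, 0.7, 0.85):
    kept = sum(1 for x in top_p_filter(toy_probs, p) if x > 0)
    print(f"p={p}: {kept} token(s) kept")
```

Unlike top-k, the number of surviving tokens depends on the shape of the distribution: a confident model keeps few candidates, an uncertain one keeps many.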
[Temperature (number)]:
is a parameter that controls the randomness of the generated output. It is applied during the sampling process, where a higher temperature value results in more random and diverse outputs, while a lower temperature value leads to more deterministic and focused outputs. In the context of music generation, a higher temperature can introduce more variability and creativity into the generated music, but it may also lead to less coherent or structured compositions. On the other hand, a lower temperature can produce more repetitive and predictable music.
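The effect of temperature is easiest to see on a softmax over a few toy scores. In the sketch below the logits are divided by the temperature before normalizing, which is the common formulation; the function name and the example values are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]                 # made-up scores for 3 tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it more evenly across the alternatives.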
[Classifier Free Guidance (number)]:
is a technique for steering generation toward the prompt without training a separate classifier. At each step the model makes two predictions, one conditioned on the prompt and one with the prompt dropped, and the two are combined so that the result is pushed away from the unconditional prediction and toward the conditioned one. The number you set here controls how strongly that push is applied: higher values make the generated music follow the prompt more closely, usually at the cost of some diversity, while lower values give the model more freedom.
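In the usual formulation of classifier-free guidance, no separate classifier is involved: the model's prompt-conditioned and unconditioned predictions are blended with a guidance scale. The sketch below shows that blend on toy logits. `apply_cfg` and the numbers are made up for illustration, and the app's internals may differ in detail.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional logits.

    guidance_scale = 0 ignores the prompt, 1 uses the conditional prediction
    as is, and larger values push further in the prompt's direction.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit lists must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


cond = [2.0, 0.0]     # toy prediction with the prompt
uncond = [1.0, 1.0]   # toy prediction without the prompt
print(apply_cfg(cond, uncond, 1.0))  # [2.0, 0.0]
print(apply_cfg(cond, uncond, 3.0))  # [4.0, -2.0]
```

At scale 3.0 the gap between the two tokens is three times what the prompt alone produced, so the token the prompt favours becomes much more likely after softmax.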

AudioGen Tab

[Generate (button)]:
Generates the audio with the given settings and prompts.
[Interrupt (button)]:
Stops the audio generation as soon as it can, providing an incomplete output.
Generation Tab:
Structure Prompts:
This feature helps reduce repetitive prompts by allowing you to set global prompts
that will be used for all prompt segments.

[Structure Prompts (checkbox)]:
Enable/Disable the structure prompts feature.
[Global Prompt (text)]:
Here write the prompt that you wish to be used for all prompt segments.
Multi-Prompt:
This feature allows you to control the audio, adding variation to different time segments.
You have up to 10 prompt segments. The first prompt will always be 10s long;
the other prompts will be [10s - overlap].
For example, if the overlap is 2s, each prompt segment after the first will be 8s.

[Prompt Segments (number)]:
Number of unique prompts to use throughout the audio generation.
[Prompt/Input Text (prompt)]:
Here describe the audio you wish the model to generate.
[Repeat (number)]:
Write how many times this prompt will repeat (instead of wasting another prompt segment on the same prompt).
[Time (text)]:
The time of the prompt segment.
[Calculate Timings (button)]:
Calculates the timings of the prompt segments.
[Duration (number)]:
How long you want the generated audio to be (in seconds).
[Overlap (number)]:
How much each new segment will reference the previous segment (in seconds).
For example, if you choose 2s: each new segment after the first one will reference the last 2s of the previous segment
and will generate only 8s of new audio. The model can only process 10s of audio at a time.
[Seed (number)]:
The ID of your generated audio. If you wish to generate the exact same audio again,
use the exact seed with the exact prompts
(this way you can also extend a specific clip that was generated too short).
[Random Seed (button)]:
Gives "-1" as a seed, which counts as a random seed.
[Copy Previous Seed (button)]:
Copies the seed from the output seed (if you don't feel like doing it manually).
Audio Tab:
[Input Type (selection)]:
File mode allows you to upload an audio file to use as input
Mic mode allows you to use your microphone as input
[Trim Start and Trim End (numbers)]:
Trim Start sets how much you'd like to trim the input audio from the start
Trim End does the same as the above, but from the end
[Input Audio (audio file)]:
Input here the audio you wish to use.
Customization Tab:
[Background Color (color)]:
Works only if you don't upload an image. Color of the background of the waveform.
[Bar Color Start (color)]:
First color of the waveform bars.
[Bar Color End (color)]:
Second color of the waveform bars.
[Background Image (image)]:
Background image that you wish to be attached to the generated video along with the waveform.
[Height and Width (numbers)]:
Output video resolution; only works with an image
(the minimum height and width are 256).
Settings Tab:
[Output Audio Channels (selection)]:
With this you can select the number of channels that you wish for your output audio.
mono is straightforward single-channel audio
stereo is dual-channel audio, but it will sound more or less like mono
stereo effect is also dual channel, but uses tricks to simulate stereo audio.
[Output Audio Sample Rate (dropdown)]:
The output audio sample rate, the model default is 32000.
[Top-k (number)]:
is a parameter used in text generation models, including music generation models. It determines the number of most likely next tokens to consider at each step of the generation process. The model ranks all possible tokens based on their predicted probabilities, and then selects the top-k tokens from the ranked list. The model then samples from this reduced set of tokens to determine the next token in the generated sequence. A smaller value of k results in a more focused and deterministic output, while a larger value of k allows for more diversity in the generated music.
[Top-p (number)]:
also known as nucleus sampling or probabilistic sampling, is another method used for token selection during text generation. Instead of specifying a fixed number like top-k, top-p considers the cumulative probability distribution of the ranked tokens. It selects the smallest possible set of tokens whose cumulative probability exceeds a certain threshold (usually denoted as p). The model then samples from this set to choose the next token. This approach ensures that the generated output maintains a balance between diversity and coherence, as it allows for a varying number of tokens to be considered based on their probabilities.
[Temperature (number)]:
is a parameter that controls the randomness of the generated output. It is applied during the sampling process, where a higher temperature value results in more random and diverse outputs, while a lower temperature value leads to more deterministic and focused outputs. In the context of music generation, a higher temperature can introduce more variability and creativity into the generated music, but it may also lead to less coherent or structured compositions. On the other hand, a lower temperature can produce more repetitive and predictable music.
[Classifier Free Guidance (number)]:
is a technique for steering generation toward the prompt without training a separate classifier. At each step the model makes two predictions, one conditioned on the prompt and one with the prompt dropped, and the two are combined so that the result is pushed away from the unconditional prediction and toward the conditioned one. The number you set here controls how strongly that push is applied: higher values make the generated audio follow the prompt more closely, usually at the cost of some diversity, while lower values give the model more freedom.
rundiffusion commented 1 year ago

@jasonsprouse Thanks for the ping! Yeah! We love AudioCraft! We have it in alpha right now on RunDiffusion.com (along with many other open source apps). We'd love a mention on your GitHub that we have running! Lastly, is there anything we can do for you to help with an exciting rollout?

jasonsprouse commented 1 year ago

@rundiffusion Hey, no problem! Like what y'all are doing. Free marketing!!!!!

We'd love a mention on your GitHub that we have running!

On my github?

Lastly, is there anything we can do for you to help with an exciting rollout?

Exploring different AI and blockchain tech. Not really working on the kind of product y'all host. I am working on something pretty exceptional though. Thanks for asking.

@GrandaddyShmax If we could get a wiki documentation on your github to reference that'd be awesome. Keeps from jumping tabs or whatnot.

GrandaddyShmax commented 1 year ago

Sure thing, I did not think about this option before. I thought that the wiki would be most handy in the same webui.

rundiffusion commented 1 year ago

@jasonsprouse @GrandaddyShmax I messed up on following up with you guys. I'm sorry! AudioCraft Plus is officially launched on RunDiffusion!

Yes! Free marketing! We launched Enfugue (app) last week and boosted their traffic and downloads 33%. Plus lots of people who don't have access to GPUs can just try out audiocraft without any hassle.

Where else can we announce the collaboration?

What I meant by the GitHub thing was that it would be cool to put us in the Readme.md on the main repo landing page.

Excited to chat more.