HilaManor / AudioEditingCode

https://hilamanor.github.io/AudioEditing/

No effective results #7

Open hiroki1953 opened 2 months ago

hiroki1953 commented 2 months ago

Nice to meet you. I was very interested in this research and tried it myself.

I ran the following two commands, but the output shows no significant change from the input audio. (I want to change the melody.)

I have read the source code and tried a few ideas, but what argument values should I set to get better results?

Please give me some advice.

python main_pc_extract_inv.py --source_prompt "A high quality recording of wind instruments and strings playing. " --target_neg_prompt "low quality" --init_aud "../sample_audio/MDDBBeethoven.wav" --model_id "cvssp/audioldm2-music" --results_path "../result" --n_evs 3

python apply_drift.py --extraction_path "../result/audioldm2-music/MDDBBeethoven/pmt_A_high_quality_recording_of_wind_instruments_and_strings_playing. __neg__low_quality/sNone_pc-both_cfgd3_driftNone-None_it50_c1.0e-03_1723454811.pt" --drift_start 0 --drift_end 50 --amount 1 --evs 1 2 3 --combine_evs

HilaManor commented 2 months ago

Hi, thanks for the issue. In diffusion, the generation process starts from a high timestep (e.g., 200) and ends at 0. Choosing a starting timestep (drift_start, 0 in your example) that is lower than the ending timestep (drift_end, 50) results in no editing at all.

Try swapping the 0 and the 50 in your apply_drift.py command. Then you should start hearing some changes :)
To control how much changes, try adjusting the timestep range; e.g., 150->50 would change more.

I'll add a value check to raise an error to prevent people getting confused by this, thanks!
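
A minimal sketch of what such a check could look like, assuming the two values arrive as plain integers. The function name `check_drift_range` and the error messages are hypothetical and are not taken from the repository's code.

```python
def check_drift_range(drift_start: int, drift_end: int) -> None:
    """Raise if the drift range runs in the wrong direction.

    Hypothetical helper: sampling walks from a high timestep down to 0,
    so the range has to start at the higher timestep.
    """
    if drift_start < 0 or drift_end < 0:
        raise ValueError("timesteps must be non-negative")
    if drift_start < drift_end:
        raise ValueError(
            f"drift_start ({drift_start}) is lower than drift_end ({drift_end}); "
            "no timestep falls in the range, so nothing would be edited. "
            "Swap the two values."
        )


check_drift_range(50, 0)  # accepted: the range runs from 50 down to 0
try:
    check_drift_range(0, 50)  # the ordering used in the original command
except ValueError as err:
    print(err)
```

Failing early with a message that names both values makes the mistake visible before any audio is generated.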

Edit: added in commit c369ad3

hiroki1953 commented 2 months ago

Thank you. After various trials, I was able to reproduce the edited melody. I have one question: what is the appropriate value for the amount? When I tried it several times with 1.0, I didn't see any significant changes, but when I raised it to 100, I saw a significant change.

HilaManor commented 2 months ago

Since it's an unsupervised method, there is no "appropriate" value, and finding a value you are satisfied with is a bit of a trial-and-error process.

In the first method of adding the PCs (the one you are using now), a different PC is added at each timestep in the time range; there I generally saw that an amount of about 40 (or -40) yields a significant change. In the second method, where the same PC is added at every timestep in the time range (by setting --use_specific_ts_pc), an amount of 2 (or -2) was already significant. The difference is that the first method changes multiple elements (different PCs), whereas the second changes one specific element more strongly across all timesteps, which accumulates.
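
The accumulation effect can be illustrated with a toy calculation. This is only a sketch under strong assumptions: the per-timestep PCs are replaced by made-up orthogonal unit vectors, and the denoising that happens between timesteps is ignored. It does not use the repository's code; `total_shift` is a hypothetical helper.

```python
import math


def total_shift(directions, amount):
    """Sum amount * direction over all timesteps and return the shift's length."""
    dim = len(directions[0])
    shift = [0.0] * dim
    for direction in directions:
        for i in range(dim):
            shift[i] += amount * direction[i]
    return math.sqrt(sum(x * x for x in shift))


n_steps = 4
# Toy stand-in for per-timestep PCs: one orthogonal unit vector per timestep.
per_step_pcs = [
    [1.0 if i == t else 0.0 for i in range(n_steps)] for t in range(n_steps)
]

# First method: a different direction is added at each timestep.
different = total_shift(per_step_pcs, amount=1.0)
# Second method: the same direction is added at every timestep.
same = total_shift([per_step_pcs[0]] * n_steps, amount=1.0)

print(different)  # 2.0, grows like sqrt(n_steps)
print(same)       # 4.0, grows like n_steps
```

With the same amount, reusing one direction moves the latent further along that direction, which is consistent with the second method needing a much smaller amount.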

The change is added in the model's latent space, which means that adding too much will start to deteriorate the quality of the results. I haven't found where this happens, though (with audio, I didn't try above 40 for the first method or 2 for the second).
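
Since the value is found by trial and error, one option is to generate a small sweep of commands and compare the outputs by ear. The sketch below only builds the command lines and does not run anything. It uses the flags that appear earlier in this thread; `build_sweep`, the default values, and the path `path/to/extraction.pt` are placeholders for illustration.

```python
import shlex


def build_sweep(extraction_path, amounts, drift_start=50, drift_end=0, evs=(1, 2, 3)):
    """Return one apply_drift.py argument list per amount (nothing is executed)."""
    commands = []
    for amount in amounts:
        cmd = [
            "python", "apply_drift.py",
            "--extraction_path", extraction_path,
            "--drift_start", str(drift_start),  # higher timestep first
            "--drift_end", str(drift_end),
            "--amount", str(amount),
            "--evs", *[str(ev) for ev in evs],
            "--combine_evs",
        ]
        commands.append(cmd)
    return commands


# Placeholder path: substitute the .pt file written by main_pc_extract_inv.py.
for cmd in build_sweep("path/to/extraction.pt", amounts=[10, 20, 40]):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or each list passed to `subprocess.run`, once the placeholder path is replaced.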