Open devlux76 opened 1 year ago
I believe I've narrowed down the problem.
In https://github.com/AIGC-Audio/AudioGPT/blob/main/audio-chatgpt.py you have the following code
```python
class T2S:
    def __init__(self, device=None):
        from inference.svs.ds_e2e import DiffSingerE2EInfer
        if device is None:
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print("Initializing DiffSinger to %s" % device)
        self.device = device
        self.exp_name = 'checkpoints/0831_opencpop_ds1000'
        self.config = 'NeuralSeq/egs/egs_bases/svs/midi/e2e/opencpop/ds1000.yaml'
        self.set_model_hparams()
        self.pipe = DiffSingerE2EInfer(self.hp, device)
        self.default_inp = {
            'text': '你 说 你 不 SP 懂 为 何 在 这 时 牵 手 AP',
            'notes': 'D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | D#4/Eb4 | rest | D#4/Eb4 | D4 | D4 | D4 | D#4/Eb4 | F4 | D#4/Eb4 | D4 | rest',
            'notes_duration': '0.113740 | 0.329060 | 0.287950 | 0.133480 | 0.150900 | 0.484730 | 0.242010 | 0.180820 | 0.343570 | 0.152050 | 0.266720 | 0.280310 | 0.633300 | 0.444590'
        }

    def set_model_hparams(self):
        set_hparams(config=self.config, exp_name=self.exp_name, print_hparams=False)
        self.hp = hp

    def inference(self, inputs):
        self.set_model_hparams()
        val = inputs.split(",")
        key = ['text', 'notes', 'notes_duration']
        try:
            inp = {k: v for k, v in zip(key, val)}
            wav = self.pipe.infer_once(inp)
        except:
            print('Error occurs. Generate default audio sample.\n')
            inp = self.default_inp
            wav = self.pipe.infer_once(inp)
        # if inputs == '' or len(val) < len(key):
        #     inp = self.default_inp
        # else:
        #     inp = {k: v for k, v in zip(key, val)}
        # wav = self.pipe.infer_once(inp)
        wav *= 32767
        audio_filename = os.path.join('audio', str(uuid.uuid4())[0:8] + ".wav")
        wavfile.write(audio_filename, self.hp['audio_sample_rate'], wav.astype(np.int16))
        print(f"Processed T2S.run, audio_filename: {audio_filename}")
        return audio_filename


class t2s_VISinger:
    def __init__(self, device=None):
        from espnet2.bin.svs_inference import SingingGenerate
        if device is None:
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print("Initializing VISingere to %s" % device)
        tag = 'AQuarterMile/opencpop_visinger1'
        self.model = SingingGenerate.from_pretrained(
            model_tag=str_or_none(tag),
            device=device,
        )
        phn_dur = [[0.        , 0.219     ],
                   [0.219     , 0.50599998],
                   [0.50599998, 0.71399999],
                   [0.71399999, 1.097     ],
                   [1.097     , 1.28799999],
                   [1.28799999, 1.98300004],
                   [1.98300004, 7.10500002],
                   [7.10500002, 7.60400009]]
        phn = ['sh', 'i', 'q', 'v', 'n', 'i', 'SP', 'AP']
        score = [[0, 0.50625, 'sh_i', 58, 'sh_i'],
                 [0.50625, 1.09728, 'q_v', 56, 'q_v'],
                 [1.09728, 1.9832100000000001, 'n_i', 53, 'n_i'],
                 [1.9832100000000001, 7.105360000000001, 'SP', 0, 'SP'],
                 [7.105360000000001, 7.604390000000001, 'AP', 0, 'AP']]
        tempo = 70
        tmp = {}
        tmp["label"] = phn_dur, phn
        tmp["score"] = tempo, score
        self.default_inp = tmp

    def inference(self, inputs):
        val = inputs.split(",")
        key = ['text', 'notes', 'notes_duration']
        try:  # TODO: input will be update
            inp = {k: v for k, v in zip(key, val)}
            wav = self.model(text=inp)["wav"]
        except:
            print('Error occurs. Generate default audio sample.\n')
            inp = self.default_inp
            wav = self.model(text=inp)["wav"]
        audio_filename = os.path.join('audio', str(uuid.uuid4())[0:8] + ".wav")
        soundfile.write(audio_filename, wav, samplerate=self.model.fs)
        return audio_filename
```
It looked a bit off to me, but I couldn't quite put my finger on it, so I asked ChatGPT 4 about it.

It looks like the code expects inputs in a specific format and, if the inputs are not in that format, it defaults to generating a default audio sample. The key point is the `try`/`except` blocks in the `inference` methods of both the `T2S` and `t2s_VISinger` classes. If any exception is thrown during execution of the code within the `try` block, it immediately jumps to the `except` block, which generates a default audio sample.
Here is what's happening in more detail:

1. The input is expected to be a string of comma-separated values. This string is split on commas to create a list, `val`.
2. It then tries to create a dictionary `inp` by zipping `key` (the list of keys `['text', 'notes', 'notes_duration']`) with `val` (the list of values obtained from the input string).
3. It passes this `inp` dictionary to the `infer_once` function of the `pipe` object in the `T2S` class, or to the `model` object in the `t2s_VISinger` class.
4. If anything goes wrong during this process (for example, if the input string does not contain enough comma-separated values to match up with the keys, or if the `infer_once` function does not work with the provided inputs), an exception is raised.
5. As soon as an exception is raised, it jumps to the `except` block and generates a default audio sample using `self.default_inp`.
So, if you're always hearing the same output regardless of the input parameters, it's likely that an exception is being raised every time you try to pass in different parameters. The exception could be due to an invalid input format or values, or due to some issue with the `infer_once` method.
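The silent-truncation behavior of `zip` is easy to demonstrate in isolation. In the sketch below, `fake_infer_once` is a hypothetical stand-in for the real DiffSinger call (assumed to read all three fields): a typical chat prompt contains no commas, so the comprehension silently produces a one-key dict, the model call raises `KeyError`, and control falls into the `except` block.

```python
def fake_infer_once(inp):
    # Hypothetical stand-in: assume the real pipeline reads all three fields.
    return len(inp['text']) + len(inp['notes']) + len(inp['notes_duration'])

key = ['text', 'notes', 'notes_duration']

# A well-formed input: three comma-separated fields.
good = "la la la,C4 | D4 | E4,0.5 | 0.5 | 0.5"
inp = {k: v for k, v in zip(key, good.split(","))}
print(sorted(inp))  # all three keys present

# A typical chat prompt has no commas, so zip silently stops
# after one pair and the dict is missing two keys.
bad = "sing me a happy song"
inp = {k: v for k, v in zip(key, bad.split(","))}
print(sorted(inp))  # only 'text'
try:
    fake_infer_once(inp)
except KeyError as e:
    print("falls into the except block:", e)
```

Because `zip` stops at the shortest iterable, no error is raised at dict-building time; the failure only surfaces inside the model call, where the bare `except` swallows it.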
To fix this, you need to ensure that you're passing in inputs in the correct format and with valid values. If you're not sure what the correct format or values are, you'll need to look at the documentation or source code for the `infer_once` method.
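A defensive sketch of that fix (my own suggestion, not code from the repo): validate the split before calling the model, so a malformed prompt fails loudly instead of silently falling back to the default sample. The commented-out `len(val) < len(key)` check in the original hints at the same idea.

```python
def parse_inputs(inputs: str) -> dict:
    """Split a 'text,notes,notes_duration' string and validate it.

    Raises ValueError instead of silently producing a short dict,
    so the caller can report the problem rather than fall back to
    the default sample.
    """
    key = ['text', 'notes', 'notes_duration']
    val = [v.strip() for v in inputs.split(",")]
    if len(val) != len(key):
        raise ValueError(
            f"expected {len(key)} comma-separated fields "
            f"({', '.join(key)}), got {len(val)}"
        )
    return dict(zip(key, val))
```

With this in place, the `inference` method could catch `ValueError` specifically and surface the message to the user instead of returning the default sample.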
Also, you might want to modify the `except` block to print out the exception message, like this:

```python
except Exception as e:
    print('Error occurs:', str(e))
    print('Generate default audio sample.\n')
    inp = self.default_inp
    wav = self.pipe.infer_once(inp)
```
This will give you more information about what's going wrong when you pass in different parameters.
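Going one step further than `str(e)`, printing the full traceback with the standard-library `traceback` module pinpoints the exact line inside `infer_once` that failed. A small self-contained sketch (the `infer` callable and names here are illustrative, not from the AudioGPT codebase):

```python
import traceback

def run_with_diagnostics(infer, inp, default_inp):
    """Try the user's input first; on failure, log the full traceback
    before falling back to the default sample."""
    try:
        return infer(inp)
    except Exception:
        print("inference failed for user input:")
        traceback.print_exc()
        print("falling back to default sample")
        return infer(default_inp)

# Demo with a stand-in infer function that rejects incomplete dicts.
def demo_infer(inp):
    return f"wav for {inp['text']}/{inp['notes']}"

result = run_with_diagnostics(
    demo_infer,
    {'text': 'hello'},                      # missing 'notes' -> KeyError
    {'text': 'default', 'notes': 'C4'},
)
print(result)
```

The traceback makes it immediately obvious whether the failure happens while building `inp` or deep inside the model code.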
I gave it different lyrics, and even tried uploading a different audio file, etc., but prompts that involve generating a singing voice always produce the exact same output. Either something is hardcoded somewhere, or the model has been overfitted.