Kyubyong / tacotron

A TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
Apache License 2.0
1.83k stars · 436 forks

Formatting Custom Training Data & ValueError: bucket_boundaries must not be empty #113

Open dbarroso1 opened 6 years ago

dbarroso1 commented 6 years ago

Hello, I've been trying to make my own training data, but there don't seem to be many resources on how the data should be formatted. I've compared the LJ001 data and tried to imitate it, including splitting the wavs and matching the transcript.csv.

I have tested train.py with the LJ001 data and the trainer works, but when I try with my own data it fails with this error:

Traceback (most recent call last):
  File "train.py", line 96, in <module>
    g = Graph(); print("Training Graph loaded")
  File "train.py", line 33, in __init__
    self.x, self.y, self.z, self.fnames, self.num_batch = get_batch()
  File "C:\Users\...\tacotron-master\data_load.py", line 116, in get_batch
    dynamic_pad=True)
  File "C:\anaconda3\envs\...\training\bucket_ops.py", line 374, in bucket_by_sequence_length
    raise ValueError("bucket_boundaries must not be empty")
ValueError: bucket_boundaries must not be empty

Here is an example of my CSV file; I tried matching an ID|TEXT|LENGTH format.

SM001-0001|Oh happy fourth of July America|00:00:02
SM001-0002|Ready to fire up the grill and celebrate our victory over the Brits|00:00:03
SM001-0003|Well, I'm not|00:00:01
SM001-0004|Because despite that incredibly convincing American accent, I'm one of those Brits|00:00:04
SM001-0005|now I've acted in film and TV for years|00:00:02
SM001-0006|but my greatest performance is acting like I don't care that every summer you gobble down tube sausages and celebrate kicking our arses|00:00:07
SM001-0007|Or butts as you say incorrectly|00:00:02
SM001-0008|Do you really still have to celebrate your emancipation from us|00:00:02
SM001-0009|I mean that's like your girlfriend breaking up with you and then celebrating with fireworks|00:00:04
SM001-0010|every year for 300 years|00:00:03
SM001-0011|it gets my goat|00:00:01
SM001-0012|but what really gets my goat is imagining how great America would be if we were still in charge|00:00:04
SM001-0013|Oh America if we'd won the war you'd have better comedy news TV programs and way better rude words|00:00:07
SM001-0014|Oh I'm talking fanny, trollop, minger tar, Minjbag, bleeding, sodding, blooming, cocked up, get stuffed|00:00:06
SM001-0015|and of course wanker|00:00:01
SM001-0016|imagine how sophisticated you'd say when you're insulting someone|00:00:03 
SM001-0017|Oh Brad your wife's a slag don't piss off your wanker|00:00:04
SM001-0018|see how classy that sounded with our accents and your American self-confidence you'd be unstoppable|00:00:05
SM001-0019|yeah you'd have to pay a few more taxes but you can't put a price on that|00:00:03
SM001-0020|Great Britain two would be the greatest country on Earth|00:00:02
SM001-0021|your lawyers would all wear powdered wigs so criminals really respect them|00:00:04
SM001-0022|and you'd have all the mushy peas you can stuff down your bloody great gobs|00:00:03
SM001-0023|oh and if you get sick you don't need to worry about medical insurance because with a National Health Service a doctor will see you for free in about two years|00:00:08
SM001-0024|plus your taxes will be spent on things you really need like a royal family who do the tough jobs no one else wants to do|00:00:06
SM001-0025|like being driven around in a really nice car while waving|00:00:04
SM001-0026|you'all want to eat some apple pie then shoot some hoops and have hoedown|00:00:03

So, tl;dr, two questions:

  1. Why am I receiving this "bucket_boundaries must not be empty" error when Python finds the CSV and can read it?
  2. Based on the answer to 1, how can I properly format my data to work with the neural network?

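For question 1, the boundaries passed to `bucket_by_sequence_length` are derived from the spread of sequence lengths in the corpus, so if the shortest and longest transcripts are (nearly) the same length, the generated list is empty. A minimal sketch of how that can happen — `make_boundaries` is a hypothetical stand-in, and the exact expression in `data_load.py` may differ:

```python
# Hedged sketch: how an empty bucket_boundaries list can arise.
# make_boundaries is an illustrative helper, not the repo's actual code;
# the real formula in data_load.py may use a different step or offset.

def make_boundaries(minlen, maxlen, step=1):
    """Return candidate bucket boundaries strictly inside (minlen, maxlen)."""
    return list(range(minlen + 1, maxlen - 1, step))

# A corpus whose shortest and longest transcripts are nearly equal
# leaves no integer strictly between the endpoints, so the list is
# empty and bucket_by_sequence_length raises the ValueError:
assert make_boundaries(149, 151) == []

# A corpus with a real spread of lengths yields usable boundaries:
assert make_boundaries(10, 50, 10) == [11, 21, 31, 41]
```

If your transcripts all parse to roughly the same length (for example because the loader is reading the `00:00:0x` duration column instead of the text), you end up in the empty case.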
djebel-amila commented 5 years ago

Hi @dbarroso1, did you eventually find out how to format your data? I'm at the same stage. I couldn't figure out how to do it properly, but duplicating the transcript.csv file from the LJ dataset and carefully pasting in my own sentences, one by one, did the trick. Not a particularly sustainable or elegant solution…
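One way to avoid the manual pasting: LJSpeech's metadata.csv uses an `ID|raw text|normalized text` layout, so the third column in the example above (a `HH:MM:SS` duration) is not what the loader expects. A hedged sketch of a converter — the output layout is an assumption based on the LJ file, and with no separate normalization step the text is simply repeated:

```python
# Hedged sketch: convert an "ID|TEXT|HH:MM:SS" transcript line into an
# LJSpeech-style "ID|raw text|normalized text" line. The target layout
# is assumed from LJ's metadata.csv; convert_line is a hypothetical helper.

def convert_line(line):
    fname, text, _duration = line.strip().split("|")
    # LJ carries the transcript twice (raw and normalized); without a
    # normalization pass, repeat the text verbatim.
    return "{}|{}|{}".format(fname, text, text)

converted = convert_line("SM001-0003|Well, I'm not|00:00:01")
```

Run over every line of the custom CSV, this produces a file shaped like the one the LJ loader already handles.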

aryachiranjeev commented 4 years ago

I am also facing this bucket error. I checked maxlen (151) and minlen (149); with values that close the for loop never iterates, so no value ever lands in a bucket. If anyone has solved this problem, kindly help me with this issue.
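A quick way to confirm this diagnosis before training is to measure the spread of transcript lengths yourself. A small sketch, assuming the pipe-delimited layout shown earlier in this thread (`length_stats` is a hypothetical helper, not part of the repo):

```python
# Hedged sketch: sanity-check the spread of transcript lengths in a
# pipe-delimited CSV before training. Assumes "ID|TEXT|..." rows as in
# the examples above; length_stats is an illustrative helper.

def length_stats(csv_text):
    lengths = [len(row.split("|")[1]) for row in csv_text.strip().splitlines()]
    return min(lengths), max(lengths)

sample = "A-1|short|00:00:01\nA-2|a much longer transcript line|00:00:03"
lo, hi = length_stats(sample)
# If hi - lo is tiny (as with maxlen 151 / minlen 149 above), the
# bucket-boundary range collapses and the ValueError follows.
```

If the reported min and max are only a character or two apart, the fix is in the data (or in which column the loader reads), not in bucket_ops.py.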

ST2-EV commented 4 years ago

Hey, consider using this GUI to make the dataset