Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Multithread feeder to reduce data loading times #335

Closed alexdemartos closed 5 years ago

alexdemartos commented 5 years ago

Hi,

I have a large corpus (>100K sentences), and the bottleneck right now is data loading in the Feeder class. I am not experienced in TF, but I think this could be mitigated by running multiple CPU threads to read data in the Feeder. Would it make sense to do something like this in feeder.py?

import threading  # already imported at the top of feeder.py

def start_threads(self, session):
  self._session = session
  for _ in range(NUM_THREADS):  # <-- spawn several train feeder threads instead of one
    thread = threading.Thread(name='background', target=self._enqueue_next_train_group)
    thread.daemon = True  # thread will close when parent quits
    thread.start()
  thread = threading.Thread(name='background', target=self._enqueue_next_test_group)
  thread.daemon = True  # thread will close when parent quits
  thread.start()

Thanks!

apsears commented 5 years ago

Have you checked htop / your CPU usage? Does it show several cores idling? If I'm not mistaken, this repo already uses a FIFOQueue filled by 8 workers by default, each running in a different process and potentially occupying a different core.
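
For reference, a minimal sketch of that TF1-era FIFOQueue feeding pattern (not this repo's exact code; the queue shapes and the feed helper are illustrative):

  import threading
  import numpy as np
  import tensorflow as tf

  # toy queue: real code would enqueue (inputs, targets, lengths, ...) tuples
  queue = tf.FIFOQueue(capacity=32, dtypes=[tf.float32], shapes=[(80,)])
  placeholder = tf.placeholder(tf.float32, shape=(80,))
  enqueue_op = queue.enqueue([placeholder])
  batch = queue.dequeue_many(8)  # the training graph reads batches from here

  def feed(session):
    while True:
      sample = np.random.rand(80).astype(np.float32)  # stand-in for disk I/O
      session.run(enqueue_op, feed_dict={placeholder: sample})

  with tf.Session() as sess:
    for _ in range(4):  # several threads can safely share one enqueue op
      threading.Thread(target=feed, args=(sess,), daemon=True).start()
    print(sess.run(batch).shape)  # (8, 80)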

The FIFOQueue is not TF's most up-to-date idiom, but it does the job here. I've seen an issue in this repo or a similar one about moving from the FIFOQueue to the more modern tf.data Dataset/generator pipeline, but again, that seems unlikely to solve your problem.
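
For what it's worth, the tf.data version of the same idea looks roughly like this (TF 1.4+; load_example is a placeholder for the real per-sample loading):

  import tensorflow as tf

  def load_example(index):
    # placeholder for the real per-sample load/preprocess step
    return tf.random_uniform((80,))

  dataset = (tf.data.Dataset.range(100000)
             .map(load_example, num_parallel_calls=8)  # 8 parallel loader threads
             .batch(8)
             .prefetch(2))  # keep batches ready ahead of the train step
  batch = dataset.make_one_shot_iterator().get_next()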

alexdemartos commented 5 years ago

Hi, thanks for your message. As far as I know, the FIFOQueue's capacity parameter controls the queue size, not the number of background threads reading samples. In fact, I think with the current configuration the FIFOQueue does not start reading samples until training has run out of them. Adding more threads (as shown above) and enlarging the FIFOQueue capacity got rid of this problem for me.
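
One quick way to confirm this (a sketch, assuming TF 1.x and access to the feeder's tf.FIFOQueue object): log the queue's fill level during training; if it sits near zero, the GPU is waiting on data.

  import tensorflow as tf

  queue = tf.FIFOQueue(capacity=32, dtypes=[tf.float32], shapes=[(80,)])
  size_op = queue.size()

  with tf.Session() as sess:
    # run this alongside the train step; a fill level stuck near 0 means the
    # feeder threads cannot keep up, so more threads / larger capacity help
    print('queue fill level: %d / 32' % sess.run(size_op))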

alexdemartos commented 5 years ago

My problem was that the library I am using for g2p is very inefficient when applied sample by sample. I've adapted feeder.py to apply g2p to all the data beforehand.
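
In case it helps someone else, a rough sketch of that preprocessing step (g2p_batch stands in for whatever batched grapheme-to-phoneme call your library offers; the train.txt field layout shown is only approximate):

  def g2p_batch(texts):
    # placeholder: a real batched g2p call (one model/process load shared
    # across all sentences) goes here
    return [t.upper() for t in texts]

  with open('training_data/train.txt', encoding='utf-8') as f:
    rows = [line.strip().split('|') for line in f]

  texts = [row[-1] for row in rows]  # transcript is the last field
  phones = g2p_batch(texts)          # one batched call, cost amortized over the corpus

  with open('training_data/train_phones.txt', 'w', encoding='utf-8') as f:
    for row, p in zip(rows, phones):
      f.write('|'.join(row[:-1] + [p]) + '\n')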