geek-ai / Texygen

A text generation benchmarking platform
MIT License
863 stars, 203 forks

Small vocab_size raises division by zero in DocEmbSim #32

Open remidomingues opened 5 years ago

Training a SeqGAN on the following real-data training set works as expected:

import numpy as np

# 80 sequences of length 20, symbols drawn from 0..19
X = np.random.randint(0, 20, (80, 20))

However, the following dataset, which has the same shape but only 6 distinct symbols instead of 20, raises an error:

# Same shape, but symbols drawn from 0..5 only
X = np.random.randint(0, 6, (80, 20))
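
As a quick sanity check, the two datasets can be compared by shape and by number of distinct symbols. This is only a sketch and not part of Texygen: it uses a seeded `np.random.default_rng` instead of `np.random.randint` so the arrays are reproducible, and the names `X_large`, `X_small` and `describe` are made up for illustration.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (an assumption; the
# report itself uses the unseeded np.random.randint).
rng = np.random.default_rng(0)

X_large = rng.integers(0, 20, size=(80, 20))  # symbols in 0..19
X_small = rng.integers(0, 6, size=(80, 20))   # symbols in 0..5


def describe(X):
    """Return (shape, number of distinct symbols) for an integer array."""
    return X.shape, int(np.unique(X).size)


for name, X in [("20 symbols", X_large), ("6 symbols", X_small)]:
    shape, n_symbols = describe(X)
    print(name, "shape:", shape, "distinct symbols:", n_symbols)
```

Both arrays have the same shape, so the only difference between the two runs is the size of the symbol set.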

In both cases, we set vocab_size to the number of unique symbols + 1, as suggested in text_process.text_precess(). Here is the corresponding traceback:

Traceback (most recent call last):
  File "texygen/texygen.py", line 85, in train
    gan_func(X)
  File "texygen/models/seqgan/Seqgan.py", line 331, in train_real
    self.evaluate()
  File "texygen/models/seqgan/Seqgan.py", line 80, in evaluate
    scores = super().evaluate()
  File "texygen/models/Gan.py", line 55, in evaluate
    score = metric.get_score()
  File "texygen/utils/metrics/DocEmbSim.py", line 33, in get_score
    return self.get_dis_corr()
  File "texygen/utils/metrics/DocEmbSim.py", line 164, in get_dis_corr
    return np.log10(corr / len(self.oracle_sim))
ZeroDivisionError: division by zero
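
The last frame divides by `len(self.oracle_sim)`, so the error means that length was 0 when `get_dis_corr` ran. Why it is empty for the small vocabulary is not established here. As a stopgap, the final normalization could be guarded along the lines of the sketch below. The helper name `guarded_log10_mean` and the choice to return NaN are assumptions for illustration, not Texygen's API or a confirmed fix.

```python
import numpy as np


def guarded_log10_mean(corr, n_items):
    """Hypothetical helper: log10(corr / n_items), or NaN when n_items is 0.

    Mirrors the expression in the traceback,
    np.log10(corr / len(self.oracle_sim)), but avoids the ZeroDivisionError
    when there are no similarity entries to average over.
    """
    if n_items == 0:
        # Nothing to average over: report "undefined" instead of crashing.
        return float("nan")
    # Silence NumPy warnings for log10(0) and log10(negative); the result
    # is then -inf or nan, as np.log10 would give in the original code.
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.log10(corr / n_items))


# The unguarded expression fails in the same way as the traceback:
try:
    0 / len([])
except ZeroDivisionError as exc:
    print("unguarded:", exc)

print("guarded:", guarded_log10_mean(0, 0))
```

Returning NaN keeps training running but hides the underlying problem, so it is probably worth also logging a warning when the similarity list is empty.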