CR-Gjx / LeakGAN

The code for the paper "Long Text Generation via Adversarial Training with Leaked Information" (AAAI 2018). Text generation using GANs and hierarchical reinforcement learning.
https://arxiv.org/abs/1709.08624

A few more instructions in the README #8

Closed waynethewizard closed 6 years ago

waynethewizard commented 6 years ago

Would you mind adding some introductory instructions to your README on how to load new data in place of the synthetic data for training? I would like to replicate your results and then introduce new data. Thank you for this resource.

CR-Gjx commented 6 years ago

If you want to use a real-world dataset, it is better to use the code in the "Image Coco" folder; you only need to modify "realtrain_cotra.txt".

Crista23 commented 6 years ago

Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?

CR-Gjx commented 6 years ago

For the real data in the "Image Coco" folder, I first create a vocabulary dictionary for every word in vocab_cotra.pkl, and then every word in the dataset is transformed to a number according to the dictionary. Specifically, every sentence in the dataset is aligned to length 20; if a sentence is shorter than 20, padding (blank) tokens are appended up to length 20, and the padding is a special token in the dictionary.
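A minimal sketch of that pipeline, assuming a hypothetical raw-captions file (coco_captions.txt) and illustrative function names; only realtrain_cotra.txt and vocab_cotra.pkl are the repo's actual files, and the exact layout inside the .pkl may differ from this:

```python
import pickle

SEQ_LEN = 20      # fixed sentence length described above
PAD = 'OTHERPAD'  # special padding token in the dictionary

def build_vocab(sentences):
    """Assign an integer id to every word; the pad token goes in last."""
    vocab = {}
    for sent in sentences:
        for w in sent.split():
            vocab.setdefault(w, len(vocab))
    vocab[PAD] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map words to ids, truncate, and pad out to SEQ_LEN."""
    ids = [vocab[w] for w in sentence.split()][:SEQ_LEN]
    return ids + [vocab[PAD]] * (SEQ_LEN - len(ids))

with open('coco_captions.txt') as f:          # hypothetical raw input
    sentences = f.read().splitlines()

vocab = build_vocab(sentences)
with open('vocab_cotra.pkl', 'wb') as f:      # the repo's vocabulary file
    pickle.dump(vocab, f)

with open('realtrain_cotra.txt', 'w') as f:   # the repo's encoded training file
    for s in sentences:
        f.write(' '.join(map(str, encode(s, vocab))) + '\n')
```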

AranKomat commented 6 years ago

According to realtrain_cotra.txt, there are 32 tokens per line, and some lines contain more than 20 non-1814 tokens (assuming 1814 here means the zero padding). So I assume you meant "32-length" rather than "20-length."

In vocab_cotra.pkl, p4801 aS'OTHERPAD' is the last entry (with ' '), so there are only 4801 vocab entries for COCO. But main.py says the vocab size is 4839, which doesn't agree. realtrain_cotra.txt shows 0 also being used as a token (in the middle of a sentence), but it doesn't appear in vocab_cotra. Since 0 was designated to be the start token, I believe it cannot be used in the middle of a sentence. According to realtrain_cotra.txt, 65 seems to stand for 'A', but according to vocab_cotra, 'A' is at 67. Likewise, '.' (period) is 193 according to realtrain_cotra, but it's 194 in vocab_cotra. By the way, does 'OTHERPAD' mean zero padding (instead of 1814)? In vocab_cotra, there's this line:

p194 aS'.' aS'much'

which means 194 corresponds to both '.' and 'much'. So I believe your vocab_cotra is inaccurate. Or is it not?

CR-Gjx commented 6 years ago

In fact, when I wrote main.py, I set a larger vocab number to prevent vocabulary overflow. Maybe it is not rigorous, but during training, some tokens' probabilities become 0 because they never occur in the training dataset. As you say, aS'OTHERPAD' serves as a common word and a blank. In my code, I assume the generator network can only generate fixed-length sentences, so I added this token to guarantee all sentences in the dataset have a fixed length. Some sentences are so short that aS'OTHERPAD' appears many times.
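In other words, the size mismatch noted above is deliberate headroom rather than a separate dictionary. Roughly (the numbers come from this thread; the variable names are illustrative):

```python
true_vocab_size = 4801   # entries actually present in vocab_cotra.pkl
model_vocab_size = 4839  # value hard-coded in main.py

# The surplus ids never appear in realtrain_cotra.txt, so during training
# the generator's softmax drives their probabilities toward 0; they are
# only headroom against vocabulary overflow.
assert model_vocab_size >= true_vocab_size
```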

CR-Gjx commented 6 years ago

As for the last question, it may be a bug I introduced while uploading the code. I will verify and fix it. Thanks for the reminder.

AranKomat commented 6 years ago

I did print(word) and print(vocab) in convert.py, and found that '.' and 'much' are attributed differently and appropriately. So I guess the earlier discrepancy comes from opening the .pkl file as if it were a txt file. I also found that '0' corresponds to 'raining,', so it has nothing to do with the start token. A few sentences from realtrain_cotra.txt were translated nicely with convert.py, so I guess there's no problem at all. Sorry for the confusion.
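For anyone checking this themselves, the aS'...' markers quoted above are just raw pickle protocol-0 text; load the file with pickle instead. A minimal check, assuming the (word, vocab) pair layout implied by convert.py's variable names:

```python
import pickle  # the original Python 2 code uses cPickle

with open('vocab_cotra.pkl', 'rb') as f:
    # For a pickle written by Python 2, Python 3 may need
    # pickle.load(f, encoding='latin1').
    word, vocab = pickle.load(f)  # assumed: id -> word and word -> id maps

print(word[194])    # should print a single word, e.g. '.'
print(vocab['.'])   # the inverse lookup should agree
```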

CR-Gjx commented 6 years ago

OK, it may be an artifact of reading the .pkl file as text. Thanks for your discovery.

bharathreddy1997 commented 4 years ago

> Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?

Hi, did you understand how the pickle file was generated?