26medias / context-aware-markov-chains

Markov Chain combined with word vector embedding (word2vec) and part-of-speech tagging, for context-aware text generation. License: MIT
98 stars, 9 forks

Can I use it for character generation? #8

Open aletote opened 5 years ago

aletote commented 5 years ago

I guess all I need is to put spaces between the characters in the text file dataset?

26medias commented 5 years ago

You would only need to update the way the text file is tokenized in the tokenize() method on line 27: https://github.com/26medias/context-aware-markov-chains/blob/master/cmarkov.js#L27

cues are the sentence splitters; tokens is the word splitter.

If you change var tokens = text.split(' '); to var tokens = text.split('');, you would split the text into chars.
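To make the difference concrete, here is a minimal standalone sketch of the two splits. The helper names splitWords and splitChars are made up for illustration; they are not part of cmarkov.js, and the real tokenize() does more than this.

```javascript
// Hypothetical helpers, not taken from cmarkov.js: they only isolate the split call.
function splitWords(text) {
  // One token per space-separated word.
  return text.split(' ');
}

function splitChars(text) {
  // One token per UTF-16 code unit; spaces become tokens too.
  return text.split('');
}

const sample = 'the cat';
const wordTokens = splitWords(sample);
const charTokens = splitChars(sample);

console.log(wordTokens); // [ 'the', 'cat' ]
console.log(charTokens); // [ 't', 'h', 'e', ' ', 'c', 'a', 't' ]
```

Note that split('') cuts on UTF-16 code units, so characters outside the Basic Multilingual Plane (most emoji, for example) would be split in two. Array.from(text) splits on code points instead, if that matters for your dataset.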

However, it probably won't output anything of value, if anything.

The algorithm works by mapping the structure of the sentences: the positions of the verbs, adjectives, subjects, and so on. This is how it learns and reproduces the general style of the training text. If you split into chars instead of words, the POS (part-of-speech) tagging won't work, it won't be able to learn any style, and therefore it probably won't be able to output much.

The text generation is based on statistics rather than machine learning. During training, a graph is built that maps the relationships between words, and that graph is then used to generate the text. The output only makes sense because the POS tagging lets it rebuild a generally well-structured sentence; without the POS, the output will probably be nonsense.
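As a rough illustration of what "a graph that maps the relationship between tokens" means, here is a plain first-order Markov chain with no POS tagging and no word vectors. It is not the code from cmarkov.js; buildGraph and generate are hypothetical names, and the optional random parameter is only there to make runs reproducible.

```javascript
// Sketch only: a plain first-order Markov chain, without the POS or word2vec parts.

// Build the graph: each token maps to the list of tokens that followed it.
function buildGraph(tokens) {
  const graph = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const current = tokens[i];
    const next = tokens[i + 1];
    if (!graph.has(current)) graph.set(current, []);
    graph.get(current).push(next); // duplicates preserve observed frequencies
  }
  return graph;
}

// Walk the graph, picking each next token at random among the observed followers.
function generate(graph, start, maxLength, random = Math.random) {
  const output = [start];
  let current = start;
  while (output.length < maxLength) {
    const followers = graph.get(current);
    if (!followers) break; // dead end: nothing ever followed this token
    current = followers[Math.floor(random() * followers.length)];
    output.push(current);
  }
  return output;
}

const corpus = 'the cat sat on the mat';

// Word-level graph: transitions between words.
const wordGraph = buildGraph(corpus.split(' '));
console.log(generate(wordGraph, 'the', 6).join(' '));

// Char-level graph: transitions between single characters.
const charGraph = buildGraph(corpus.split(''));
console.log(generate(charGraph, 't', 20).join(''));
```

With char tokens, each step only knows the previous character, so the walk tends to drift into letter sequences that are not words. That is the same problem described above, before even considering the missing POS structure.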

I would suggest looking at an LSTM instead; it will give much better results: https://github.com/tensorflow/tfjs-examples/tree/master/lstm-text-generation