Deep autoencoders are useful in topic modeling, or statistically modeling abstract topics that are distributed across a collection of documents.
This, in turn, is an important step in question-answer systems like Watson.
In brief, each document in a collection is converted to a bag of words (i.e., a set of word counts), and those counts are scaled to decimals between 0 and 1, which can be read as the probability of a word occurring in the document.
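A minimal sketch of that preprocessing step in Python, assuming a small fixed vocabulary (the function and variable names here are illustrative, not from any particular library):

```python
from collections import Counter

def bag_of_words(documents, vocabulary):
    """Convert each document to scaled word counts over a fixed vocabulary."""
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        total = sum(counts[word] for word in vocabulary) or 1
        # Scale each count to a decimal between 0 and 1: roughly the
        # probability of that word occurring in this document.
        vectors.append([counts[word] / total for word in vocabulary])
    return vectors

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = ["cat", "dog", "mat", "sat", "chased"]
print(bag_of_words(docs, vocab))
```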
The scaled word counts are then fed into a deep-belief network (DBN), a stack of restricted Boltzmann machines; once pretrained and unrolled, the stack can be fine-tuned with backpropagation much like an ordinary feedforward autoencoder. The DBN compresses each document to a set of 10 numbers through a series of sigmoid transforms that map it onto the feature space.
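The sketch below shows only the encoder's forward pass, with made-up layer sizes (2000 → 500 → 250 → 10, in the spirit of Hinton and Salakhutdinov's document-coding experiments) and untrained random weights; in a real system the weights would come from RBM pretraining followed by backprop fine-tuning:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Encoder half of the deep autoencoder: a 2000-word bag-of-words input
# is squeezed down to a 10-number code. Random weights here only
# illustrate the shape of the computation, not a trained model.
layer_sizes = [2000, 500, 250, 10]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, (m, n))
           for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def encode(doc_vector):
    """Map one scaled word-count vector onto the 10-dimensional feature space."""
    h = doc_vector
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)  # one sigmoid transform per layer
    return h

code = encode(rng.random(2000))  # a random stand-in for a real document
print(code.shape)  # (10,)
```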
Each document’s number set, or vector, is then introduced to the same vector space, and its distance from every other document vector is measured. Roughly speaking, document vectors that land near one another fall under the same topic.
For example, one document could be the “question” and the others candidate “answers,” a match the software would make using vector-space measurements.
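A sketch of that matching step, assuming the 10-number codes have already been produced by the encoder; Euclidean distance stands in here for whatever similarity measure a given implementation actually uses:

```python
import numpy as np

def nearest_documents(query_code, doc_codes):
    """Rank documents by Euclidean distance to the query in the 10-D code space."""
    distances = np.linalg.norm(doc_codes - query_code, axis=1)
    return np.argsort(distances)  # closest (most topically similar) first

# Hypothetical 10-number codes for a question and 100 candidate answers.
rng = np.random.default_rng(1)
answers = rng.random((100, 10))
question = rng.random(10)

ranking = nearest_documents(question, answers)
print("Best match: document", ranking[0])
```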