Where to obtain dataset?

albertqjiang / MohaSimulator

A simulator of Haas' responses to basically any text


Where to obtain dataset? #1

Open goldsail opened 6 years ago

goldsail commented 6 years ago

First, to locate and label Haas's sentences in the text you download from the web, you would need to train a classifier that tells whether any given sentence is Haas's or not. But this also requires a dataset...
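For illustration, the kind of sentence classifier described above could be sketched as a small bag-of-words Naive Bayes model. This is only a sketch under assumptions: the class name, the whitespace tokenizer, the labels "target" and "other", and the training sentences are all hypothetical placeholders, not data or code from this repository. It also shows why a labelled dataset is needed before anything can be trained.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Lowercase whitespace split; real (e.g. Chinese) text would need a proper tokenizer.
    return sentence.lower().split()


class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, sentences, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.class_counts[c] / total)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for word in tokenize(sentence):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[c][word] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, sentence):
        scores = self.log_scores(sentence)
        return max(scores, key=scores.get)


# Hypothetical placeholder data, only to show the shape of a labelled set.
train_sentences = [
    "placeholder target sentence alpha",
    "placeholder target sentence beta",
    "unrelated filler text gamma",
    "unrelated filler text delta",
]
train_labels = ["target", "target", "other", "other"]

clf = NaiveBayesSentenceClassifier().fit(train_sentences, train_labels)
print(clf.predict("another target sentence"))
```

With only a handful of labelled sentences a model like this would not be reliable; it is meant to show the input it expects, namely sentences paired with labels.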

albertqjiang commented 6 years ago

We can regard this problem as a dialogue system design problem. I can't think of any perfectly viable way of acquiring the data. But this wiki page provides a good starting point: https://zh.wikipedia.org/wiki/%E8%86%9C%E8%9B%A4%E6%96%87%E5%8C%96

goldsail commented 6 years ago

Training an autoencoder requires a large corpus. Plus, I have little knowledge of NLP. What I do know is how to convert text to a sequence of vectors using word2vec and then feed it to a sequence model.

In CV, GANs are a good approach to sample generation.

albertqjiang commented 6 years ago

word2vec is the first step. Let's simply gather some data and put it into an LSTM to get a feel for it. GANs could be used for NLP as well.
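As a rough sketch of that first step, text can be turned into a sequence of vectors with a vocabulary and an embedding table. Everything here is an assumption made for illustration: the function names are hypothetical, the corpus is placeholder text, and the random vectors only stand in for embeddings that a trained word2vec model would provide.

```python
import random


def build_vocab(sentences):
    # Index 0 is reserved for tokens not seen while building the vocabulary.
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def make_embeddings(vocab, dim, seed=0):
    # Random stand-in for trained word2vec vectors (assumption, not real embeddings).
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(len(vocab))]


def encode(sentence, vocab, embeddings):
    # Map each token to its id, then look up the vector for that id.
    ids = [vocab.get(token, 0) for token in sentence.split()]
    return [embeddings[i] for i in ids]


# Hypothetical placeholder corpus.
corpus = ["placeholder sentence one", "placeholder sentence two"]
vocab = build_vocab(corpus)
embeddings = make_embeddings(vocab, dim=4)
sequence = encode("placeholder sentence three", vocab, embeddings)
print(len(sequence), len(sequence[0]))
```

The resulting list of vectors, one per token, is the kind of input a sequence model such as an LSTM would consume step by step.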