eigenfoo-archives / distributional-embeddings

Word representations via distributional embeddings

TODO: get data #9

Open eigenfoo opened 5 years ago

eigenfoo commented 5 years ago

We now need a reasonable corpus for us to experiment on. In the interest of time, I am tempted to not use the Wikipedia corpus (or at least not all of it), since that is too large.

Jped commented 5 years ago

Given that prior work in the field has used Wikipedia, I think we can't go wrong with using it. Otherwise I don't know where else we could get a corpus that would always be this easy to scale up.

I have no idea why it is zipped, we did that when we were working on the shell scripts.

Re: the name of data_builder, what do you think we should rename it to?

eigenfoo commented 5 years ago

OK, I agree that we should use Wikipedia: we can just use a small subset of the corpus if we need to. I wasn't aware it was a standard corpus in the field.
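For the "small subset" idea, something like this would do: a minimal sketch that keeps each line of a plain-text dump with some probability. The file paths, the one-document-per-line assumption, and the 1% default are all placeholders, not decisions we've made.

```python
import random

def sample_corpus(in_path, out_path, fraction=0.01, seed=42):
    """Write a random subset of `in_path` to `out_path`.

    Assumes the corpus is plain text with one document (or sentence)
    per line. Each line is kept independently with probability
    `fraction`; `seed` makes the subset reproducible.
    """
    rng = random.Random(seed)
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if rng.random() < fraction:
                dst.write(line)
```

Streaming line by line means we never hold the full dump in memory, so the same script would work even if we later scale up to the whole corpus.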

I've committed the shell script separately, but the zip file still has the shell script inside it. To keep things clean, I think we should fix that.

The only functionality of data_builder.py is to convert raw text into numbers, right? I would say featurize_text.py or featurize_corpus.py would be more explicit? Right now I only know that the script "builds my data", which could mean literally anything...
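To make the "raw text into numbers" point concrete, here's a rough sketch of what a `featurize_text.py` could boil down to. The whitespace tokenizer, the `<unk>` token, and the `min_count` cutoff are illustrative choices, not necessarily what data_builder.py currently does.

```python
from collections import Counter

def build_vocab(lines, min_count=1):
    """Map each token to an integer id, most frequent first.

    Id 0 is reserved for out-of-vocabulary tokens ("<unk>").
    Tokens appearing fewer than `min_count` times are dropped.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = {"<unk>": 0}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def featurize(lines, vocab):
    """Replace each token with its id; unknown tokens map to 0."""
    return [[vocab.get(tok, 0) for tok in line.split()] for line in lines]
```

If the script really is just this mapping step, then `featurize_text.py` (or `featurize_corpus.py`) describes it exactly, which supports the rename.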