
eigenfoo-archives / distributional-embeddings

Word representations via distributional embeddings


TODO: get data #9

Open eigenfoo opened 5 years ago

eigenfoo commented 5 years ago

We now need a reasonable corpus for us to experiment on. In the interest of time, I am tempted to not use the Wikipedia corpus (or at least not all of it), since that is too large.

[x] Decide on a corpus
[ ] Write a shell script to preprocess it (e.g. strip punctuation). We already have a shell script like this. Why is it zipped?
[x] Run a Python script (currently the ill-named data_builder.py) to convert that into the ingestible .txt data format, and also output a dictionary that maps tokens to integer ids.
[ ] Zip the data set and commit it to this repo, for reproducibility. Separately, commit the shell scripts and Python scripts.
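
For reference, the preprocessing and featurizing steps in the checklist could look roughly like the sketch below. This is not the contents of data_builder.py or of the existing shell script; the function names (`preprocess`, `build_vocab`, `encode`), the lowercasing, and the frequency-based id ordering are all assumptions made for illustration.

```python
import re
from collections import Counter


def preprocess(line):
    """Lowercase a line and replace anything that is not a letter,
    digit, or whitespace with a space (assumed cleaning rule)."""
    return re.sub(r"[^a-z0-9\s]", " ", line.lower())


def build_vocab(lines, min_count=1):
    """Map each token to an integer id.

    Ids are assigned by descending frequency, with ties broken
    alphabetically so the mapping is deterministic across runs.
    """
    counts = Counter(tok for line in lines for tok in preprocess(line).split())
    kept = [tok for tok, c in counts.items() if c >= min_count]
    kept.sort(key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(kept)}


def encode(line, vocab):
    """Convert one line of raw text into a list of integer ids,
    silently dropping tokens that are not in the vocabulary."""
    return [vocab[tok] for tok in preprocess(line).split() if tok in vocab]


# Toy corpus standing in for the real data.
corpus = ["The cat sat.", "The dog sat, too!"]
vocab = build_vocab(corpus)
ids = [encode(line, vocab) for line in corpus]
```

Keeping the token-to-id dictionary deterministic matters if the data set is regenerated rather than committed, since the ids have to line up with any saved embeddings.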

Jped commented 5 years ago

Given that work in the field has been using Wikipedia, I think we cannot go wrong with using it. Otherwise, I do not know where else we could get a corpus that will always be this easy to scale up.

I have no idea why it is zipped; we did that when we were working on the shell scripts.

Re the name of data_builder: what do you think we should rename it to?

eigenfoo commented 5 years ago

OK, I agree that we should use Wikipedia: we can just use a small subset of the corpus if we like. I wasn't aware it was a standard corpus in the field.
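
One possible way to take such a subset, assuming the corpus is available as one passage per line, is to stream lines until a token budget is reached. The helper name `take_subset` and the `max_tokens` parameter are hypothetical, and whitespace splitting is only a rough token count.

```python
def take_subset(lines, max_tokens):
    """Yield lines from an iterable until roughly max_tokens
    whitespace-separated tokens have been emitted.

    The last yielded line may push the total slightly past the budget.
    """
    seen = 0
    for line in lines:
        if seen >= max_tokens:
            break
        seen += len(line.split())
        yield line


# Toy example; with a real dump one would pass an open file handle
# instead of a list, so the full corpus is never loaded into memory.
subset = list(take_subset(["a b c", "d e", "f"], max_tokens=4))
```

Because the function only reads as far as it needs to, the same script scales from a small experiment up to the full corpus by changing one number.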

I've committed the shell script separately, but the zip file now has the shell script "inside it". To be clean, I think we should fix that.

The only functionality of data_builder.py is to convert raw text into numbers, right? I would say featurize_text.py or featurize_corpus.py would be more explicit? Right now I only know that the script "builds my data", which could mean literally anything...