Open eigenfoo opened 5 years ago
Given that work in the field has been using Wikipedia, I think we cannot go wrong with using it. Otherwise, I do not know where we can get a corpus that will always be this easy to scale up.
I have no idea why it is zipped; we did that when we were working on the shell scripts.
Re the name of data_builder: what do you think we should rename it to?
OK, I agree that we should use Wikipedia; we can just use a small subset of the corpus if we like. I wasn't aware it was a standard corpus in the field.
I've committed the shell script separately, but the zip file now has the shell script "inside it". To keep things clean, I think we should fix that.
The only functionality of data_builder.py is to convert raw text into numbers, right? I would say featurize_text.py or featurize_corpus.py would be more explicit. Right now I only know that the script "builds my data", which could mean literally anything...
We now need a reasonable corpus for us to experiment on. In the interest of time, I am tempted to not use the Wikipedia corpus (or at least not all of it), since that is too large.
data_builder.py) to convert that into the ingestible.txt format data, and also output a dictionary that maps tokens to integer ids.
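For reference, here is a rough sketch of what "convert raw text into integer ids and output a token-to-id dictionary" could look like. This is not the actual data_builder.py: the function names build_vocab and featurize, the lowercasing, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(lines, min_count=1):
    """Map each token to an integer id (hypothetical helper, not from the repo).

    Assumes lowercased, whitespace-split tokens. Ids are assigned by
    descending frequency, with ties broken alphabetically so the mapping
    is deterministic.
    """
    counts = Counter(tok for line in lines for tok in line.lower().split())
    kept = [tok for tok, count in counts.items() if count >= min_count]
    kept.sort(key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(kept)}


def featurize(lines, vocab):
    """Replace each token with its id, dropping tokens missing from vocab."""
    return [
        [vocab[tok] for tok in line.lower().split() if tok in vocab]
        for line in lines
    ]


# Tiny stand-in corpus; a real run would read lines from the Wikipedia subset.
corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
ids = featurize(corpus, vocab)
```

Raising min_count would be one simple way to keep the vocabulary small when experimenting on a subset of the corpus.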