Closed lauralorenz closed 7 years ago
yo @lauralorenz, in an ideal world, what would the input to the generate folds step be :thinking:? I was thinking of having clean_corpus.py output an np.array, so that this generate folds script could pick up at getting the length for number_of_sentences
(pasted below for reference) ...
import numpy as np
from sklearn.model_selection import KFold

# clean_corpus, tokenize_sentences, and validate_sentences are defined elsewhere in this script.

def generate_word2vec_folds(corpus='Empty', folds=3, seed=10, min_sentence_length=10):
    '''
    Generates a series of text files that each represent a training or test split of the text data. Since word2vec
    does not conduct any calculations that rely on interactions across sentence boundaries, this cross-validation
    k-fold generator splits the text by sentence and then groups randomly chosen sentences into the same corpus.
    :param corpus: entire corpus to work on
    :type corpus: bytearray, str, or mixed bytes and str
    :param folds: how many train/test folds to make
    :type folds: int
    :param seed: random seed for the random number generator used to make the folds
    :type seed: int
    :param min_sentence_length: minimum sentence length that is considered valid
    :type min_sentence_length: int
    :return: list of dicts, one per fold, each with 'train' and 'test' arrays of sentences
    '''
    # Tokenize the corpus into sentences because we need to get a random sample of sentences from the resulting list.
    cleaned_corpus = clean_corpus(corpus)  # remove stray characters from corpus
    tokenized_corpus = tokenize_sentences(cleaned_corpus)  # split into sentences
    tokenized_corpus = validate_sentences(tokenized_corpus, min_sentence_length)  # keep only sentences that are >= min_sentence_length and start with a capital
    tokenized_corpus = np.array(tokenized_corpus)
    # >>>> delete everything before this line, make the filename a param and import the tokenized
    # corpus from a .npy file or however would require minimal adjustments <<<<
    number_of_sentences = len(tokenized_corpus)
    # Modern sklearn API; versions before 0.18 used KFold(n=number_of_sentences, n_folds=folds, ...)
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    corpus_split = []
    for train_index, test_index in kf.split(tokenized_corpus):
        corpus_split.append({'train': tokenized_corpus[train_index], 'test': tokenized_corpus[test_index]})
    return corpus_split
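To make the fold structure concrete, here is a rough numpy-only sketch of the shuffled k-fold split that the function above delegates to sklearn's `KFold`, run against a toy stand-in corpus (the sentence strings here are placeholders, not real data):

```python
import numpy as np

# Toy stand-in corpus: one "sentence" per element, like tokenized_corpus above.
sentences = np.array([f"sentence {i}" for i in range(9)])

folds, seed = 3, 10
rng = np.random.RandomState(seed)
indices = rng.permutation(len(sentences))

corpus_split = []
for test_index in np.array_split(indices, folds):
    # Everything not in this fold's test set becomes its training set.
    train_index = np.setdiff1d(indices, test_index)
    corpus_split.append({'train': sentences[train_index],
                         'test': sentences[test_index]})

# Each element of corpus_split is one train/test fold; every sentence
# lands in exactly one test set across the 3 folds.
```

Each fold's train and test arrays partition the corpus, which is what makes the later word2vec evaluation honest: no test sentence is ever seen during training for that fold.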
Yeah, or maybe newline-delimited text so it's easy to read in. I don't love pickles because they're very version-specific, and I don't love the idea of a library-dependent format (I believe numpy has some sort of storage format), especially since our array is 1-D anyway, so we don't get much advantage from the special numpy format. But I could be convinced on the latter point if you want.
I'm also pro the newline-delimited text option because I think eventually, maybe soon, we're going to have to do both this step and the previous step in a streaming fashion to keep things computationally manageable across different corpus sizes.
good point, will output newline delimited text!
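A minimal sketch of that newline-delimited handoff, assuming one cleaned sentence per line (the file name and sample sentences here are illustrative, not the project's actual ones):

```python
import os
import tempfile
import numpy as np

sentences = ["This is the first sentence.", "Here is another one.", "And a third."]

# Preprocess step: write one cleaned sentence per line.
path = os.path.join(tempfile.mkdtemp(), "clean_corpus.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))

# Fold-generation step: read the corpus back with no pickle or numpy format needed.
with open(path, encoding="utf-8") as f:
    tokenized_corpus = np.array([line.rstrip("\n") for line in f if line.strip()])

number_of_sentences = len(tokenized_corpus)
```

This keeps the interchange format human-inspectable and independent of any library version, which was the concern with pickles above.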
Separate corpus cleaning from the generate-folds step so that it is actually a preprocessing step in our normal workflow. That way we can have a clean corpus that we port around without having to redo the cleaning every time we generate a new set of folds.