Closed lauralorenz closed 7 years ago
yo @lauralorenz, in an ideal world, what would the input to the generate folds step be :thinking:? I was thinking of having clean_corpus.py output an np.array, so that this generate folds script could pick up at getting the length for number_of_sentences
(pasted below for reference) ...
import numpy as np
from sklearn.model_selection import KFold

# clean_corpus, tokenize_sentences, and validate_sentences are defined elsewhere in this script.

def generate_word2vec_folds(corpus='Empty', folds=3, seed=10, min_sentence_length=10):
    '''
    Generates a series of text files that each represent a training or test split of the text data. Since word2vec
    does not conduct any calculations that rely on interactions across sentence boundaries, this cross-validation
    k-fold generator splits the text by sentence and then groups randomly chosen sentences into the same corpus.
    :param corpus: entire corpus to work on
    :type corpus: bytearray, str, or mixed bytes and str
    :param folds: how many train/test folds to make
    :type folds: int
    :param seed: random seed for the random number generator used to make the folds
    :type seed: int
    :param min_sentence_length: minimum sentence length that is considered valid
    :type min_sentence_length: int
    :return: list of dicts, one per fold, each with 'train' and 'test' arrays of sentences
    '''
    # Tokenize the corpus into sentences because we need to get a random sample of sentences from the resulting list.
    cleaned_corpus = clean_corpus(corpus)  # remove stray characters from corpus
    tokenized_corpus = tokenize_sentences(cleaned_corpus)  # split into sentences
    tokenized_corpus = validate_sentences(tokenized_corpus, min_sentence_length)  # keep only sentences that are >= min_sentence_length and start with a capital
    tokenized_corpus = np.array(tokenized_corpus)
    # >>>> delete everything before this line, make the filename a param and import the tokenized
    # corpus from a .npy file or however would require minimal adjustments <<<<
    number_of_sentences = len(tokenized_corpus)
    # Modern sklearn API; versions before 0.18 used KFold(n=number_of_sentences, n_folds=folds, ...)
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    corpus_split = []
    for train_index, test_index in kf.split(tokenized_corpus):
        corpus_split.append({'train': tokenized_corpus[train_index], 'test': tokenized_corpus[test_index]})
    return corpus_split
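To make the fold structure concrete, here is a rough numpy-only sketch of the shuffled k-fold split that the function above delegates to sklearn's `KFold`, run against a toy stand-in corpus (the sentence strings here are placeholders, not real data):

```python
import numpy as np

# Toy stand-in corpus: one "sentence" per element, like tokenized_corpus above.
sentences = np.array([f"sentence {i}" for i in range(9)])

folds, seed = 3, 10
rng = np.random.RandomState(seed)
indices = rng.permutation(len(sentences))

corpus_split = []
for test_index in np.array_split(indices, folds):
    # Everything not in this fold's test set becomes its training set.
    train_index = np.setdiff1d(indices, test_index)
    corpus_split.append({'train': sentences[train_index],
                         'test': sentences[test_index]})

# Each element of corpus_split is one train/test fold; every sentence
# lands in exactly one test set across the 3 folds.
```

Each fold's train and test arrays partition the corpus, which is what makes the later word2vec evaluation honest: no test sentence is ever seen during training for that fold.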
Yeah, or maybe newline-delimited text so it's easy to read in. I don't love pickles because they're very version-specific, and I don't love the idea of a library-dependent format (I believe numpy has some sort of storage format), especially since our array is 1-D anyway, so we don't get much advantage from the special numpy format. But I could be convinced on the latter point if you want.
I'm also pro the newline-delimited text option because I think eventually, maybe soon, we're going to have to do both this step and the previous step in a streaming fashion to keep things computationally manageable across different corpus sizes.
good point, will output newline delimited text!
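A minimal sketch of that newline-delimited handoff, assuming one cleaned sentence per line (the file name and sample sentences here are illustrative, not the project's actual ones):

```python
import os
import tempfile
import numpy as np

sentences = ["This is the first sentence.", "Here is another one.", "And a third."]

# Preprocess step: write one cleaned sentence per line.
path = os.path.join(tempfile.mkdtemp(), "clean_corpus.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))

# Fold-generation step: read the corpus back with no pickle or numpy format needed.
with open(path, encoding="utf-8") as f:
    tokenized_corpus = np.array([line.rstrip("\n") for line in f if line.strip()])

number_of_sentences = len(tokenized_corpus)
```

This keeps the interchange format human-inspectable and independent of any library version, which was the concern with pickles above.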
Separate corpus cleaning from the generate-folds step so that it is actually a preprocessing step in our normal workflow. That way we can have a clean corpus that we port around without having to redo the cleaning every time we generate a new set of folds.