Corpus pre-processing

We don't want to rely on users to split their corpus into documents. I think we should implement pre-processing at the sentence level that lets users upload multiple files and, as a bonus, produces a row labels file where the labels are the filenames (or some type of header in each file, where header = row label).
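A minimal sketch of what this could look like, assuming plain-text UTF-8 files and a naive regex sentence splitter. The function names (`split_sentences`, `rows_from_texts`, `rows_from_files`) are hypothetical and not part of the existing code; a real implementation would probably want a proper sentence tokenizer.

```python
import re
from pathlib import Path

# Naive boundary: whitespace that follows ., ! or ? (an assumption; this will
# mis-split abbreviations such as "e.g. this").
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def rows_from_texts(texts):
    """Turn {filename: text} into parallel lists of sentences and row labels.

    Each sentence becomes one row; its label is the name of the file it came from.
    """
    rows, labels = [], []
    for name, text in texts.items():
        for sentence in split_sentences(text):
            rows.append(sentence)
            labels.append(name)
    return rows, labels


def rows_from_files(paths):
    """Read several uploaded files and label each sentence with its filename."""
    texts = {Path(p).name: Path(p).read_text(encoding="utf-8") for p in paths}
    return rows_from_texts(texts)
```

Writing `labels` out one per line would give the row labels file. Note that this sketch keys on the bare filename, so two uploads with the same name from different folders would be merged under one label.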

Some people may want to split their corpus into different segments, e.g. paragraphs instead of sentences. Let's assume that anyone who wants this will either write their own preprocessor or split the corpus manually. If they do it themselves, we'll need to make sure the cleaning applied to their segments is the same as ours.
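One way to keep the cleaning identical is to separate segmentation from cleaning, so any segmenter (ours or a user's) feeds the same cleaning function. The sketch below is only an illustration: `clean`, `split_paragraphs`, and `preprocess` are hypothetical names, and the cleaning rules shown (lowercasing, stripping punctuation, collapsing whitespace) are placeholders for whatever the project actually uses.

```python
import re


def clean(segment):
    """Shared cleaning step (placeholder rules, assumed for illustration)."""
    segment = segment.lower()
    segment = re.sub(r"[^\w\s]", " ", segment)   # punctuation -> space
    return re.sub(r"\s+", " ", segment).strip()  # collapse whitespace


def split_paragraphs(text):
    """Example of a user-supplied segmenter: split on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def preprocess(text, segmenter):
    """Segment with any callable, then apply the same cleaning to every segment."""
    cleaned = (clean(s) for s in segmenter(text))
    return [c for c in cleaned if c]  # drop segments that clean down to nothing
```

A corpus that was split by hand could skip `preprocess` and pass each segment through `clean` directly, which would still give the same cleaning as the sentence-level path.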

Allegra-Cohen / grid

Corpus pre-processing #44