Allegra-Cohen / grid

GNU General Public License v3.0

Corpus pre-processing #44

Allegra-Cohen closed 1 year ago

Allegra-Cohen commented 1 year ago

We shouldn't count on users to split their corpus into documents themselves. I think we should implement sentence-level pre-processing that lets users upload multiple files and, as a bonus, produces a row labels file where the labels are the filenames (or some kind of header on each file, where header = row label).
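A rough sketch of what this could look like, not the project's actual implementation. The function names `split_sentences` and `preprocess_files` are hypothetical, and the regex is a naive stand-in for a real sentence tokenizer (it will mis-split abbreviations such as "e.g."). It returns one sentence per row plus a parallel list of row labels taken from the filenames.

```python
import re
from pathlib import Path

# Naive boundary: '.', '!' or '?' followed by whitespace.
# Assumption: a real tokenizer would replace this.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences on terminal punctuation followed by whitespace."""
    parts = _SENTENCE_END.split(text.strip())
    return [p.strip() for p in parts if p.strip()]


def preprocess_files(paths):
    """Return (sentences, labels): one sentence per row, labeled with its source filename."""
    sentences, labels = [], []
    for path in map(Path, paths):
        for sentence in split_sentences(path.read_text(encoding="utf-8")):
            sentences.append(sentence)
            labels.append(path.name)
    return sentences, labels
```

Writing `labels` out one per line would give a default row labels file aligned with the sentence rows.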

Some people may want to segment their corpus differently, e.g. into paragraphs instead of sentences. Let's assume that anyone who wants that will either code the preprocessor themselves or split the corpus manually. If they write their own preprocessor, we'll need to make sure the cleaning is the same as ours.
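One way to keep the cleaning identical, sketched under assumptions: keep segmentation pluggable and run every segment through a single shared cleaning function. The names below (`clean_segment`, `split_paragraphs`, `segment_and_clean`) are hypothetical, and the cleaning shown (collapse whitespace, lowercase) is only a placeholder for whatever cleaning the project actually applies.

```python
import re


def clean_segment(segment):
    """Placeholder cleaning: collapse whitespace runs and lowercase."""
    return re.sub(r"\s+", " ", segment).strip().lower()


def split_paragraphs(text):
    """Example of a user-supplied segmenter: split on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def segment_and_clean(text, segmenter):
    """Apply any segmenter, then the same cleaning to every segment."""
    return [clean_segment(s) for s in segmenter(text)]
```

For example, `segment_and_clean(text, split_paragraphs)` yields cleaned paragraphs, and passing a sentence splitter instead would yield cleaned sentences from the same cleaning code.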

Allegra-Cohen commented 1 year ago

Keith and I have added support for processing corpora into sentences and generating a default row labels file where labels are filenames.