BIDS-collaborative / destress

Helping @peparedes with text analysis of LiveJournal data
ISC License

Pre-process data for BIDMach #2

Closed davclark closed 9 years ago

davclark commented 9 years ago

From John:

I suggest running the current XML tokenizers to get started, e.g. xmltweet. That produces files that BIDMach consumes directly.

No need to do regexp parsing in the tokenizer; we'll do it later.

The tweet tokenizer already has a regexp in it that should tokenize most emoticons.
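
For anyone unfamiliar with the idea, here is a rough Python sketch of regexp-based emoticon tokenization. It is not the pattern used in xmltweet: the `EMOTICON` pattern, `TOKEN_RE`, and the `tokenize` helper are all hypothetical and only cover a handful of common Western-style emoticons.

```python
import re

# Hypothetical pattern (NOT the one in xmltweet): eyes, optional nose, mouth,
# e.g. :) ;-) :D =( plus the heart <3. The lookaheads avoid matching inside
# things like "time:Park" or "x<30".
EMOTICON = r"(?:[:;=]-?[)(DPp|](?!\w)|<3(?!\d))"

# Try emoticons first, then words (with an optional apostrophe part),
# then any single non-space punctuation character.
TOKEN_RE = re.compile(EMOTICON + r"|\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into emoticon, word, and punctuation tokens."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("so tired :-( but <3 this"))
```

Because the emoticon alternative comes first in the pattern, `:-(` stays one token instead of being split into three punctuation marks.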

@peparedes has run one (of many!) files through the tokenizer.

coryschillaci commented 9 years ago

Closing as duplicate of the tasks in issue 3.