Closed by davclark 9 years ago
From John:
I suggest running the current XML tokenizers to get started, e.g. xmltweet. That produces files that BIDMach consumes directly. There's no need to do regexp parsing in the tokenizer; we'll do it later. The tweet tokenizer already has a regexp in it that should tokenize most emoticons.
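For anyone unfamiliar with the idea, here is a rough sketch of what regexp-based emoticon tokenization can look like. This is not xmltweet's actual pattern or code: the `EMOTICON` pattern and the `tokenize` helper are hypothetical, written only to illustrate keeping emoticons as single tokens.

```python
import re

# Hypothetical emoticon pattern (an assumption, NOT the regexp used by xmltweet):
# eyes, an optional nose, then a mouth, e.g. ":)", ";-)", "=P", ":-D".
EMOTICON = r"[:;=][-o*']?[)\](\[DPp/\\|]"

# Try emoticons first, then words (with an optional apostrophe part),
# then any single remaining non-space symbol.
TOKEN = re.compile(EMOTICON + r"|\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into tokens, keeping each emoticon as one token."""
    return TOKEN.findall(text)

if __name__ == "__main__":
    print(tokenize("great game :-) see you ;)"))
```

Because the emoticon alternative comes first in the pattern, a sequence like `:-)` is kept whole instead of being split into three punctuation tokens.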
@peparedes has run one file (of many!) through the tokenizer.
Closing as duplicate of the tasks in issue 3.