SeanTater / albemarle

Semantic search and topic modeling for zombies
BSD 3-Clause "New" or "Revised" License

Tokenization #1

Open SeanTater opened 8 years ago

SeanTater commented 8 years ago
>>> Tokens.toWords "I am the very model of a modern major general.."
["I", "am", "the", "very", "model", "of", "a", "modern", "major", "general", ".."]

Consider the following text, mentioned by He and Kayaalp (2006):

Independent of current body composition, IGF-I levels at 5 yr were significantly associated with rate of weight gain between 0-2 yr (beta = 0.19; P < 0.0005), and children who showed postnatal catch-up growth (i.e. those who showed gains in weight or length between 0-2 yr by >0.67 SD score) had higher IGF-I levels than other children (P = 0.02).

We expect it to be tokenized as follows, where spaces delimit tokens:

Independent of current body composition , IGF - I levels at 5 yr were significantly associated with rate of weight gain between 0 - 2 yr ( beta = 0.19 ; P < 0.0005 ) , and children who showed postnatal catch - up growth ( i.e. those who showed gains in weight or length between 0 - 2 yr by > 0.67 SD score ) had higher IGF - I levels than other children ( P = 0.02 ) .

SeanTater commented 8 years ago

After looking at how ICU defines tokens, I think it's probably wiser to split on hyphens as well (which runs counter to what I thought before).
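
For reference, here is a minimal sketch of a tokenizer that reproduces both expected outputs above, hyphen splitting included. It assumes Haskell (going by the doctest-style example) and uses only `Data.Char` from base. It is not the project's actual `Tokens.toWords`; the helpers `spanWord` and `looksAbbreviated` are hypothetical names, and the abbreviation rule is a guess that happens to fit "i.e.", not something taken from ICU or from He and Kayaalp.

```haskell
import Data.Char (isAlphaNum, isDigit, isSpace)

-- | Split text into word and punctuation tokens (illustrative sketch only).
--   * whitespace separates tokens and is dropped
--   * a run of the same punctuation character is one token ("..")
--   * hyphens are punctuation, so "catch-up" becomes "catch", "-", "up"
toWords :: String -> [String]
toWords [] = []
toWords s@(c:cs)
  | isSpace c    = toWords cs
  | isAlphaNum c = let (w, rest) = spanWord s in w : toWords rest
  | otherwise    = let (p, rest) = span (== c) s in p : toWords rest

-- | Take one word from the front of the input.
--   A period stays inside the word when an alphanumeric follows it
--   ("0.19", "i.e"); a trailing period is kept only for abbreviations.
spanWord :: String -> (String, String)
spanWord = go ""
  where
    -- acc holds the word so far, in reverse order
    go acc (c:cs)
      | isAlphaNum c = go (c : acc) cs
    go acc ('.':c:cs)
      | isAlphaNum c = go (c : '.' : acc) cs
    go acc ('.':cs)
      | looksAbbreviated acc = (reverse ('.' : acc), cs)
    go acc rest = (reverse acc, rest)

    -- Assumed heuristic: an internal period and no digits, e.g. "i.e"
    looksAbbreviated acc = '.' `elem` acc && not (any isDigit acc)
```

Loaded in GHCi, `toWords` applied to the first example returns the list shown at the top of this issue, and `unwords . toWords` applied to the He and Kayaalp sentence gives the space-delimited rendering. One known weakness of the heuristic: an abbreviation such as "e.g." at the very end of a sentence would absorb the sentence-final period.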