fnielsen / ordia

Wikidata lexemes presentations
https://ordia.toolforge.org
Apache License 2.0

Build text-to-lexemes variant for phrases with N words #78

Open Daniel-Mietchen opened 4 years ago

Daniel-Mietchen commented 4 years ago

This would make it possible to capture more complex constructs, such as matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (Q1792222).

Ideally, the user could set lower and upper bounds for N.
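To make the idea concrete, here is a minimal sketch of generating phrase candidates of N words with user-set bounds. The function name `word_ngrams` and the parameters `min_n` and `max_n` are hypothetical and not part of Ordia; whitespace splitting is an assumed stand-in for whatever tokenizer the tool actually uses.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all phrases of min_n..max_n consecutive words in text.

    Hypothetical helper: splits on whitespace only, so dashed words
    such as "time-of-flight" stay together as one word.
    """
    if min_n < 1 or max_n < min_n:
        raise ValueError("require 1 <= min_n <= max_n")
    words = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


phrases = word_ngrams("time-of-flight mass spectrometry", min_n=2, max_n=3)
```

Each resulting phrase could then be looked up against lexeme forms or item labels in the same way single words are today.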

Daniel-Mietchen commented 3 years ago

One way to go about this would be to allow certain non-text characters (e.g. something like %22) in the input and to retain them rather than strip them away, much like dashes are already retained today.
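A rough sketch of how that might look, assuming %22 (the URL-encoded double quote) is meant to delimit a phrase that should survive as a single token. The name `tokenize_with_phrases` and the regular expression are illustrative assumptions, not existing Ordia code.

```python
import re
from urllib.parse import unquote

# Either a double-quoted phrase or a run of non-whitespace characters.
TOKEN_PATTERN = re.compile(r'"([^"]+)"|(\S+)')


def tokenize_with_phrases(query):
    """Split a query into tokens, keeping quoted phrases intact.

    Hypothetical helper: %22 is decoded to a double quote first.
    """
    text = unquote(query)
    tokens = []
    for phrase, word in TOKEN_PATTERN.findall(text):
        token = phrase.strip() if phrase else word
        if token:
            tokens.append(token)
    return tokens


tokens = tokenize_with_phrases("%22mass spectrometry%22 is useful")
```

Unquoted words are still split on whitespace, so existing behavior for plain text would be unchanged under this assumption.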

fnielsen commented 3 years ago

I have been thinking about using newlines as a possible hack for tokenizing.
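As a sketch of that hack, assuming each non-empty input line is treated as one token regardless of how many words it contains (the function name `tokenize_lines` is hypothetical):

```python
def tokenize_lines(text):
    """Treat every non-empty line as a single token.

    Hypothetical helper: surrounding whitespace is trimmed and
    blank lines are skipped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


tokens = tokenize_lines("matrix-assisted laser desorption\n\nmass spectrometry\n")
```

The user would then control phrase boundaries simply by where they put line breaks.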