cldf / segments

Unicode Standard tokenization routines and orthography profile segmentation
Apache License 2.0

Add some context info to profile creation #18

Closed xrotwang closed 6 years ago

xrotwang commented 7 years ago

It would be informative if an orthography profile created from text would also show all possible symbols appearing before or after, maybe as columns preceding and following, possibly using ^ and $ to mark initial or final?
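
Purely as illustration of what I mean, the collection could look roughly like this (just a sketch that operates on plain characters rather than graphemes; collect_contexts is made up and not part of segments):

from collections import defaultdict

def collect_contexts(words):
    # For every symbol, record the set of symbols seen immediately before
    # and after it; '^' and '$' stand in for word-initial / word-final.
    preceding, following = defaultdict(set), defaultdict(set)
    for word in words:
        for i, c in enumerate(word):
            preceding[c].add(word[i - 1] if i > 0 else '^')
            following[c].add(word[i + 1] if i < len(word) - 1 else '$')
    return preceding, following

pre, fol = collect_contexts(['ab', 'ba', 'abc'])
# pre['a'] == {'^', 'b'}, fol['a'] == {'b', '$'}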

bambooforest commented 7 years ago

Yes it would. In fact, this is the approach taken by Michael, e.g.

https://github.com/unicode-cookbook/recipes/blob/master/Dutch/code/nld10K_corrected_profile.tsv

using the "Left" and "Right" column keywords. The context information is necessary in some more tricky orthographic parsing, like Dutch. No additional symbols are used to mark initial or final, but perhaps an additional column with that information may be helpful?

In the initial OP creation, the R implementation also adds the columns CodePoint and UnicodeName, to help disambiguate lookalikes and such.
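
For the record, in Python the same two columns can be filled with the standard unicodedata module, e.g. (just a sketch, not something segments does out of the box):

import unicodedata

def describe(grapheme):
    # One entry per code point, so lookalikes such as precomposed characters
    # vs. combining sequences become distinguishable in the profile.
    codepoints = ' '.join('U+{:04X}'.format(ord(c)) for c in grapheme)
    names = ', '.join(unicodedata.name(c, 'UNKNOWN') for c in grapheme)
    return codepoints, names

describe('á')        # ('U+00E1', 'LATIN SMALL LETTER A WITH ACUTE')
describe('a\u0301')  # ('U+0061 U+0301', 'LATIN SMALL LETTER A, COMBINING ACUTE ACCENT')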

xrotwang commented 7 years ago

Ah, I see. So I guess what I had in mind for the initial OP creation would then look like:

Left     Grapheme  Right
^        a         b c d e
b c d e  a         b c d e
b c d e  a         $

for a grapheme a that appears word-initially followed by one of b, c, d, e, ...

xrotwang commented 7 years ago

The b c d e should actually go into differently named columns, because I'd only suggest taking initial and final position into account when transforming.

xrotwang commented 7 years ago

As far as I can tell, being able to treat initials and finals separately would only require keeping track of position here: https://github.com/bambooforest/segments/blob/master/segments/tokenizer.py#L381
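
To illustrate the idea (a heavily simplified sketch, not the actual tokenizer code): a greedy longest-match loop already knows its current index, so the position information is essentially free:

def tokenize(word, graphemes):
    # Greedy longest-match over a set of known graphemes, additionally
    # labelling each match as initial / medial / final.
    result, i = [], 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):
            candidate = word[i:i + length]
            if candidate in graphemes:
                position = ('initial' if i == 0 else
                            'final' if i + length == len(word) else 'medial')
                result.append((candidate, position))
                i += length
                break
        else:
            result.append((word[i], 'unknown'))
            i += 1
    return result

tokenize('atha', {'a', 'th'})
# [('a', 'initial'), ('th', 'medial'), ('a', 'final')]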

bambooforest commented 7 years ago

So you want to keep track just of whether the grapheme is word-initial or word-final? Otherwise, don't we have to keep track of the state of all environments across all words in the input to do something like the b c d e above? That could get quite large.

xrotwang commented 7 years ago

Yes, only initial or final. The b c d e thing I'd only put into the profile as additional information gathered when creating it. So this should not go into a Left or Right column which is supposed to be taken into account when segmenting.

bambooforest commented 7 years ago

Sounds ok and potentially useful for lexibank, I'm assuming. We do say in the book that our R and Python implementations will differ.

LinguList commented 7 years ago

I guess the more I think about this, the more we need it. I just tested a VERY SIMPLE workaround that enables me to use the feature already:

>>> from segments.tokenizer import *
>>> mylines = [{'Grapheme': 'th', 'IPA': 'tH'},
...  {'Grapheme': '^', 'IPA': 'NULL'},
...  {'Grapheme': '$', 'IPA': 'NULL'},
...  {'Grapheme': 'a', 'IPA': 'b'},
...  {'Grapheme': '^a', 'IPA': 'A'}]
>>> prf = Profile(*mylines)
>>> t = Tokenizer(prf)
>>> t.transform('th')
'th'
>>> t.transform('th', 'IPA')
'tH'
>>> t.transform('^ath', 'IPA')
'A tH'
>>> t.transform('ath', 'IPA')
'b tH'
>>> t.transform('^atha$', 'IPA')
'A tH b'

So even with this we can already use the feature, but we could consider supporting it in the specification by adding '^ -> NULL' and '$ -> NULL' to the profile automatically. The problem is, of course, that we need to know where to apply this information, as we don't know which other columns somebody wants to use, right?

But we could also just add another cookbook entry to lexibank, telling people how to adjust the init scripts to use this specific profile, right? After all, this only involves a slightly expanded profile and a modified input string.
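
For illustration, the adjustment in such an init script could be as small as this (a sketch reusing the toy profile from above; a real script would load its own profile file):

from segments.tokenizer import Profile, Tokenizer

profile = Profile(
    {'Grapheme': '^', 'IPA': 'NULL'},
    {'Grapheme': '$', 'IPA': 'NULL'},
    {'Grapheme': 'th', 'IPA': 'tH'},
    {'Grapheme': 'a', 'IPA': 'b'},
    {'Grapheme': '^a', 'IPA': 'A'},
)
tokenizer = Tokenizer(profile)

def segment(form):
    # The "modified input string": wrap the form in the boundary markers
    # before handing it to the tokenizer.
    return tokenizer.transform('^' + form + '$', 'IPA')

segment('atha')  # 'A tH b'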

xrotwang commented 7 years ago

I'd say the canonical date for implementation of this feature is early May in Jena :)

LinguList commented 7 years ago

I'm asking myself whether we actually even need this, given that the workaround is so simple.

LinguList commented 6 years ago

The workaround we have so far is the one shown above; it does the trick, so maybe it is enough to simply document this online?

xrotwang commented 6 years ago

https://github.com/cldf/segments/blob/master/faq.md#how-to-treat-word-initial-or--final-in-a-special-way