Closed · xrotwang closed this issue 6 years ago
Yes it would. In fact, this is the approach taken by Michael, e.g.
https://github.com/unicode-cookbook/recipes/blob/master/Dutch/code/nld10K_corrected_profile.tsv
using the "Left" and "Right" column keywords. The context information is necessary for some trickier orthographic parsing, as in Dutch. No additional symbols are used to mark initial or final position, but perhaps an additional column with that information would be helpful?
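For concreteness, a context-sensitive profile row might look like this (a hypothetical fragment in the spirit of the Dutch profile linked above, not copied from it):

Grapheme	IPA	Left	Right
e	ə	g	n

i.e. `e` maps to `ə` only when the preceding grapheme is `g` and the following one is `n`.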
In the initial OP creation, the R implementation also adds the columns CodePoint and UnicodeName, to help disambiguate lookalikes and such.
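As a sketch of what such annotation could look like (plain Python with `unicodedata`; the exact column format of the R implementation may differ):

```python
import unicodedata

def annotate(grapheme):
    """Return (CodePoint, UnicodeName) column values for a grapheme.

    Multi-character graphemes get space-separated values. This is a
    sketch of the idea, not the R implementation's actual output.
    """
    codepoints = ' '.join('U+%04X' % ord(c) for c in grapheme)
    names = ' '.join(unicodedata.name(c, '<unknown>') for c in grapheme)
    return codepoints, names

print(annotate('a'))   # ('U+0061', 'LATIN SMALL LETTER A')
print(annotate('th'))  # ('U+0074 U+0068', 'LATIN SMALL LETTER T LATIN SMALL LETTER H')
```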
Ah, I see. So I guess what I had in mind for the initial OP creation would then look like:

Left	Grapheme	Right
^	a	b c d e
b c d e	a	b c d e
b c d e	a	$

for a grapheme `a` that appears word-initially followed by one of `b c d e`, etc. The `b c d e` sets should actually go into differently named columns, because I'd only suggest taking initial and final position into account when transforming.
As far as I can tell, being able to treat initials and finals separately would only require keeping track of position here: https://github.com/bambooforest/segments/blob/master/segments/tokenizer.py#L381
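A toy illustration of the kind of position tracking meant here (my own sketch, not the code at the linked line in segments/tokenizer.py):

```python
def tokenize(word, rules):
    """Greedy tokenizer aware of word-initial position (a toy sketch,
    not the actual segments implementation).

    `rules` maps graphemes to replacements; a key prefixed with '^'
    matches only at the start of the word and takes precedence there.
    """
    out, i = [], 0
    while i < len(word):
        match = None
        # longest graphemes first; an anchored match overrides a plain one
        for g in sorted(rules, key=len, reverse=True):
            anchored = g.startswith('^')
            plain = g[1:] if anchored else g
            if (anchored and i != 0) or not word.startswith(plain, i):
                continue
            if match is None or anchored:
                match = (plain, rules[g])
        if match is None:  # unknown character: pass it through
            out.append(word[i])
            i += 1
        else:
            out.append(match[1])
            i += len(match[0])
    return ' '.join(out)

print(tokenize('ath', {'th': 'tH', 'a': 'b', '^a': 'A'}))  # A tH
print(tokenize('tha', {'th': 'tH', 'a': 'b', '^a': 'A'}))  # tH b
```

With this, `a` only maps to `A` at position 0, without requiring any `^`/`$` markers in the input string.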
So you want to keep track just of whether the grapheme is word-initial or word-final? Otherwise, don't we have to keep track of the state of all environments across all words in the input to do something like the `b c d e` columns above? This could get quite large.
Yes, only initial or final. The `b c d e` information I'd only put into the profile as additional data gathered when creating it. So it should not go into a `Left` or `Right` column that is supposed to be taken into account when segmenting.
Sounds ok and potentially useful for lexibank, I'm assuming. We do say in the book that our R and Python implementations will differ.
I guess, the more I think about this, the more we need it. I just tested a VERY SIMPLE workaround that enables me to use the feature already:

So even with this, we can already use the feature, but we could consider supporting it in the specification by adding '^ -> NULL' and '$ -> NULL' automatically to the profile. The problem is, of course, that we need to know where to apply this information, since we don't know which other columns somebody wants to use, right?
>>> from segments.tokenizer import *
>>> mylines = [{'Grapheme': 'th', 'IPA': 'tH'},
...            {'Grapheme': '^', 'IPA': 'NULL'},
...            {'Grapheme': '$', 'IPA': 'NULL'},
...            {'Grapheme': 'a', 'IPA': 'b'},
...            {'Grapheme': '^a', 'IPA': 'A'}]
>>> prf = Profile(*mylines)
>>> t = Tokenizer(prf)
>>> t.transform('th')
'th'
>>> t.transform('th', 'IPA')
'tH'
>>> t.transform('^ath', 'IPA')
'A tH'
>>> t.transform('ath', 'IPA')
'b tH'
>>> t.transform('^atha$', 'IPA')
'A tH b'
But we could also just add another cookbook entry to lexibank, telling people how to adjust the init scripts to use this specific profile, right? After all, this only involves a slightly expanded profile and a modified input string.
I'd say the canonical date for implementation of this feature is early May in Jena :)
I'm asking myself whether we actually even need this, given that the workaround is so simple?
The workaround we have so far is the one shown above; this does the trick, so maybe it's enough to simply document it online?
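If we document it, the boilerplate could be wrapped in a small helper, something like this (a sketch; `with_boundaries` and `strip_null` are made-up names, not part of the segments API):

```python
def with_boundaries(profile_rows, word, column='IPA'):
    """Extend profile rows with '^ -> NULL' and '$ -> NULL' mappings and
    wrap the input word in ^...$, so that boundary-sensitive graphemes
    like '^a' can match. A sketch, not part of segments itself.
    """
    rows = list(profile_rows) + [
        {'Grapheme': '^', column: 'NULL'},
        {'Grapheme': '$', column: 'NULL'},
    ]
    return rows, '^%s$' % word

def strip_null(segmented):
    """Drop NULL segments from a tokenized string."""
    return ' '.join(s for s in segmented.split() if s != 'NULL')

rows, padded = with_boundaries([{'Grapheme': 'a', 'IPA': 'b'}], 'ath')
print(padded)                        # ^ath$
print(strip_null('NULL A tH NULL'))  # A tH
```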
It would be informative if an orthography profile created from text would also show all possible symbols appearing before or after each grapheme, maybe as columns `preceding` and `following`, possibly using `^` and `$` to mark initial and final position?
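Gathering that information from a word list is straightforward. A per-character sketch (real profile creation would operate on graphemes, and `preceding`/`following` here are just the proposed column labels):

```python
from collections import defaultdict

def contexts(words):
    """Collect, for every symbol in a word list, the sets of symbols
    that precede and follow it, using '^' and '$' as word-boundary
    markers. A sketch of the proposed 'preceding'/'following' columns.
    """
    preceding = defaultdict(set)
    following = defaultdict(set)
    for word in words:
        padded = '^' + word + '$'
        for left, sym, right in zip(padded, padded[1:], padded[2:]):
            preceding[sym].add(left)
            following[sym].add(right)
    return preceding, following

p, f = contexts(['ab', 'ba'])
print(sorted(p['a']))  # ['^', 'b']
print(sorted(f['a']))  # ['$', 'b']
```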