direct-phonology / phoNy

phonology in spaCy!
MIT License

refactor phonemizer to operate at Doc level #16

Open thatbudakguy opened 2 years ago

thatbudakguy commented 2 years ago

spaCy's general design philosophy is that the Doc owns the data, and Spans and Tokens are just views of that data. it makes sense to replicate this for phoneme data, especially to handle cases where the phonemes don't align cleanly to Tokens (for which we could maybe even use spaCy's Alignment class).

thatbudakguy commented 2 years ago

another possibility is leaving out non-phonetic tokens entirely and using an Alignment, so that e.g.:

doc.text            # "北冥有魚,其名為鯤。"
doc._.phonemes      # "pok meang hjuwX ngjo tshen mjieng sjew kwon"
doc[4].text         # ","
doc[4]._.phonemes   # None
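
a rough sketch of the idea in plain Python, without spaCy, to show the data model: the phonemes live in one document-level list, and each token only holds an optional index into it. the names `is_phonetic`, `align_phonemes` and `token_phonemes` are hypothetical, and the sketch assumes one token per character and exactly one phoneme string per phonetic token, which real data may not satisfy.

```python
import unicodedata
from typing import List, Optional, Sequence


def is_phonetic(token: str) -> bool:
    """Assumption: a token is non-phonetic if it is only punctuation/whitespace."""
    return not all(unicodedata.category(ch)[0] in "PZ" for ch in token)


def align_phonemes(
    tokens: Sequence[str], phonemes: Sequence[str]
) -> List[Optional[int]]:
    """Map each token index to a phoneme index, or None for dangling tokens."""
    phonetic_indices = [i for i, tok in enumerate(tokens) if is_phonetic(tok)]
    if len(phonetic_indices) != len(phonemes):
        raise ValueError(
            f"{len(phonetic_indices)} phonetic tokens but {len(phonemes)} phonemes"
        )
    mapping: List[Optional[int]] = [None] * len(tokens)
    for phoneme_idx, token_idx in enumerate(phonetic_indices):
        mapping[token_idx] = phoneme_idx
    return mapping


# Document-level data: the "doc" owns both lists and the mapping.
tokens = list("北冥有魚,其名為鯤。")
phonemes = "pok meang hjuwX ngjo tshen mjieng sjew kwon".split()
token_to_phoneme = align_phonemes(tokens, phonemes)


def token_phonemes(i: int) -> Optional[str]:
    """Token-level view: look up the phoneme via the document-level mapping."""
    j = token_to_phoneme[i]
    return None if j is None else phonemes[j]


print(token_phonemes(0))  # pok
print(token_phonemes(4))  # None (the comma has no phoneme)
```

in spaCy itself the document-level list would presumably live in a custom extension attribute on the Doc, with the token-level lookup exposed through a getter; that wiring is left out here.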
thatbudakguy commented 2 years ago

might need to subclass Alignment to allow for null/dangling tokens, if we care about that. the spaCy docs say:

The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you’ll be able to align ["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not ["I", "'m"] and ["I", "am"].
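
to see why dropping non-phonetic tokens runs into this, here is a small plain-Python check of the documented assumption. `add_up_to_same_string` is a hypothetical helper that only approximates the constraint as quoted; it is not how spaCy implements the check internally.

```python
from typing import Sequence


def add_up_to_same_string(a: Sequence[str], b: Sequence[str]) -> bool:
    """True if both tokenizations concatenate to the same string."""
    return "".join(a) == "".join(b)


# The two examples from the quoted docs.
print(add_up_to_same_string(["I", "'", "m"], ["I", "'m"]))  # True
print(add_up_to_same_string(["I", "'m"], ["I", "am"]))      # False

# The phoneme case: leaving out punctuation changes the concatenated string,
# so the two sequences no longer satisfy the assumption.
tokens = list("北冥有魚,其名為鯤。")
phonetic_only = [tok for tok in tokens if tok not in {",", "。"}]
print(add_up_to_same_string(tokens, phonetic_only))         # False
```

so any token left out on one side breaks the "same string" assumption, which is the motivation for a subclass (or a separate mapping) that tolerates dangling tokens.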

thatbudakguy commented 2 years ago

phonologizer