Open thatbudakguy opened 2 years ago
ultimately this could just be another function of the `Phonemizer`: when the output of the model is just a vector, it's up to the component how to translate that information into phonological data. we could have a new component type that sets phonological properties on tokens, or we could just make this a method available on the `Token` itself, so that the downstream consumer can request either the phonological features or the phonemes themselves from the same source data.
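to make the "same source data" idea concrete, here is a minimal sketch in plain Python. everything in it is an assumption for illustration: the three-feature inventory, the 0.5 threshold, and the function names are hypothetical, and a real component would hang this off the pipeline or the `Token` rather than use module-level functions.

```python
from typing import Dict, Optional, Sequence

# Hypothetical toy inventory: each position in the model's output vector
# is assumed to score one binary distinctive feature.
FEATURES = ("voice", "nasal", "labial")

# Hypothetical lookup from a full feature bundle to a phoneme symbol.
INVENTORY = {
    (False, False, True): "p",
    (True, False, True): "b",
    (True, True, True): "m",
    (False, False, False): "t",
    (True, False, False): "d",
    (True, True, False): "n",
}


def vector_to_features(vector: Sequence[float], threshold: float = 0.5) -> Dict[str, bool]:
    """Threshold each score into a +/- value for the matching feature."""
    if len(vector) != len(FEATURES):
        raise ValueError("vector length must match the feature inventory")
    return {name: score >= threshold for name, score in zip(FEATURES, vector)}


def features_to_phoneme(features: Dict[str, bool]) -> Optional[str]:
    """Look up the phoneme for a feature bundle, or None if it is not in the inventory."""
    key = tuple(features[name] for name in FEATURES)
    return INVENTORY.get(key)


# One vector, two views: the consumer can ask for features or for the phoneme.
scores = [0.9, 0.8, 0.1]
features = vector_to_features(scores)
phoneme = features_to_phoneme(features)
```

the point of the sketch is only that the features and the phoneme are two readings of one vector, so either a separate component or a `Token` method could expose both.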
this becomes synonymous with the existing phonemizer as part of #24; we should rename it `Phonologizer` accordingly.
also with #22 we should make both components respect overwrite/extend config options (as spaCy built-ins do) so that they can work in concert: the rule-based component runs first, then the statistical version runs and fills in the gaps (e.g. polyphones).
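a rough sketch of how the two options could interact, in plain Python. it is loosely modeled on my understanding of how spaCy's `Morphologizer` treats `overwrite` and `extend`, not copied from it; the function name and the dict-of-features representation are hypothetical.

```python
from typing import Dict


def merge_annotations(
    existing: Dict[str, str],
    predicted: Dict[str, str],
    *,
    overwrite: bool,
    extend: bool,
) -> Dict[str, str]:
    """Combine annotations already on a token with a later component's predictions."""
    if existing and extend:
        if overwrite:
            # Keep existing feature types, but predictions win on conflicts.
            merged = dict(existing)
            merged.update(predicted)
        else:
            # Existing values win; predictions only fill in missing features.
            merged = dict(predicted)
            merged.update(existing)
        return merged
    if overwrite or not existing:
        # Replace wholesale, or annotate a token that has nothing yet.
        return dict(predicted)
    # Neither overwrite nor extend: leave the earlier annotation alone.
    return dict(existing)


# Rule-based output first, then the statistical component fills the gaps.
rule_based = {"voice": "+"}
statistical = {"voice": "-", "nasal": "-"}
filled = merge_annotations(rule_based, statistical, overwrite=False, extend=True)
```

with `overwrite=False, extend=True` the statistical component only adds what the rule-based pass left unset, which is the "fills in the gaps" behavior described above.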
this would more properly be called the `Phonologizer`, and it could borrow heavily from spaCy's `Morphologizer`. for reference, see Wikipedia on "distinctive features".