refactor phonemizer to operate at Doc level

thatbudakguy commented 2 years ago

spaCy's general design philosophy is that the Doc owns the data, and Spans and Tokens are just views of that data. it makes sense to replicate this, especially to handle cases where the phoneme data doesn't align cleanly to Tokens (for which we could maybe even employ Alignment).

thatbudakguy commented 2 years ago

another possibility is leaving out non-phonetic tokens entirely and using an Alignment, so that e.g.:

doc.text              # "北冥有魚，其名為鯤。"
doc._.phonemes        # "pok meang hjuwX ngjo tshen mjieng sjew kwon"
doc[4].text           # "，"
doc[4]._.phonemes     # None
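A minimal sketch of how that mapping could be computed, assuming one syllable per non-punctuation token and one token per character. The helper name `align_phonemes` is hypothetical and is not part of spaCy or this project; punctuation is detected with the standard library's Unicode categories.

```python
import unicodedata


def align_phonemes(tokens, phonemes):
    """Map each token to one syllable, or to None if it is punctuation.

    Assumption for illustration: every non-punctuation token corresponds
    to exactly one whitespace-separated syllable in `phonemes`.
    """
    syllables = iter(phonemes.split())
    aligned = []
    for token in tokens:
        # Unicode general categories starting with "P" are punctuation.
        is_punct = all(unicodedata.category(ch).startswith("P") for ch in token)
        if is_punct:
            aligned.append(None)
            continue
        syllable = next(syllables, None)
        if syllable is None:
            raise ValueError("ran out of syllables before tokens")
        aligned.append(syllable)
    if next(syllables, None) is not None:
        raise ValueError("more syllables than phonetic tokens")
    return aligned


tokens = list("北冥有魚，其名為鯤。")
phonemes = "pok meang hjuwX ngjo tshen mjieng sjew kwon"
aligned = align_phonemes(tokens, phonemes)
print(aligned[4])  # the token "，" carries no phonemes
```

The punctuation tokens stay in the token sequence but map to `None`, so the phoneme string itself never has to contain placeholders for them.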

thatbudakguy commented 2 years ago

Doc._.phon is the tensor representing all the phonological data for the doc
Doc._.phon_ is a string (analogous to doc.text) that passes the phonological data through the configured transcription
Token._.phon is a vector view into the doc phonological data for a single token
Token._.phon_ is a string that transcribes a single token
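A rough sketch of that ownership model in plain Python, with no spaCy dependency. `PhonDoc` and `PhonToken` are hypothetical stand-ins for Doc and Token, and the feature vectors and lookup table are made up for illustration. In spaCy itself the same shape would presumably be registered with `Doc.set_extension` and `Token.set_extension(..., getter=...)`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class PhonDoc:
    """Stand-in for a Doc: it owns all of the phonological data."""

    words: List[str]
    phon: List[Optional[Vector]]  # one vector per token, None if non-phonetic
    transcribe: Callable[[Vector], str]  # the configured transcription

    @property
    def phon_(self) -> str:
        # Analogous to doc.text: transcribe every vector that exists.
        return " ".join(self.transcribe(v) for v in self.phon if v is not None)

    def __getitem__(self, i: int) -> "PhonToken":
        return PhonToken(self, i)


@dataclass
class PhonToken:
    """Stand-in for a Token: stores only a back-reference and an index."""

    doc: PhonDoc
    i: int

    @property
    def phon(self) -> Optional[Vector]:
        # A view: the same object the doc holds, not a copy.
        return self.doc.phon[self.i]

    @property
    def phon_(self) -> Optional[str]:
        vector = self.phon
        return None if vector is None else self.doc.transcribe(vector)


# Made-up two-dimensional "features" and a toy transcription table.
table = {(1.0, 0.0): "pok", (0.0, 1.0): "meang"}
doc = PhonDoc(
    words=["北", "冥", "，"],
    phon=[[1.0, 0.0], [0.0, 1.0], None],
    transcribe=lambda v: table[tuple(v)],
)
print(doc.phon_)
print(doc[0].phon_)
```

Because the token only keeps an index into the doc, swapping the transcription or updating the doc-level data is reflected in every token without touching the tokens themselves.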

thatbudakguy commented 2 years ago

might need to subclass Alignment to allow for null/dangling tokens, if we care about that. the spaCy docs say:

The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you’ll be able to align ["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not ["I", "'m"] and ["I", "am"].
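To make the constraint concrete, here is a simplified character-offset alignment in plain Python. It is not spaCy's algorithm (the real entry point is `spacy.training.Alignment.from_strings`), and the function names are hypothetical. It only illustrates why the two tokenizations need to add up to the same string.

```python
def same_text(a, b):
    """True if both tokenizations concatenate to the same string."""
    return "".join(a) == "".join(b)


def char_to_token(tokens):
    """For each character position, the index of the token that contains it."""
    return [i for i, token in enumerate(tokens) for _ in token]


def align(a, b):
    """For each token in `a`, list the indices of the tokens in `b` it overlaps."""
    if not same_text(a, b):
        raise ValueError("tokenizations do not add up to the same string")
    b_index = char_to_token(b)
    result = []
    position = 0
    for token in a:
        covered = b_index[position:position + len(token)]
        result.append(sorted(set(covered)))
        position += len(token)
    return result


print(align(["I", "'", "m"], ["I", "'m"]))
print(same_text(["I", "'m"], ["I", "am"]))
```

A phoneme sequence that skips punctuation never concatenates to the same string as the token texts, which is why null or dangling tokens would need handling beyond this kind of alignment.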

thatbudakguy commented 2 years ago

phonologizer

[ ] make set_annotations just set the annotations on the Doc instead of the Tokens
[ ] maybe update initialize?

training

[ ] maybe update get_aligned_phonemes?
[ ] update example_from_phonemes_dict to generate the correct alignment?

tokens

[ ] Doc._.phon is the tensor representing all the phonological data for the doc (floats2d). iterating over it yields a single floats1d per syllable (row)
[ ] Doc._.syllables is an iterator over Syllable objects; one per row in Doc._.phon
[ ] Doc._.phon_ is a string (analogous to Doc.text) that passes the phonological data through the configured transcription provider
[ ] Token._.phon is an aligned view into the doc phonological data for a single token (floats2d), which is one or multiple syllables. iterating over it yields a single floats1d per syllable (row)
[ ] Token._.syllables is an iterator over Syllable objects; one per row in Token._.phon
[ ] Token._.phon_ is a string (analogous to Token.text) that passes the phonological data through the configured transcription provider
[ ] Span._.phon is an aligned view into the doc phonological data for a contiguous range of tokens (floats2d), which is multiple syllables
[ ] Span._.syllables is an iterator over Syllable objects; one per row in Span._.phon
[ ] Span._.phon_ is a string (analogous to Span.text) that passes the phonological data through the configured transcription provider

data (och-g2p)

[ ] make sure you are generating valid training data here
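One possible way to back the token and span views in the checklist is to store all syllable rows once on the doc, together with a per-token syllable count. The sketch below uses plain lists in place of floats2d. `DocPhon` and its methods are hypothetical names, the numbers are made up, and only `Syllable` is taken from the checklist.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

Row = List[float]


@dataclass(frozen=True)
class Syllable:
    """Wrapper around one row of phonological features."""

    features: Tuple[float, ...]


class DocPhon:
    """Doc-level storage: one row per syllable plus per-token syllable counts."""

    def __init__(self, rows: List[Row], lengths: List[int]) -> None:
        if sum(lengths) != len(rows):
            raise ValueError("lengths must account for every row")
        self.rows = rows
        # offsets[i]:offsets[i + 1] is the row range that belongs to token i.
        self.offsets = [0]
        for n in lengths:
            self.offsets.append(self.offsets[-1] + n)

    def token_phon(self, i: int) -> List[Row]:
        # With a NumPy array this slice would be a view; with plain lists
        # it is a shallow copy of the outer list.
        return self.rows[self.offsets[i]:self.offsets[i + 1]]

    def span_phon(self, start: int, end: int) -> List[Row]:
        # A contiguous token range maps to a contiguous row range.
        return self.rows[self.offsets[start]:self.offsets[end]]

    def syllables(self, rows: Optional[List[Row]] = None) -> Iterator[Syllable]:
        for row in (self.rows if rows is None else rows):
            yield Syllable(tuple(row))


# Four syllables over four tokens; token 2 (e.g. punctuation) has none.
doc_phon = DocPhon(rows=[[0.1], [0.2], [0.3], [0.4]], lengths=[1, 2, 0, 1])
print(doc_phon.token_phon(1))
print(list(doc_phon.syllables(doc_phon.span_phon(0, 2))))
```

With this layout, set_annotations would only need to write the rows and the lengths once per doc, and a token with no phonological data simply gets an empty range rather than a placeholder row.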

direct-phonology / phoNy