Open thatbudakguy opened 2 years ago
another possibility is leaving out non-phonetic tokens entirely and using an Alignment, so that e.g.:
doc.text
>>> "北冥有魚,其名為鯤。"
doc._.phonemes
>>> "pok meang hjuwX ngjo tshen mjieng sjew kwon"
doc[4].text
>>> ","
doc[4]._.phonemes
>>> None
Doc._.phon is the tensor representing all the phonological data for the doc
Doc._.phon_ is a string (analogous to doc.text) that passes the phonological data through the configured transcription
Token._.phon is a vector view into the doc phonological data for a single token
Token._.phon_ is a string that transcribes a single token
might need to subclass Alignment to allow for null/dangling tokens, if we care about that. docs say:
The current implementation of the alignment algorithm assumes that both tokenizations add up to the same string. For example, you’ll be able to align ["I", "'", "m"] and ["I", "'m"], which both add up to "I'm", but not ["I", "'m"] and ["I", "am"].
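a minimal sketch of what "allowing null/dangling tokens" could look like, in plain Python rather than as a subclass of spaCy's Alignment. the helper names (is_phonetic, align_phonemes) are hypothetical, and it assumes one syllable per phonetic token, which holds for the character-level example above but not in general:

```python
import unicodedata
from typing import List, Optional


def is_phonetic(token: str) -> bool:
    """Assumption: a token is non-phonetic if it is only punctuation/whitespace."""
    return any(
        not (unicodedata.category(ch).startswith("P") or ch.isspace())
        for ch in token
    )


def align_phonemes(tokens: List[str], syllables: List[str]) -> List[Optional[int]]:
    """Map each token index to a syllable index, or None for dangling tokens.

    Hypothetical simplification: exactly one syllable per phonetic token.
    """
    alignment: List[Optional[int]] = []
    next_syllable = 0
    for token in tokens:
        if is_phonetic(token):
            alignment.append(next_syllable)
            next_syllable += 1
        else:
            # punctuation etc. has no phonological data
            alignment.append(None)
    if next_syllable != len(syllables):
        raise ValueError("number of phonetic tokens and syllables differ")
    return alignment


tokens = list("北冥有魚,其名為鯤。")
syllables = "pok meang hjuwX ngjo tshen mjieng sjew kwon".split()
alignment = align_phonemes(tokens, syllables)
# per-token transcription, with None for tokens that have no phonemes
phonemes = [None if i is None else syllables[i] for i in alignment]
```

here phonemes[4] (the comma) comes out as None, matching the doc[4]._.phonemes example, while the two tokenizations no longer need to add up to the same string.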
set_annotations: just set the annotations on the Doc instead of the Tokens
initialize?
get_aligned_phonemes?
example_from_phonemes_dict to generate the correct alignment?
Doc._.phon is the tensor representing all the phonological data for the doc (floats2d). iterating over it yields a single floats1d per syllable (row)
Doc._.syllables is an iterator over Syllable objects; one per row in Doc._.phon
Doc._.phon_ is a string (analogous to Doc.text) that passes the phonological data through the configured transcription provider
Token._.phon is an aligned view into the doc phonological data for a single token (floats2d), which is one or multiple syllables. iterating over it yields a single floats1d per syllable (row)
Token._.syllables is an iterator over Syllable objects; one per row in Token._.phon
Token._.phon_ is a string (analogous to Token.text) that passes the phonological data through the configured transcription provider
Span._.phon is an aligned view into the doc phonological data for a contiguous range of tokens (floats2d), which is multiple syllables
Span._.syllables is an iterator over Syllable objects; one per row in Span._.phon
Span._.phon_ is a string (analogous to Span.text) that passes the phonological data through the configured transcription provider
och-g2p
)
spacy's general design philosophy is that the Doc owns the data and Spans and Tokens are just views of this data. it makes sense to replicate this, especially to handle cases where the phoneme data doesn't cleanly align to Tokens (for which we could maybe even employ Alignment).
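a rough sketch of the "Doc owns the data, Tokens and Spans are views" idea, in plain Python with no spaCy or thinc involved. the class name PhonDoc, the offsets field, and the feature values are all made up for illustration; a real version would presumably use floats2d arrays and extension attributes instead of nested lists:

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = List[float]  # stand-in for a floats1d: one syllable's features


@dataclass
class PhonDoc:
    """Hypothetical container that owns all phonological data for a doc."""

    tokens: List[str]
    phon: List[Row]  # stand-in for floats2d: one row per syllable
    # (start, end) row offsets per token; start == end means no syllables
    offsets: List[Tuple[int, int]]

    def token_phon(self, i: int) -> List[Row]:
        """Rows belonging to token i (empty for non-phonetic tokens)."""
        start, end = self.offsets[i]
        # slicing copies the outer list only; the rows themselves are shared
        return self.phon[start:end]

    def span_phon(self, start_token: int, end_token: int) -> List[Row]:
        """Rows belonging to tokens[start_token:end_token]."""
        if start_token >= end_token:
            return []
        start = self.offsets[start_token][0]
        end = self.offsets[end_token - 1][1]
        return self.phon[start:end]


# placeholder feature values, not real phonological data
doc = PhonDoc(
    tokens=["北", "冥", ","],
    phon=[[0.1, 0.2], [0.3, 0.4]],
    offsets=[(0, 1), (1, 2), (2, 2)],
)
```

with this layout doc.token_phon(2) is an empty list for the comma, and a token that covers several syllables would simply get a wider (start, end) range, so the data never has to be split up per token.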