This adds a new pipeline, lex_chinese, which works similarly to lex but for Chinese. It extracts the following features:
Narrative length: Number of sentences, number of characters, and mean sentence length.
Frequency metrics: Type-token ratio, mean and median word frequencies.
POS counts: For each part-of-speech category, the number of tokens with that tag in the utterance and that count divided by the total number of tokens. Also includes some special ratios, such as the pronoun / noun and noun / verb ratios.
Tree statistics: Max, median, and mean heights of all CFG parse trees in the narration.
CFG counts: Number of occurrences for each of the 60 most common CFG production rules from the constituency parse tree.
It requires the file top_chinese_cfg.txt (a copy can be found here) to be uploaded as an external dependency in order to work.
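
As a rough illustration of the feature definitions listed above, here is a minimal Python sketch. It is not the pipeline's actual code: the function names, the nested-tuple tree representation, the Penn Chinese Treebank style tags in the example, and the choice to skip lexical (tag-to-word) productions are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical helpers illustrating the feature definitions; not the pipeline's code.
# A parse tree is assumed to be a nested tuple (label, child, ...), with words as str leaves.

def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def pos_features(tags):
    """Per-tag count and per-tag count divided by the total number of tokens."""
    counts = Counter(tags)
    total = len(tags)
    features = {}
    for tag, count in counts.items():
        features[tag + "_count"] = count
        features[tag + "_ratio"] = count / total
    return features

def tree_height(tree):
    """Height of a parse tree, counting a word leaf as height 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max((tree_height(child) for child in tree[1:]), default=0)

def production_counts(tree):
    """Count productions such as 'VP -> VV NP'; tag-to-word productions are skipped (assumption)."""
    counts = Counter()
    stack = [tree]
    while stack:
        node = stack.pop()
        children = [child for child in node[1:] if not isinstance(child, str)]
        if not children:
            continue  # preterminal node: its only children are words
        right_side = " ".join(child[0] for child in children)
        counts[node[0] + " -> " + right_side] += 1
        stack.extend(children)
    return counts

# Example input: one short sentence with illustrative tags and a hand-written parse.
tokens = ["我", "喜欢", "猫", "我", "喜欢", "狗"]
tags = ["PN", "VV", "NN", "PN", "VV", "NN"]
tree = ("IP",
        ("NP", ("PN", "我")),
        ("VP", ("VV", "喜欢"), ("NP", ("NN", "猫"))))

ttr = type_token_ratio(tokens)
pos = pos_features(tags)
height = tree_height(tree)
rules = production_counts(tree)
```

Under these assumptions, the special ratios would be simple quotients of the per-tag counts (for example, the pronoun count divided by the noun count), and the CFG features would keep only the counts for rules named in top_chinese_cfg.txt.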