A library of state-of-the-art metrics for measuring linguistic complexity, developed by ContentSide and CRITT at Kent State University.
LingX is:
How does LingX generally work?
LingX calculates token-based and segment-based complexity metrics for monolingual and bilingual text. It internally parses a given text into a dependency grammar graph. Using this graph and other linguistic information such as part-of-speech tags, it calculates psycholinguistic, linguistic, and translation process metrics. See the reference section for detailed information.
LingX uses Stanza, a state-of-the-art NLP library, for its language-specific tasks. Stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages, and it brings state-of-the-art NLP models to those languages.
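To make the idea of scoring tokens from a dependency graph concrete, the following minimal sketch counts, for each token, the dependency arcs that are still open once that token has been read. It is only an illustration of the general approach, not LingX's implementation: the sentence, the hand-written head indices, and the `open_dependencies` function are assumptions made for this example, and the counts are not meant to reproduce the IDT scores that LingX reports. The `(id, text, head)` layout mirrors the 1-based word ids and head indices (0 for the root) that Stanza's dependency parser exposes.

```python
# Toy dependency parse of "John met the senator ." written by hand.
# Each entry is (id, text, head); ids are 1-based and head 0 marks the root.
# These values are assumptions for illustration, not output from Stanza or LingX.
PARSE = [
    (1, "John", 2),
    (2, "met", 0),
    (3, "the", 4),
    (4, "senator", 2),
    (5, ".", 2),
]


def open_dependencies(parse):
    """For each token, count arcs with one end at or before it and the other after it."""
    # Store every non-root arc as (leftmost position, rightmost position).
    arcs = [(min(i, h), max(i, h)) for i, _, h in parse if h != 0]
    scores = []
    for position, text, _ in parse:
        still_open = sum(1 for left, right in arcs if left <= position < right)
        scores.append([text, still_open])
    return scores


token_scores = open_dependencies(PARSE)
print(token_scores)
print(sum(score for _, score in token_scores))
```

The result has the same `[token, score]` shape as a token-based metric list, which can then be aggregated into a single segment-level value.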
The project is based on Stanza 1.2.1 and Python 3.6+. If you do not have Python 3.6 or later, install it first. Then, in your preferred virtual environment, run:
pip install lingx
If you are running the project in a Jupyter Notebook or Google Colab environment, run the following command instead:
!pip install lingx
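After installing, it can be useful to confirm that the environment meets the requirements stated above. The snippet below is a small sketch using only the Python standard library; the helper names `python_is_supported` and `installed_version` are made up for this example and are not part of LingX.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum Python version stated in the requirements above


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


def installed_version(package):
    """Return the installed version string of `package`, or None if unavailable."""
    try:
        from importlib import metadata  # available from Python 3.8 onward
    except ImportError:
        return None  # on older interpreters, check with `pip show` instead
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("lingx version:", installed_version("lingx"))
    print("stanza version:", installed_version("stanza"))
```

A value of `None` for either package means it is not installed in the active environment, or that the interpreter is too old for `importlib.metadata`.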
Let's run a simple token-based psycholinguistic metric, incomplete dependency theory (IDT), as a test. All you need to do is import the relevant functions and run the following code:
from lingx.utils import download_lang_models
from lingx.core.lang_model import get_nlp_object
from lingx.utils.lx import get_sentence_lx
nlp_en = get_nlp_object("en", use_critt_tokenization = False, package="partut")
# Named `text` rather than `input` to avoid shadowing the Python built-in.
text = "The reporter who the senator who John met attacked disliked the editor."
tokens_scores_list, aggregated_score = get_sentence_lx(
    text,
    nlp_en,
    result_format="segment",
    complexity_type="idt",
    aggregation_type="sum")
print(f"Tokens Scores List == {tokens_scores_list}")
print(f"Aggregated Score == {aggregated_score}")
This should print the list of incomplete dependency theory (IDT) scores with their tokens, followed by the score aggregated with the sum function:
Tokens Scores List == [['The', 1], ['reporter', 2], ['who', 3], ['the', 4], ['senator', 3], ['who', 4], ['John', 5], ['met', 2], ['attacked', 2], ['disliked', 2], ['the', 3], ['editor', 1], ['.', 0]]
Aggregated Score == 32
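The aggregated score is a summary of the per-token list, so other summaries can be derived from the same output. The sketch below works on the list printed above using plain Python; the helper name `summarize` and the extra statistics (mean and peak token) are illustrative assumptions, not options documented by LingX.

```python
from statistics import mean

# Token scores copied from the example output above.
tokens_scores_list = [
    ["The", 1], ["reporter", 2], ["who", 3], ["the", 4], ["senator", 3],
    ["who", 4], ["John", 5], ["met", 2], ["attacked", 2], ["disliked", 2],
    ["the", 3], ["editor", 1], [".", 0],
]


def summarize(scores):
    """Return the sum, mean, and highest-scoring token of a [token, score] list."""
    values = [score for _, score in scores]
    peak_token, peak_score = max(scores, key=lambda pair: pair[1])
    return {
        "sum": sum(values),
        "mean": mean(values),
        "peak": (peak_token, peak_score),
    }


summary = summarize(tokens_scores_list)
print(summary["sum"])   # 32, matching the aggregated score above
print(summary["peak"])  # ('John', 5)
```

The highest score falls on `John`, the token inside the doubly nested relative clause.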
We provide a set of quick tutorials to get you started with the library:
The tutorials explain how the base metrics can be obtained. Let us know if anything is unclear.
The CRITT Translation Process Research Database (TPR-DB) is released under a Creative Commons license (CC BY-NC-SA). Note that the EN-ZH_IMBst18 database available in this GitHub repository belongs to the CRITT TPR-DB.
Please cite the following paper (https://doi.org/10.1007/978-3-030-98404-5_49):
@InProceedings{10.1007/978-3-030-98404-5_49,
author="Zou, Longhui
and Carl, Michael
and Mirzapour, Mehdi
and Jacquenet, H{\'e}l{\`e}ne
and Vieira, Lucas Nunes",
editor="Kim, Jong-Hoon
and Singh, Madhusudan
and Khan, Javed
and Tiwary, Uma Shanker
and Sur, Marigankar
and Singh, Dhananjay",
title="AI-Based Syntactic Complexity Metrics and Sight Interpreting Performance",
booktitle="Intelligent Human Computer Interaction",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="534--547",
isbn="978-3-030-98404-5"
}
For IDT-based and DLT-based complexities, please cite this paper:
@incollection{mirzapour2020,
title={Measuring Linguistic Complexity: Introducing a New Categorial Metric},
author={Mirzapour, Mehdi and Prost, Jean-Philippe and Retor{\'e}, Christian},
booktitle={Logic and Algorithms in Computational Linguistics 2018 (LACompLing2018)},
pages={95--123},
year={2020},
publisher={Springer}
}
Please email your questions or comments to Mehdi Mirzapour.
LingX is licensed under the MIT License (MIT). Copyright © 2021 ContentSide and CRITT at Kent State University.