DISSINET / InkVisitor

An open-source, browser-based front-end application for the collection of complex structured data from textual resources in history and the social sciences into a RethinkDB database for further analysis.
BSD 3-Clause "New" or "Revised" License

Extending the data model with part-of-speech attribute #1632

Status: Closed (davidzbiral closed this issue 1 year ago)

davidzbiral commented 1 year ago

Consider adding PoS as an attribute of the Concept model, and automatically imposing the verb PoS in the Action model.

I think we should actually start using the PoS attribute with As (where it is always verb) and Cs (nouns, adjectives, adverbs, and composites thereof), and I opened this issue to debate it on GitHub. It is easy and logical; we should confirm together that we want to do it, and then ask for implementation. It will bring NLP and CASTEMO closer together than they are now. We just have to deal with composites, aka multi-word expressions, i.e. how to classify them. Probably they should be classified by the core component that the other parts modify, but how do we formally identify this core, to which the superclass relates, in the structured data?

adammertel commented 1 year ago

Maybe a stupid question but what is PoS actually?

davidzbiral commented 1 year ago

@adammertel: Part of speech attribute. (I changed the title now.) https://en.wikipedia.org/wiki/Part_of_speech. It is an important category in lexico-semantic networks and dictionaries, and thus in NLP. Adding this attribute to Concepts, and an implicit (hidden) one to Actions (which are always verbs; a few discouraged ones from the very old days are not, and are kept for backward compatibility), would make our lexico-semantic network (aka Cs+As) more aligned with other language resources and more useful in the DISSINET NLP strand.

Technically, it would be one further "in-entity" attribute, like label language. Choice from a predefined list of parts of speech.

adammertel commented 1 year ago

This is not a big technical problem :) You can move it to 1.4 (or, most probably, 1.3.2). The only thing to consider is ensuring that this value gets added to the data collected with the "old" data model. @jancimertel, what about something like a "reparse" script checking the consistency between new additions to the data model and the data in the db? Can you think about how that might work?
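A minimal sketch of what such a "reparse" check could look like for the PoS attribute. The `StoredEntity` shape, the `pos` field name, and the function name are assumptions made for illustration, not the actual InkVisitor model or API. It also assumes that every Action can be treated as a verb, which the thread notes is not true for a few legacy Actions.

```typescript
// Hypothetical entity shape; the real InkVisitor model may differ.
type EntityClass = "A" | "C";

interface StoredEntity {
  id: string;
  class: EntityClass;
  label: string;
  data: { pos?: string }; // "pos" is an assumed field name
}

interface ReparseReport {
  fixed: string[];   // ids where the missing value could be derived
  flagged: string[]; // ids that need a human decision
}

// Fill in the new attribute where it follows from the entity class,
// and report the entities where it cannot be derived.
function reparsePos(entities: StoredEntity[]): ReparseReport {
  const report: ReparseReport = { fixed: [], flagged: [] };
  for (const entity of entities) {
    if (entity.data.pos !== undefined) continue; // already consistent
    if (entity.class === "A") {
      entity.data.pos = "VERB"; // assumption: Actions are verbs by definition
      report.fixed.push(entity.id);
    } else {
      report.flagged.push(entity.id); // Concepts need manual tagging
    }
  }
  return report;
}
```

Running it over data exported from the old model would yield a list of Concepts that still need a tag, without guessing any values for them.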

davidzbiral commented 1 year ago

@adammertel Let's go forward with this. How should we build the dictionary of PoS? @GideonK, could you point me to a good dictionary with abbreviations that are as standard as possible, to build this lexicon? BTW, since it makes no sense anyway to describe the full dependency graph of multi-word units in a linear, non-graph way, for multi-word lexical units we will work on the understanding that the PoS tag describes the head word. And let's implement.

Let us decide how the PoS should be handled in Actions, where it is always Verb. Options: (a) do not attach it, because it holds by definition, and only attach it in DDB2; (b) attach it by default but do not display it; (c) attach and display it but disable editing (probably superfluous, except for showing users what is happening in the background).

GideonK commented 1 year ago

@davidzbiral @adammertel I'm assuming that tags will be added by hand and labels will always be in lemma form? This means that we don't have to worry about inflectional features etc. And also, we will only use a small number of actual tags for the labels, if I understand correctly.

Perhaps, for compatibility with the use of Universal Dependencies, we can opt to use its PoS tagset (UD v. 2). But there are other options: here is a very simple tagset used by the OMNIA project, MedioLatin. Then there is the Lamap tagset (which I think is used by CLTK). Sketch Engine, which is perhaps the most well-known corpus annotation and analysis tool, uses TreeTagger to annotate corpora and uses a simplified version of Lamap.

As for multi-word lemmas, I can think of two approaches:

(1) This involves another tool, but if we had access to phrase-structure parsing, the solution could be simple. A noun phrase, for example, means that the phrase is headed by the noun (be it a common noun, proper noun, etc.). The same goes for verb phrases and verbs. We don't have to select which word is the head, as the "PoS" will be "noun" anyway (if we go down this road).

(2) Dependency parsing: We use the tools we already have, then we just need a set of rules based on how the parser treats phrases. I am already working on that (in some sense) in a Google Sheet. But normally, the word which governs the phrase (i.e. all the other words in the phrase are directly or indirectly dependent on this one) is the head of the phrase, and we can perhaps say that this is the "PoS" of the phrase. Of course not in the true sense, as single PoS tags are mostly not used to describe syntactic trees involving multiple (different) PoS tags. So linguistically, it would not be correct to say that "Bob and Pete" is a noun. But it is a noun phrase, so perhaps down the line, we can incorporate a rule saying that multi-word "nouns" with some features should be labeled "noun phrase" (some multi-word nouns are still nouns, like "Frank Zappa", and perhaps "X de la Z" (if it is a name)).
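The rule described in (2), taking the PoS of the governing word as the "PoS" of the phrase, could be sketched roughly as follows. The `Token` shape and the convention that `head` is `-1` for a token governed from outside the phrase are assumptions for illustration; real parser output would need to be mapped into this form first.

```typescript
// One token of a dependency parse (hypothetical shape).
// `head` is the index of the governing token within the same array,
// or -1 when the governor lies outside the phrase.
interface Token {
  text: string;
  upos: string;
  head: number;
}

// Return the token that governs the phrase, i.e. the only token whose
// own governor is outside the phrase. Returns undefined when there is
// no single such token, so that a person can decide instead.
function phraseHead(tokens: Token[]): Token | undefined {
  const roots = tokens.filter((t) => t.head < 0 || t.head >= tokens.length);
  return roots.length === 1 ? roots[0] : undefined;
}

// "Bob and Pete", attached the way UD treats coordination:
// "Pete" depends on "Bob", and "and" depends on "Pete".
const bobAndPete: Token[] = [
  { text: "Bob", upos: "PROPN", head: -1 },
  { text: "and", upos: "CCONJ", head: 2 },
  { text: "Pete", upos: "PROPN", head: 0 },
];
const suggestedPos = phraseHead(bobAndPete)?.upos;
```

The result is only a suggestion for the tag; whether a phrase like this should instead be labeled "noun phrase" is exactly the open question raised above.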

davidzbiral commented 1 year ago

Yes, added by hand in a new field suggesting the allowed values. The head word is always the lemma form (of course, some of its dependencies are inflected).

In Actions, we happen to use the 3rd person perfect as the lemma, unlike Latin dictionaries, which use the 1st person present.

Let us use the Universal Dependencies tag set. It is a little bit of a stretch to tag multi-word lexical units with tags for single words, I understand that. Perhaps we could use something like NOUN / NOUN PHRASE, ADVERB or ADV phrase, etc.? And if multi-word, it would be the phrase, implicitly and in DDB2.

GideonK commented 1 year ago

It might be worth it to see how WordNet treats multi-word lemmas, as WordNet is probably the closest analogy here? (only with four PoS tags)

davidzbiral commented 1 year ago

Yes, could you inquire into it? But multi-word lemmas are quite rare in WordNet; they are not its focus.

davidzbiral commented 1 year ago

I just don't want to impose complex tagging on historians in the team. If we could use external dep. parsing and only correct, it could be a way of course.

GideonK commented 1 year ago

There are actually many thousands of multi-word lemmas in WordNet, but seemingly none in Latin WordNet. Here are some examples from (Princeton) WordNet (underscores indicate white-space characters):

a: (adjective)

'politically_correct'
'mounded_over'
'marched_upon'
'in_vivo'
'a_posteriori'
'a_la_carte' (only adjective with more than two words)

n: (noun)

'Marston_Moor', 'battle_of_Marston_Moor'
'Manila_grass', 'Japanese_carpet_grass', 'Zoysia_matrella'
'Napoleon_III', 'Emperor_Napoleon_III', 'Charles_Louis_Napoleon_Bonaparte'
'National_Library_of_Medicine', 'United_States_National_Library_of_Medicine', 'U.S._National_Library_of_Medicine'
'Nation_of_Islam'
'New_Delhi', 'Indian_capital', 'capital_of_India'
'Nigerian_monetary_unit'
"Noah's_flood", 'Noachian_deluge', 'Noah_and_the_Flood', 'the_Flood'

v: (verb)

'act_involuntarily', 'act_reflexively'
'feel_like_a_million', 'feel_like_a_million_dollars'
"feather_one's_nest"
"feast_one's_eyes"
'fly_in_the_face_of', 'fly_in_the_teeth_of'
'give_it_a_whirl', 'give_it_a_try'
"keep_one's_eyes_peeled", "keep_one's_eyes_skinned", "keep_one's_eyes_open"
'know_the_score', 'be_with_it', 'be_on_the_ball', "know_what's_going_on", "know_what's_what"
'make_a_point', 'make_sure'

r: (adverb)

'in_due_course', 'in_due_season', 'in_good_time', 'in_due_time', 'when_the_time_comes'
'from_way_back', 'since_a_long_time_ago'
'off_the_record'
'out_of_thin_air', 'out_of_nothing', 'from_nowhere'
'in_the_nick_of_time', 'just_in_time'
'for_all_practical_purposes', 'to_all_intents_and_purposes', 'for_all_intents_and_purposes'
'ex_officio', 'by_right_of_office'

s: (adjective satellite, e.g. "atomic" in "atomic bomb")

'well_thought_out'
'non_compos_mentis', 'of_unsound_mind'
'off_the_hook'
'hot_under_the_collar'
'below_the_belt'
'without_a_stitch'

Note the fifth PoS tag here, 's', which is something I didn't know about, and we also didn't use it in African WordNet. But it seems to be quite rare anyway, and it seems to be something that the creators of WordNet came up with themselves. So what we see here is that idioms and collocations are common, and that they are tagged according to the syntactic function they fulfill, which corresponds to their phrase structure. So, for example, in "It will happen in due course", "in due course" is used adverbially; "politically correct" is an adjectival phrase; "politically correct person" (if it were a lemma) would be a noun phrase, i.e. a noun (WordNet style).
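To illustrate the WordNet convention shown in the examples above (one tag per lemma, underscores standing for spaces), here is a small sketch. The function name and return shape are hypothetical; only the five tag letters and their meanings come from the examples.

```typescript
// The five WordNet PoS letters seen in the examples above.
const WORDNET_POS = {
  n: "noun",
  v: "verb",
  a: "adjective",
  s: "adjective satellite",
  r: "adverb",
} as const;

type WordNetPos = keyof typeof WORDNET_POS;

// Split a lemma on underscores and report whether it is multi-word.
// The single tag describes the function of the whole expression.
function describeLemma(lemma: string, pos: WordNetPos) {
  const words = lemma.split("_");
  return {
    words,
    multiWord: words.length > 1,
    category: WORDNET_POS[pos],
  };
}

const example = describeLemma("in_due_course", "r");
// example.words is ["in", "due", "course"]; example.category is "adverb"
```

The point of the sketch is that the tag is attached to the lemma as a whole, so no head word has to be selected.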

The dependency parse output is not always correct, as we've seen cases where a proper noun was thought to be a verb. So there must be some human control. Perhaps it would be possible to generate a tree of the sentence or phrase in a side window or something (using displaCy perhaps?), but leave the actual choice to the person to make the tag.

davidzbiral commented 1 year ago

@adammertel, let's move forward with this. Please implement this tag set in the model of Cs and As: https://universaldependencies.org/u/pos/. Any Action will have an uneditable "verb" tag. For symmetry with Cs, and to help users understand the data, I suggest displaying it even though it cannot be edited. Any Concept will have a choice (only one option possible, no multichoice) from this list (keep the order I state here): NOUN, ADJ, PRON, ADV, NUM, ADP, CCONJ, SCONJ, DET, INTJ, PART. Put the field underneath "Label language".
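As a rough illustration of this request, a sketch under assumed names (`CONCEPT_POS`, `PosField`, and `posFieldFor` are hypothetical, not the actual InkVisitor code): Concepts get a single-choice list in the stated order, and Actions get a displayed but uneditable VERB value.

```typescript
// Concept choices, in the order requested above.
const CONCEPT_POS = [
  "NOUN", "ADJ", "PRON", "ADV", "NUM", "ADP",
  "CCONJ", "SCONJ", "DET", "INTJ", "PART",
] as const;

type ConceptPos = (typeof CONCEPT_POS)[number];

// Actions always carry this value.
const ACTION_POS = "VERB";

// Hypothetical description of the form field shown under "Label language".
interface PosField {
  value: string;               // empty string means "not chosen yet"
  options: readonly string[];  // single choice, no multichoice
  editable: boolean;
}

function posFieldFor(entityClass: "A" | "C", current?: ConceptPos): PosField {
  if (entityClass === "A") {
    // Displayed for symmetry with Concepts, but not editable.
    return { value: ACTION_POS, options: [ACTION_POS], editable: false };
  }
  return { value: current ?? "", options: CONCEPT_POS, editable: true };
}
```

Keeping the list as a constant array preserves the requested display order and lets the type of allowed Concept values be derived from it.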