davidzbiral opened this issue 1 year ago
The first mockup by @pondrejk
@adammertel : I think this is quite viable as a mockup. A couple of comments:
I am also tagging @GideonK - if you have any comments or contributions, please let us know. Of course, the actual implementation of the segmentation will need your expertise - i.e., to recognize what counts as one statement's text. Generally, a statement is defined by one predicate, so the task will be to identify the predicate and its dependencies. Punctuation helps a lot, since we are working with editions that have modern-style punctuation (unlike many medieval mss.).
@adammertel , you are probably counting on collaborating with Gideon on the actual segmentation?
BTW Gideon, is "segmentation" a good term here? Or is it tokenization? (Though I think that term is used more for segmentation into words than into clauses.)
@davidzbiral @adammertel
Updated mockup
Changes:
the segmented lines now form their own text fields that can be edited freely
Good move. But @pondrejk and @adammertel , segmentation by a character separator will not work very well - for this to save any significant amount of time, we will need dependency parsing (with the help of @GideonK). The dependency parsing either needs to be hooked directly into InkVisitor (optimal) or done separately, with a distinct separator added to the output.
More generally, did you consider how this should be done in light of InkVisitor's future full-text management features, so that we look towards the future? Ultimately, the idea is that InkVisitor will contain the plain text of the Subterritory as some kind of field of the T entity, and we will mark which parts we are modelling in this Statement (sometimes discontinuously). In the background, this should add some XML-like markup into the plain text carrying the UUID of the statement that models this text. Mind that sometimes the text is, I must say, separated by inserted clauses (as the "I must say" segment in this very sentence demonstrates).
So I'd like us to devise something compatible with this future solution; you will find the related issues by searching for the full-text tag.
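For illustration, the background markup could look something like the following (the tag name and attribute are purely hypothetical; the point is that one statement's anchors can recur around an inserted clause):

```
sometimes <st id="uuid-1">the text is</st>, <st id="uuid-2">I must say</st>, <st id="uuid-1">separated by inserted clauses</st>
```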
@davidzbiral I am not sure how much effort it would take to implement the dependency parsing mechanisms in InkVisitor. Are you, at this moment, okay with the character separator plus manual editing of the statements, or do you want to postpone this feature until we can implement the dependency parsing?
I am currently comparing pre-processing pipelines for Latin against each other, which includes checking the quality of the part-of-speech tags etc. that would inform a parser in the next step and directly affect its quality. There are a few different approaches, but I have found one that seems suitable for Latin, at least for the purpose of dependency parsing in InkVisitor:
It may be better to use a sentence boundary detection tool (it should be implemented as part of the Stanza pipeline) instead of splitting on a specific punctuation mark, since a fixed mark handles abbreviations incorrectly, and some sentences may have unusual endings (single-character ellipses, etc.). Quite possibly we don't even need a separate tool, as (from the paper) "Semgrex reads dependency trees from CoNLL-U files or parses dependencies from raw text using the associated CoreNLP parser."
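A minimal sketch of the Stanza-based splitting, assuming the Latin models are installed (which UD package works best, e.g. ITTB vs. PROIEL, would still need testing):

```python
# Minimal sketch: sentence boundary detection for Latin with Stanza.
# Assumes `pip install stanza` and a one-time stanza.download("la").
import stanza

nlp = stanza.Pipeline(lang="la", processors="tokenize")

text = "Dixit quod vidit hereticos. Et adoravit eos ibi."
for sentence in nlp(text).sentences:
    print(sentence.text)
```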
If we can get this running, the next step will be building the rules that decide how sentences are split into statements, including examples such as the one provided by David. As I am neither a Latin expert nor completely familiar with the UD tagset and grammar, this may need to be built up from the ground. From a quick look (and, in my view, a little easier to understand from the DependencyMatcher link), the operators can be quite useful.
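To make the rule idea concrete, here is a hedged sketch using spaCy's DependencyMatcher; the Latin model name (LatinCy's la_core_web_sm) is an assumption, and the rule itself is deliberately trivial:

```python
# Sketch only: a DependencyMatcher rule pairing a VERB with an nsubj dependent.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("la_core_web_sm")  # assumed Latin pipeline (LatinCy)
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "pred", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "pred", "REL_OP": ">", "RIGHT_ID": "subj",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher.add("PREDICATE_WITH_SUBJECT", [pattern])

doc = nlp("Bernardus dixit quod vidit hereticos.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])
```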
If the statements can be edited and re-segmented (I would say that's the word - tokenization is just the separation of words from punctuation), that will help to deal with less-than-perfect parsing. But if the parser performs poorly, it might not be worth it.
@GideonK Thank you for the knowledgeable answer. Unfortunately, we are using JavaScript on both the backend and the frontend; incorporating a Python package would therefore mean a new level of complexity.
Now I see two options. Either (i) use an existing JS package, for example wink-nlp or nnsplit, but unfortunately neither of them supports Latin. Or (ii) use a dumb method of splitting the text on given characters (e.g. `.`, `,`, `;`) and depend on the user to edit the result manually.
Since this function is not going to be used daily, and even good language models cannot perfectly accommodate CASTEMO principles, I would go with (ii).
Using regex would be ideal, since it can be incorporated into any programming language quickly, but I am not sure how production-ready the idea is.
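For what it's worth, a minimal sketch of option (ii); the helper name is illustrative, and the same pattern carries over to JavaScript almost unchanged, which is the appeal of the approach:

```python
# Naive option (ii): split after ".", "," or ";" and let the user fix the rest.
import re

def naive_segment(text: str, separators: str = ".,;") -> list[str]:
    # Split on whitespace that follows a separator character, keeping the
    # separator attached to the preceding segment.
    pattern = rf"(?<=[{re.escape(separators)}])\s+"
    return [s for s in re.split(pattern, text) if s.strip()]

print(naive_segment("Dixit quod vidit hereticos; et adoravit eos ibi."))
# -> ['Dixit quod vidit hereticos;', 'et adoravit eos ibi.']
```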
@adammertel This might work as long as there are not many abbreviations in the text; regex is not sufficient otherwise. Some algorithms are more language-agnostic and could be looked into. But if we expect just a certain set of characters to look out for, then sure, (ii) should be fine.
Another option could be to pre-split the sentences with e.g. CLTK before they are dropped into InkVisitor, i.e. each one already appears on a separate line and there is no need to incorporate the splitter into the workflow. Or is this contrary to what you are planning?
EDIT:
Well, it should not be too hard to stitch together a Python API script exposed as an HTTP service.
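A minimal sketch of such a service; Flask and the /segment route are my assumptions, and any JS backend could call it over HTTP:

```python
# Sketch: wrap a Python NLP pipeline as a small HTTP service for the JS backend.
from flask import Flask, jsonify, request
import stanza

app = Flask(__name__)
nlp = stanza.Pipeline(lang="la", processors="tokenize")  # load once at startup

@app.post("/segment")
def segment():
    text = request.get_json(force=True).get("text", "")
    doc = nlp(text)
    return jsonify(sentences=[s.text for s in doc.sentences])

if __name__ == "__main__":
    app.run(port=5000)
```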
@adammertel : I do not understand the technicalities. The character separation will not work very well, so the critical point is how conveniently re-segmentation on request is implemented.
Let's not drop the more complex sentence delimitation too easily if it can be done in coordination with @GideonK and also moves DISSINET's NLP strand forward - it should ultimately pay off.
In any case, the solution should take into account the broader horizon that InkVisitor is heading towards - becoming an annotation tool holding full texts (#1531).
Paused because thorough full-text management is in view, but this will be quite needed then. The idea is not to slice the text into text fields, but to prepopulate the full text with statement start and end tags (saving only once confirmed). This can be done immediately, but only committed and pushed to the GitHub corpus repo once confirmed by the user.
Punctuation is not enough - we should use a dependency parsing service. Dependency parsing for Latin was tested with Gideon and is not bad at all => let's use it.
In the first version, we can use the existing text fields in InkVisitor to learn where to segment.
Agreed with Adam: the first version will use punctuation; a good user experience for correcting the automatic segmentation is key.
Step 2 is using dependency parsing (involving Gideon). Step 3 is using machine learning based on existing DISSINET statement text fields.
@adammertel After viewing some tens of examples, I can say that we need dependency parsing; delimiting by punctuation is not viable at all.
But we have a solution. Adam and @GideonK , please meet and work together to implement the clause_text delimitation method developed at Hackathon 6 (i.e., finding spans of text that are candidates for statements by identifying VERB PoS tags and then taking each verb's most distant dependents to the left and to the right in the full text). This is exactly the span we need for this InkVisitor purpose. (And Gideon, for this purpose we also need to delimit the subordinate clause(s) contained within the span. Where you currently enclose them with curly brackets, for InkVisitor we need to delimit statement 1, part 1 with statement 1 anchors, then delimit statement 2 (the subordinate clause), and then return to statement 1, part 2 with a second pair of statement 1 anchors.) I hope this is clear; we can all meet to implement it asap.
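To make the intended method concrete, a hedged sketch of the span-finding step in spaCy terms (the actual Hackathon 6 code may differ); token.subtree yields the verb plus all its dependents in document order, so its first and last elements are the most distant dependents on each side:

```python
# Sketch of the clause_text idea: for every VERB, take the span running from
# its most distant left dependent to its most distant right dependent.
from spacy.tokens import Doc, Span

def candidate_statement_spans(doc: Doc) -> list[Span]:
    spans = []
    for token in doc:
        if token.pos_ == "VERB":
            subtree = list(token.subtree)  # verb + dependents, in text order
            spans.append(doc[subtree[0].i : subtree[-1].i + 1])
    return spans
```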
@adammertel If you think it is a good solution, move this to 1.4.1 - the segmenter is not needed in 1.4.0, but convenient manual editing of statement anchors is. As you wish.
This will still need a more discursive form than an issue list - a document in which to consider, talk, and brainstorm. Created here.
@adammertel I am also signalling that we have ca. 300 gold-standard (manually created) segmentation examples of random verbal observations from four registers (in this folder) which could be used as training data; in Hackathon 11 (16-20 Sep 2024) we should reach something like 700-1000.
@adammertel , @GideonK How is this going? I would hate to postpone this too much, but should it be 1.4.0 or 1.4.1? Are enough human resources available to make it 1.4.0 if we use not the punctuation-based but the dependency-parsing-based approach (on the basis of the clause delimitation done by Gideon)?
@davidzbiral I would move this to 1.4.1, considering that we would like to have 1.4.0 out as soon as possible.
Sorry for only getting back to you now; we did discuss this in person and on Slack. My understanding from your feedback on the performance of the dependency-based statement delimitation in the validation dataframe is that it has issues but is workable. As you know, Adam and I had a meeting, and I have more or less an idea of what needs to happen on my side.
Firstly, the current implementation is embedded in a relation extraction workflow where a sentence is processed and we search for verbs from left to right. For legacy reasons (Agency, such as producing different types of statistics, etc.), a verb that is the head of the sentence (the main verb) is handled by a separate command than the other verbs, but eventually they are all processed the same way. The standard input is therefore a verb and its dependencies, should there be any. Of course the code is modular, so the class that actually performs the tagging can be reused.
Although each token is associated with its sentence through the spaCy doc object, we basically only work with the clause when creating the output tags. Whichever verb is the current focus verb is, from the point of view of the clause, the "main" verb, and it governs whatever appears in that clause.
If I remember correctly, Adam mentioned that we need two distinctive features. First, we need to be able to produce a continuous stream of text that includes tags. This is a different assignment from what we currently have, because we work with focus verbs (each new verb becomes the current focus) and for each one we generated a separate output.
Secondly, my understanding is that a user should be able to select any piece of text, and the suggester will then automatically add suggested tags to it. This will require at least some significant adaptations. Firstly, it is a bad idea to try to parse a string shorter than a full sentence, so we probably need to build a doc object from the surrounding context. Then, when a text is selected, we will have access to the perceived subtree spans that exist within that selection. It might become problematic if we go over sentence boundaries, but we should of course have a default action for each possible user decision. I think it is easiest if the tags are made beforehand in the background and, after any editing, redone on the new selection.
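A rough sketch of that selection flow as I understand it (the helper name and alignment strategy are assumptions): parse the surrounding context once, map the user's character selection onto the doc, and only then look for verb subtrees inside it.

```python
# Sketch: map a user's character-offset selection onto a pre-parsed doc and
# collect the verb subtrees that fall inside it. Names are illustrative.
from spacy.tokens import Doc, Span

def suggest_spans(doc: Doc, sel_start: int, sel_end: int) -> list[Span]:
    # Snap the selection to token boundaries rather than parsing the raw substring.
    selection = doc.char_span(sel_start, sel_end, alignment_mode="expand")
    if selection is None:
        return []  # nothing parseable selected -> no suggestion
    spans = []
    for token in selection:
        if token.pos_ == "VERB":
            subtree = list(token.subtree)
            spans.append(doc[subtree[0].i : subtree[-1].i + 1])
    return spans
```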
Of course, in the future we are planning a suggester trained on the manual outputs, (probably?) decoupled from the dependency graphs but most likely still using at least a part-of-speech tagger for input features.
@GideonK
> However, in a continuous stream of text, we cannot have multiple S1s, etc. in the output.

Yes - of course this will not be S1; the numbering will be continuous from S1 to S9999999999999999999, so a new main clause will not be S1 again. S1 will not repeat in the output that the parser returns. A main clause will still be identifiable by the fact that it is not enclosed in any other tag.
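To illustrate the continuous numbering (the anchor syntax is my guess at what is being described): a document-wide counter hands out statement IDs, and subordination shows up only through tag nesting, never through reused IDs.

```python
# Sketch: serialize nested statements with document-wide continuous numbering.
# The <Sn>...</Sn> anchor format is illustrative, not an agreed-upon syntax.
from itertools import count

def render(node, ids=None):
    """node is {'parts': [...]}, where parts mix plain strings and child nodes."""
    ids = ids if ids is not None else count(1)
    n = next(ids)
    inner = "".join(
        render(p, ids) if isinstance(p, dict) else p for p in node["parts"]
    )
    return f"<S{n}>{inner}</S{n}>"

doc = {"parts": ["the text is, ", {"parts": ["I must say"]}, ", separated"]}
print(render(doc))  # <S1>the text is, <S2>I must say</S2>, separated</S1>
```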
> Secondly, my understanding is that a user should be able to select any piece of text, and the suggester will then automatically add suggested tags to it. This will require at least some significant adaptations. Firstly, it is a bad idea to try to parse a string shorter than a full sentence (...)
This we don't need, I think; the tool will be used by human editors. If no verb is detected, it will return no suggestion. If one is detected, it will work with whatever the dependency parser returns, even if it is nonsense. It will be competent human editors who decide what to select.
Adam has this idea for the implementation: let the user open a dialog where
EDIT: This will benefit from close collaboration with @GideonK.