davidzbiral opened this issue 1 year ago
The first mockup by @pondrejk
@adammertel : I think this is quite viable as a mockup. A couple of comments:
I am also tagging @GideonK - if you have any comments or contributions, please let us know. Of course, the actual implementation of the segmentation will need your expertise - i.e., to recognize what counts as one statement's text. Generally, a statement is defined by one predicate, so the task will be to identify the predicate and its dependencies. Punctuation helps a lot, since we are working with editions that have modern-style punctuation (unlike many medieval mss.).
@adammertel , you are probably counting on collaborating with Gideon on the actual segmentation?
BTW Gideon, is "segmentation" a good term here? Or is it tokenization? (Though I think that term is used more for segmentation into words than into clauses.)
@davidzbiral @adammertel
Updated mockup
Changes:
the segmented lines now form their own text fields that can be edited freely
Good move. But @pondrejk and @adammertel , segmentation by a character separator will not work very well - for this to save any significant amount of time, we will need dependency parsing (with the help of @GideonK). The dependency parsing either needs to be hooked directly into InkVisitor (optimal) or done separately, with a distinct separator added to the output.
More generally, did you consider how this should be done in light of InkVisitor's future full-text management features, so that we look towards the future? Ultimately, the idea is that InkVisitor will contain the plain text of the Subterritory as some kind of field of the T entity, and we will mark which parts we are modelling in this Statement (sometimes discontinuously). In the background, this should add some XML-like markup into the plain text carrying the UUID of the statement that models this text. Mind that sometimes the text is, I must say, separated by inserted clauses (as the "I must say" segment in this very sentence demonstrates).
So I'd like us to devise something compatible with this future solution; you will find the related issues by searching for the full-text tag.
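For illustration, the background markup could look something like the following (the tag name and attribute are purely hypothetical; the point is that one statement's anchors can recur around an inserted clause):

```
sometimes <st id="uuid-1">the text is</st>, <st id="uuid-2">I must say</st>, <st id="uuid-1">separated by inserted clauses</st>
```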
@davidzbiral I am not sure how much effort it would take to implement the dependency parsing mechanisms in InkVisitor. Are you, at this moment, okay with the character separator plus manual editing of the statements, or do you want to postpone this feature until we can implement the dependency parsing?
I am currently comparing pre-processing pipelines for Latin against each other, which includes checking the quality of the part-of-speech tags etc. that would inform a parser in the next step and directly affect its quality. There are a few different approaches, but I have found one that seems suitable for Latin, at least for the purpose of dependency parsing in InkVisitor:
It may be better to use a sentence boundary detection tool (it should be implemented as part of the Stanza pipeline) instead of splitting on a specific punctuation mark, since a fixed mark handles abbreviations incorrectly, and some sentences may have unusual endings (single-character ellipses, etc.). Quite possibly we don't even need a separate tool, as (from the paper) "Semgrex reads dependency trees from CoNLL-U files or parses dependencies from raw text using the associated CoreNLP parser."
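A minimal sketch of the Stanza-based splitting, assuming the Latin models are installed (which UD package works best, e.g. ITTB vs. PROIEL, would still need testing):

```python
# Minimal sketch: sentence boundary detection for Latin with Stanza.
# Assumes `pip install stanza` and a one-time stanza.download("la").
import stanza

nlp = stanza.Pipeline(lang="la", processors="tokenize")

text = "Dixit quod vidit hereticos. Et adoravit eos ibi."
for sentence in nlp(text).sentences:
    print(sentence.text)
```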
If we can get this running, the next step will be building the rules that decide how sentences are split into statements, including examples such as the one provided by David. As I am neither a Latin expert nor completely familiar with the UD tagset and grammar, this may need to be built up from the ground. From a quick look (and, in my view, a little easier to understand from the DependencyMatcher link), the operators can be quite useful.
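To make the rule idea concrete, here is a hedged sketch using spaCy's DependencyMatcher; the Latin model name (LatinCy's la_core_web_sm) is an assumption, and the rule itself is deliberately trivial:

```python
# Sketch only: a DependencyMatcher rule pairing a VERB with an nsubj dependent.
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("la_core_web_sm")  # assumed Latin pipeline (LatinCy)
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "pred", "RIGHT_ATTRS": {"POS": "VERB"}},
    {"LEFT_ID": "pred", "REL_OP": ">", "RIGHT_ID": "subj",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher.add("PREDICATE_WITH_SUBJECT", [pattern])

doc = nlp("Bernardus dixit quod vidit hereticos.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])
```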
If the statements can be edited and re-segmented (I would say that's the word - tokenization is just the separation of words from punctuation), that will help to deal with less-than-perfect parsing. But if the parser performs poorly, it might not be worth it.
@GideonK Thank you for the knowledgeable answer. Unfortunately, we are using JavaScript on both the backend and the frontend; incorporating a Python package would therefore mean a new level of complexity.
Now I see two options. Either (i) use an existing JS package, for example wink-nlp or nnsplit, but unfortunately neither of them supports Latin. Or (ii) use a dumb method of splitting the text on given characters (e.g. `.`, `,`, `;`) and depend on the user to edit the result manually.
Since this function is not going to be used daily, and even good language models cannot perfectly accommodate CASTEMO principles, I would go with (ii).
Using regex would be ideal, since it can be incorporated into any programming language quickly, but I am not sure how production-ready the idea is.
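For what it's worth, a minimal sketch of option (ii); the helper name is illustrative, and the same pattern carries over to JavaScript almost unchanged, which is the appeal of the approach:

```python
# Naive option (ii): split after ".", "," or ";" and let the user fix the rest.
import re

def naive_segment(text: str, separators: str = ".,;") -> list[str]:
    # Split on whitespace that follows a separator character, keeping the
    # separator attached to the preceding segment.
    pattern = rf"(?<=[{re.escape(separators)}])\s+"
    return [s for s in re.split(pattern, text) if s.strip()]

print(naive_segment("Dixit quod vidit hereticos; et adoravit eos ibi."))
# -> ['Dixit quod vidit hereticos;', 'et adoravit eos ibi.']
```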
@adammertel This might work as long as there are not many abbreviations in the text; regex is not sufficient otherwise. Some algorithms are more language-agnostic and could be looked into. But if we expect just a certain set of characters to look out for, then sure, (ii) should be fine.
Another option could be to pre-split the sentences with e.g. CLTK before they are dropped into InkVisitor, i.e. each one already appears on a separate line and there is no need to incorporate the splitter into the workflow. Or is this contrary to what you are planning?
EDIT:
Well, it should not be too hard to stitch together a Python API script exposed as an HTTP service.
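A minimal sketch of such a service; Flask and the /segment route are my assumptions, and any JS backend could call it over HTTP:

```python
# Sketch: wrap a Python NLP pipeline as a small HTTP service for the JS backend.
from flask import Flask, jsonify, request
import stanza

app = Flask(__name__)
nlp = stanza.Pipeline(lang="la", processors="tokenize")  # load once at startup

@app.post("/segment")
def segment():
    text = request.get_json(force=True).get("text", "")
    doc = nlp(text)
    return jsonify(sentences=[s.text for s in doc.sentences])

if __name__ == "__main__":
    app.run(port=5000)
```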
@adammertel : I do not understand the technicalities. The character separation will not work very well, so the critical point is how conveniently re-segmentation on request is implemented.
Let's not drop the more complex sentence delimitation too easily if it can be done in coordination with @GideonK and also moves DISSINET's NLP strand forward - it should ultimately pay off.
In any case, the solution should take into account the broader horizon that InkVisitor is heading towards - becoming an annotation tool holding full texts (#1531).
Paused because thorough full-text management is in view, but this will be quite needed then. The idea is not to slice the text into text fields, but to prepopulate the full text with statement start and end tags (saving only once confirmed). This can be done immediately, but only committed and pushed to the GitHub corpus repo once confirmed by the user.
Punctuation is not enough - we should use a dependency parsing service. Dependency parsing for Latin was tested with Gideon and is not bad at all => let's use it.
In the first version, we can use the existing text fields in InkVisitor to learn where to segment.
Agreed with Adam: the first version will use punctuation; a good user experience for correcting the automatic segmentation is key.
Step 2 is using dependency parsing (involving Gideon). Step 3 is using machine learning based on existing DISSINET statement text fields.
@adammertel After viewing some tens of examples, I can say that we need dependency parsing; delimiting by punctuation is not viable at all.
But we have a solution. Adam and @GideonK , please meet and work together to implement the clause_text delimitation method developed at Hackathon 6 (i.e., finding spans of text that are candidates for statements by identifying VERB PoS tags and then taking each verb's most distant dependents to the left and to the right in the full text). This is exactly the span we need for this InkVisitor purpose. (And Gideon, for this purpose we also need to delimit the subordinate clause(s) contained within the span. Where you currently enclose them with curly brackets, for InkVisitor we need to delimit statement 1, part 1 with statement 1 anchors, then delimit statement 2 (the subordinate clause), and then return to statement 1, part 2 with a second pair of statement 1 anchors.) I hope this is clear; we can all meet to implement it asap.
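To make the intended method concrete, a hedged sketch of the span-finding step in spaCy terms (the actual Hackathon 6 code may differ); token.subtree yields the verb plus all its dependents in document order, so its first and last elements are the most distant dependents on each side:

```python
# Sketch of the clause_text idea: for every VERB, take the span running from
# its most distant left dependent to its most distant right dependent.
from spacy.tokens import Doc, Span

def candidate_statement_spans(doc: Doc) -> list[Span]:
    spans = []
    for token in doc:
        if token.pos_ == "VERB":
            subtree = list(token.subtree)  # verb + dependents, in text order
            spans.append(doc[subtree[0].i : subtree[-1].i + 1])
    return spans
```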
@adammertel If you think it is a good solution, move this to 1.4.1 - the segmenter is not needed in 1.4.0, but convenient manual editing of statement anchors is. As you wish.
This will still need a more discursive form than an issue list - a document in which to consider, talk, and brainstorm. Created here.
@adammertel I am also signalling that we have ca. 300 gold-standard (manually created) segmentation examples of random verbal observations from four registers (in this folder) which could be used as training data; in Hackathon 11 (16-20 Sep 2024) we should reach something like 700-1000.
@adammertel , @GideonK How is this going? I would hate to postpone this too much, but should it be 1.4.0 or 1.4.1? Are enough human resources available to make it 1.4.0 if we use not the punctuation-based but the dependency-parsing-based approach (on the basis of the clause delimitation done by Gideon)?
@davidzbiral I would move this to 1.4.1, considering that we would like to have 1.4.0 out as soon as possible.
Sorry for only getting back to you now; we did discuss this in person and on Slack. My understanding from your feedback on the performance of the dependency-based statement delimitation in the validation dataframe is that it has issues but is workable. As you know, Adam and I had a meeting, and I have more or less an idea of what needs to happen on my side.
Firstly, the current implementation is embedded in a relation extraction workflow where a sentence is processed and we search for verbs from left to right. For legacy reasons (Agency, such as producing different types of statistics, etc.), a verb that is the head of the sentence (the main verb) is handled by a separate command than the other verbs, but eventually they are all processed the same way. The standard input is therefore a verb and its dependencies, should there be any. Of course the code is modular, so the class that actually performs the tagging can be reused.
Although each token is associated with its sentence through the spaCy doc object, we basically only work with the clause when creating the output tags. Whichever verb is the current focus verb is, from the point of view of the clause, the "main" verb, and it governs whatever appears in that clause.
If I remember correctly, Adam mentioned that we need two distinctive features. First, we need to be able to produce a continuous stream of text that includes tags. This is a different assignment from what we currently have, because we work with focus verbs (each new verb becomes the current focus) and for each one we generated a separate output.
Secondly, my understanding is that a user should be able to select any piece of text, and the suggester will then automatically add suggested tags to it. This will require at least some significant adaptations. Firstly, it is a bad idea to try to parse a string shorter than a full sentence, so we probably need to build a doc object from the surrounding context. Then, when a text is selected, we will have access to the perceived subtree spans that exist within that selection. It might become problematic if we go over sentence boundaries, but we should of course have a default action for each possible user decision. I think it is easiest if the tags are made beforehand in the background and, after any editing, redone on the new selection.
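A rough sketch of that selection flow as I understand it (the helper name and alignment strategy are assumptions): parse the surrounding context once, map the user's character selection onto the doc, and only then look for verb subtrees inside it.

```python
# Sketch: map a user's character-offset selection onto a pre-parsed doc and
# collect the verb subtrees that fall inside it. Names are illustrative.
from spacy.tokens import Doc, Span

def suggest_spans(doc: Doc, sel_start: int, sel_end: int) -> list[Span]:
    # Snap the selection to token boundaries rather than parsing the raw substring.
    selection = doc.char_span(sel_start, sel_end, alignment_mode="expand")
    if selection is None:
        return []  # nothing parseable selected -> no suggestion
    spans = []
    for token in selection:
        if token.pos_ == "VERB":
            subtree = list(token.subtree)
            spans.append(doc[subtree[0].i : subtree[-1].i + 1])
    return spans
```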
Of course, in the future we are planning a suggester trained on the manual outputs, (probably?) decoupled from the dependency graphs but most likely still using at least a part-of-speech tagger for input features.
@GideonK
> However, in a continuous stream of text, we cannot have multiple S1s, etc. in the output.

Yes - of course this will not be S1; the numbering will be continuous from S1 to S9999999999999999999, so a new main clause will not be S1 again. S1 will not repeat in the output that the parser returns. A main clause will still be identifiable by the fact that it is not enclosed in any other tag.
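To illustrate the continuous numbering (the anchor syntax is my guess at what is being described): a document-wide counter hands out statement IDs, and subordination shows up only through tag nesting, never through reused IDs.

```python
# Sketch: serialize nested statements with document-wide continuous numbering.
# The <Sn>...</Sn> anchor format is illustrative, not an agreed-upon syntax.
from itertools import count

def render(node, ids=None):
    """node is {'parts': [...]}, where parts mix plain strings and child nodes."""
    ids = ids if ids is not None else count(1)
    n = next(ids)
    inner = "".join(
        render(p, ids) if isinstance(p, dict) else p for p in node["parts"]
    )
    return f"<S{n}>{inner}</S{n}>"

doc = {"parts": ["the text is, ", {"parts": ["I must say"]}, ", separated"]}
print(render(doc))  # <S1>the text is, <S2>I must say</S2>, separated</S1>
```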
> Secondly, my understanding is that a user should be able to select any piece of text, and the suggester will then automatically add suggested tags to it. This will require at least some significant adaptations. Firstly, it is a bad idea to try to parse a string shorter than a full sentence (...)
This we don't need, I think; the tool will be used by human editors. If no verb is detected, it will return no suggestion. If one is detected, it will work with whatever the dependency parser returns, even if it is nonsense. It will be competent human editors who decide what to select.
Adam has this idea for the implementation: let the user open a dialog where
EDIT: This will benefit from close collaboration with @GideonK.