Orange-OpenSource / conllueditor

ConlluEditor is a tool to edit dependency syntax trees in CoNLL-U format.
BSD 3-Clause "New" or "Revised" License

Proposal for mass editing #12

Closed Stormur closed 2 years ago

Stormur commented 3 years ago

Hi! As always, let me first compliment you on this application, which we use for annotation with consistently great results! Great work! :slightly_smiling_face:

I was wondering, since there are already some advanced search functions but no corresponding item on the TODO list, whether it would be possible to add a function for mass editing based on (regular expression) searches, and more generally to include the sentence text among the searchable fields.

Some dummy examples:

To have such a functionality already integrated in a shell that reads and edits CoNLL-U+ files would be great! Or is it already there and I am missing it?

jheinecke commented 3 years ago

Hi! Thanks :-) It's nice to hear that people like and use my tool (I have no idea how many annotators work with it, drop me a note :-))

Mass editing is a good idea (for the time being I do it with more or less intelligent Python programs, but integrating it into the editor would be nice). I will think about it (it has to be rather foolproof to avoid major undetected data corruption). What other kinds of rules can you think of? Rules with conditions on single tokens do not look too difficult to implement. But rules which take the dependencies into account are trickier (change the feature F=V to F=V2 if the head of the word has a dependant case ...). Were you thinking of an interaction with the GUI, or of a script which reads the rules and the CoNLL-U file and outputs the file with the rules applied? If the latter, maybe LORIA's https://grew.fr/ would be a solution?

Concerning CoNLL-U+, there is partial support: the 10 standard CoNLL-U columns must be present, but the editor allows editing any additional column (without testing whether it makes sense; for instance, it does not check whether, in a B-I-O tag which spans multiple tokens, the first starts with B: and the subsequent ones with I:). If you want to edit CoNLL-U+ files which do not contain all 10 standard columns, there is a script bin/conlluconvert.sh

Stormur commented 3 years ago

Our team is, but the numbers vary!

First, the last part: I wasn't asking for a particular treatment for CoNLL-U+, but just referring to the capability of conllueditor to handle both CoNLL-U and CoNLL-U Plus files!

I too am using more or less standardised Python routines, but as the complexity of our annotation work increases, they become more and more intricate: even just reading and printing syntactic trees is not as simple as it seems!

What I am envisioning for now is just on the simpler level: this would already help greatly. We can think that more complex substitutions, involving for example multiword tokens, conditions on siblings, conditions on other relations and so on can be left to the user to solve in other ways for the moment. So I'd just stick to things such as:

Admittedly, I have not yet tried to use grew for graph rewriting, but I'll see what it can do!

Stormur commented 3 years ago

By the way, since you mentioned it: could you point me to a contact in the Grew project? I have started trying it out and have questions on some issues, but I cannot seem to find any contact details. Thanks in advance!

And just as a comment: from what I understand, with a tool like Grew it is not possible to just change e.g. the dependency relation of a node if there is not already an edge, as Grew only works with graphs. But such an edit is a routine for me and it is very useful as a sort of "pre-annotation", and I reiterate that conllueditor would be the perfect place for it! :slightly_smiling_face:

jheinecke commented 3 years ago

I'll see what I can do for the change function you propose. It has to be more foolproof than the rest, since you can easily make an error in the find condition and change things which should stay put. That said, there is always git to get back to the starting point.

When annotating new sentences from scratch (for UD_Welsh) I have trained UDPipe on the existing sentences to make it preannotate the new stuff. And then I correct and validate (with some scripts encoding linguistic knowledge which check checkable things, such as non-ambiguous endings like -odd, always "3 person singular past tense").

For a grew contact, check at the bottom of https://grew.fr/grew_match/help/

Stormur commented 3 years ago

> For a grew contact, check at the bottom of https://grew.fr/grew_match/help/

Thanks! ... I feel somewhat stupid, probably I was too hasty. I am editing part of my previous comment.

> When annotating new sentences from scratch (for UD_Welsh) I have trained UDPipe on the existing sentences to make it preannotate the new stuff. And then I correct and validate (with some scripts encoding linguistic knowledge which check checkable things, such as non-ambiguous endings like -odd, always "3 person singular past tense").

I also have a similar routine, but I noticed that quite often I discover systematic errors of the automatic annotation, or also evidence that prompts me to change already annotated sentences, only while working on the annotation itself! This also gives more control, because I find that "post-production" can often be too rough.

OK, I have bothered you enough, thanks again for all the past and upcoming updates! :slightly_smiling_face:

jheinecke commented 3 years ago

Hi @Stormur, I came across an idea which I think will suit you. I'm implementing a system of rules which (if evaluated as true for a given token) can change some (not all) CoNLL-U columns.

jheinecke commented 3 years ago

Check version 2.12.0. There is a (first) way to change tokens in an entire file using conditions, like ` child(Upos:VERB && Feat:VerbForm=Part) and child(Upos:DET) > Misc:MyKey=NewCal ` Check the doc and tell me what you think. If it is what you need, I will add a search & replace function based on these conditions.

Stormur commented 3 years ago

Hi! Thanks for all the work! We will probably try this feature in an upcoming annotation round, I will let you know!

Stormur commented 3 years ago

OK, I tried it on a CoNLL-U file, but I got this error:

Invalid option --cedit
Conll Error incorrect line: (line 1): Lemma:sum > Upos:AUX (line 1)

There are 2 lines in the correction file, with space-separated fields:

Lemma:sum > Upos:AUX
(Lemma:tuus or Lemma:meus or Lemma:suus or Lemma:noster or Lemma:uoster or Lemma:uester) > Upos:DET

jheinecke commented 3 years ago

Hi, sorry for the late reply (summer ...). Did you try

./bin/replace.sh rules.txt data.conllu

You also have to run mvn install after git pull to get the new code compiled

Stormur commented 2 years ago

Hi again! Yes, I understand summer very well! :smile:

It appears that I forgot to recompile everything. Now I briefly tried to run the mass corrections and I succeeded. Thanks again!

Update: is there a functionality to do that "in place", or did I miss it in the documentation?

Update: Why is the field MISC not available as a key in the expression?

jheinecke commented 2 years ago

I simply forgot the MISC field in the expression, but it is now available; I've just pushed the changes. For the time being only the replace.sh script uses this functionality, but I will add a search-and-replace function in the GUI.

Stormur commented 2 years ago

OK, thanks!

I have another small suggestion: the possibility to include external files in the definition of conditions for rules. For example, imagine that you have to resort to a list of words to decide to apply a given marking: then, one would want to do something like

Lemma:list.txt > Feat:XXX=xxx

We all know that unfortunately some kinds of annotations/corrections are of the idiosyncratic type rather than being definable in a neat way! :smile:

jheinecke commented 2 years ago

That's a good idea. I'll try to do this. I only have to choose a symbol that indicates list.txt is not a lemma but a filename, something like LemmaList:list.txt or Lemma:<list.txt (hoping that no treebank will use lemmas starting with <)

jheinecke commented 2 years ago

Try it. The final syntax is Lemma:#filename.txt and Form:#anotherfilename.txt. I chose # as the symbol since there is no treebank yet which has a form or lemma starting with or containing #.
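To make the idea concrete, here is a minimal Python sketch of how a condition value could be treated either as a literal lemma or as a reference to a word list. This is not the editor's actual code: the function names are hypothetical, and the file layout (one entry per line, blank lines ignored) is an assumption, so check the documentation for the real format.

```python
from pathlib import Path


def load_word_list(path):
    """Read a word list, assuming one entry per line (blank lines ignored)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def lemma_matches(lemma, spec):
    """Check a lemma against a condition value.

    A value starting with '#' is taken as a filename containing a word list,
    anything else is compared literally (hypothetical helper, for illustration).
    """
    if spec.startswith("#"):
        return lemma in load_word_list(spec[1:])
    return lemma == spec
```

With a file list.txt containing meus, tuus and suus on separate lines, lemma_matches("tuus", "#list.txt") would be true and lemma_matches("noster", "#list.txt") false.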

Since you had the idea for this: If I do a search-and-replace function in the GUI would you like it to ask for confirmation at each change or just change anything (like the ./bin/replace.sh rules.txt data.conllu does already)?

Stormur commented 2 years ago

OK, I tried it and it works.

Just two remarks:

Probably both would be needed: a simple replace button which goes through each case (finds one, then replaces or not, and so on), and a "replace all" if one feels like that! Keeping the possibility to revert the change.

jheinecke commented 2 years ago

I had forgotten to add [] to the list of valid characters for feature names; it's now fixed. I also made an empty feature value delete the feature, so Feat:Number= will remove the Number feature from the current word.
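The "empty value deletes the feature" behaviour can be sketched in Python roughly as follows. This only illustrates the semantics described above and is not the editor's implementation; set_feature is a hypothetical name, and the case-insensitive sorting of feature names follows the usual UD convention for the FEATS column.

```python
def set_feature(feats, name, value):
    """Return a FEATS string with `name` set to `value`.

    An empty `value` removes the feature; an empty result becomes "_".
    """
    if feats == "_":
        pairs = {}
    else:
        pairs = dict(item.split("=", 1) for item in feats.split("|"))

    if value:
        pairs[name] = value
    else:
        pairs.pop(name, None)  # silently ignore a feature that is not there

    if not pairs:
        return "_"
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)
```

For example, set_feature("Gender=Masc|Number=Sing", "Number", "") gives "Gender=Masc", and feature names with brackets such as Number[psor] pass through unchanged.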

Stormur commented 2 years ago

Hi! Here I am again about the mass modifier. I am greatly enjoying this feature, thanks & congratulations again!

I am coming with two new remarks:

Thanks again for everything!

jheinecke commented 2 years ago

Hi, thanks for the new ideas :-). I think something like > lemma:token.Form+er won't be difficult to do. However, I did not get the second thing: how do you want to refer to Feat or Misc? In order to change it, you just say > Lemma:... Feat:Number=Sing Misc:Key=Value but this is probably not what you are thinking of...

Stormur commented 2 years ago

I meant, I might want to refer to specific values of e.g. Feat or Misc. For example (just random): if the token is an AUX, then I want its VerbType to be the same as its MISC value Modal, and the UPOS to be VERB. So I envision something like:

UPOS=AUX > upos=VERB and feat:VerbType=token.MISC:Modal

A situation in which this comes handy is when I want to substitute the name of a feature, for example:

!Empty > feat:InflClass=token.feat:NounClass and feat:NounClass=

(noting that if the token has no NounClass feature, the value is empty and in the end nothing happens)

Of course, it need not be limited to the token only, but could allow retrieving values from the head, for example, or maybe even from a child? This last one is more difficult...

jheinecke commented 2 years ago

I'm playing around with something like:

conditions > Lemma:"prefix"+head(Lemma)+"suffix"  upos:"NOUN"

which would set the Lemma to the Lemma of the head prefixed by prefix and suffixed by suffix and change the Upos to NOUN

Your examples would translate as (NB. no and necessary on the right side as is already the case in the current version)

Upos:AUX > upos:"VERB"   feat:"VerbType="+this(Misc_Modal)
!Empty   > feat:"InflClass="+this(Feat_NounClass)   feat:"NounClass="

So the syntax will change in that literals (like VERB) must be enclosed in quotes. Values of other columns can be retrieved by this(columnname). I think a substring() and a replace() function would be useful too; I'll think about that. I only need some time to get it done ...

jheinecke commented 2 years ago

Hi! I have just pushed something which will be of interest to you. N.B. the syntax on the right side of > has changed (see above or, better, doc/mass_editing.md).

Stormur commented 2 years ago

Hi! This closure actually comes right when I was about to write about some features and editing behaviour after using this wonderful function heavily for data base processing! :nerd_face:

So here it goes:

Thanks again for all the support!!! Ad maiora! :rocket:

jheinecke commented 2 years ago

Hi again,

thanks for these remarks. I'll try to address them from easy to difficult:

Concerning your remark about adding HEAD and DEPS (enhanced dependencies) to the condition and new values: this is feasible, but does it make sense? A rule which changes the head to another token will be so specific that it is easier to edit it manually (or train a parser which will do it more or less, and then validate manually). Or I could make a condition like Head:-1 which means "if the head is the preceding token". So rules like

Head:-1 > Feat:"Name=Value"     # if the token has its preceding word as head, then add the feature `Name=Value`
Upos:DET > head:"1"             # if the token has the Upos `DET`, then make the following token its head

I still doubt the usefulness of this, but since it would be rather easy to implement, I can do it. What do you think of this?

Stormur commented 2 years ago

Here again!

Again, thanks a lot for your support, it's really appreciated. And merry Christmas! :christmas_tree:

jheinecke commented 2 years ago

Hi, try version 2.14.1 (git pull ; mvn install). In order to change the Deprel for a token with head 0, use the (new) option --strict with replace.sh. I hope the doc is comprehensible; if not, tell me :-)

Happy New Year

Stormur commented 2 years ago

Hi again!

I was using the mass-editing tool again and enjoying the new features. Now, I get an apparently harmless error when I try this command:

!IsEmpty and !IsMWT > Lemma:"_" Upos:"_" Xpos:"_" Feat:"_"

that is, I want to strip all actual tokens of those four properties. It seems that everything works, but (when calling replace.sh from a bash script) I get this (I am replacing the real paths with dummy ones and changing them a bit):

.../conllueditor/bin/replace.sh: 22: [[: not found
.../conllueditor/bin/replace.sh: 29: [: .../file.conllu: unexpected operator
20727 lines (840 sentences) read

13034 changes for condition: !IsEmpty and !IsMWT  values:  Lemma:"_" Upos:"_" Xpos:"_" Feat:"_"
13034 changes

jheinecke commented 2 years ago

Hi! it looks as if you have used an undocumented mechanism to use older versions :-) bin/replace.sh interprets the first argument as a version-number if it starts with digits. What is the name of your rule file ? I have put all this into comments and pushed a simpler bin/replace.sh. git pull should make it work.

Stormur commented 2 years ago

The filename has no digits, it is Exutor.conllueditor (I am using this moot extension to better sort the files).

Update: I pulled, and now it gives only the error of the kind .../conllueditor/bin/replace.sh: 29: [: .../file.conllu: unexpected operator (but everything works as before).

jheinecke commented 2 years ago

Do you use Linux or Mac and what version of bash is installed? I do not know the ... syntax, it seems that it is stumbling over this. Can you give me the exact line how you call replace.sh ?

Stormur commented 2 years ago

Yes, so, I am using a Linux Ubuntu 20.04.4 LTS "focal fossa". I am using the sh bash and calling this line in a for loop from the bash (again I am readjusting the names for privacy and clarity):

for file in $(ls "$folder/subfolder/prefix"*)
do
    base=$(basename "$file")
    sh "$conllueditor/bin/replace.sh" "$folder/Exutor.conllueditor" "$folder/subfolder/$base" > "$folder/subfolder/newprefix_$base"
done

$conllueditor stores the path of the conllueditor's folder, I am calling from another one.

Just as an aside, I admit that I don't know if the choice of using ls is so wise, but I am not very experienced with bash scripts. I accept suggestions in general! :grimacing:

jheinecke commented 2 years ago

I see the problem: you put sh in front of $conllueditor/bin/replace.sh, so replace.sh is no longer run by /bin/bash (indicated in the first line of replace.sh) but by /bin/sh. At least on my Ubuntu machine, /bin/sh is a symbolic link to dash, and dash produces your error. So either remove sh from your loop or use bash instead. I'd prefer removing it, since shell scripts usually know best which interpreter they need.

jheinecke commented 2 years ago

Instead of for file in $(ls "$folder/subfolder/prefix"*) you can use for file in "$folder/subfolder/prefix"*. The ls is not necessary, though it does no harm here either.
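If you ever want to drive the same loop from Python instead of bash, a rough sketch could look like this. It reuses the names from the loop above (Exutor.conllueditor, subfolder, prefix, newprefix_); the function names build_jobs and run_jobs are made up for illustration. The script is called directly, without sh in front, so that its own first line decides which interpreter runs it (this assumes replace.sh is executable).

```python
import subprocess
from pathlib import Path


def build_jobs(conllueditor, folder, prefix="prefix", new_prefix="newprefix_"):
    """Return one (command, output_path) pair per input file."""
    rules = Path(folder) / "Exutor.conllueditor"
    subfolder = Path(folder) / "subfolder"
    script = Path(conllueditor) / "bin" / "replace.sh"
    jobs = []
    # sorted() also freezes the file list before any output file is created
    for infile in sorted(subfolder.glob(prefix + "*")):
        command = [str(script), str(rules), str(infile)]
        jobs.append((command, subfolder / (new_prefix + infile.name)))
    return jobs


def run_jobs(jobs):
    """Run each command, redirecting its standard output to the output file."""
    for command, outfile in jobs:
        with open(outfile, "w", encoding="utf-8") as out:
            subprocess.run(command, stdout=out, check=True)
```

Building the job list separately from running it makes it easy to print and inspect the commands before anything is overwritten.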

Stormur commented 2 years ago

OK, so now everything works smoother! I adjusted the code as you suggested and it's better now. I don't know why I ended up with ls, since it's just an unnecessarily convoluted way to do a simple thing, but I had probably something else in mind before and just retained the snippet.

Thanks, till the next issue!

Stormur commented 1 year ago

As we are near the UD freeze, I am wildly tinkering with data. I have a couple of additional suggestions for mass editing:

And something which I fear is much more complicated would be to implement a way to rearrange nodes as part of these rules. Probably some memory is needed, or at least a way to indicate the index of a child/head with given characteristics. Or I do not know if I am doing something wrong. Anyway: imagine something like A -> B -> C, and that I want to reattach C to A (now feasible), and also reattach B to C (not feasible, because I have changed the head). Am I asking too much here?

jheinecke commented 1 year ago

I'll have a look at the regex expansion to deprels (shouldn't be that difficult) as well as the second point. Concerning the reattachment: probably more difficult to implement and test in one week :-) Did you try doing it in two passes? First reattach C to A (maybe leaving some information in the Misc column of A, B and C), and in a second go reattach B to C?
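The two-pass idea can be sketched in plain Python like this. It only illustrates the logic and is not how replace.sh works internally: the token representation (a dict from token id to a small dict with head and misc) and the Misc key NewHead are both hypothetical.

```python
def pass_one(tokens, is_target):
    """Reattach each selected C to its grandparent A and leave a note on B."""
    for cid, c in tokens.items():
        b = tokens.get(c["head"])          # None if C is the root (head 0)
        if b is None or b["head"] == 0 or not is_target(c):
            continue
        c["head"] = b["head"]              # C now depends on A
        b["misc"]["NewHead"] = str(cid)    # remember where B should go later


def pass_two(tokens):
    """Apply the notes left by pass_one, then remove them."""
    for tok in tokens.values():
        new_head = tok["misc"].pop("NewHead", None)
        if new_head is not None:
            tok["head"] = int(new_head)
```

Starting from A (id 1, root), B (id 2, head 1) and C (id 3, head 2), with only C selected, the first pass gives C the head 1 and marks B; the second pass then gives B the head 3. The temporary Misc entry plays the role of the "memory" that a single rule lacks.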

Stormur commented 1 year ago

Thanks as always for the support! In any case, there is no hurry; these are suggestions for the future.

As for the second point, this is a very interesting way to do that which I did not think of... a little bit twisted, but probably effective and allowing me to not write annoying code just for this! :grimacing:

jheinecke commented 1 year ago

I had a look: the first bit is very easy, but the second, this() in conditions, is very difficult, since in the conditions part these functions return CoNLL-U words and do not check whether their argument is true. I'll have to change quite a bit, but things like head(Feat:Gender=this(Feat:Gender)) do indeed look very useful for checking agreement. So this will take some time ...

jheinecke commented 1 year ago

Hi, try the latest version (2.18.1). I had to change the syntax slightly (see the doc in doc/mass_editing.md): to check whether a column value is identical to a value in a head/child/prec/next, try

@Feat:Gender=head(@Feat:Gender) and Upos:ADJ

This is true if the current word is an adjective and its head has the same value for the feature Gender. If the feature is absent in the word or in the head, this expression is evaluated as false.
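Spelled out in Python, the semantics described here amount to something like the following sketch. The function names and the token representation (dicts with upos and feats keys) are hypothetical and only mirror the explanation above, not the editor's code.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Gender=Fem|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def agrees_with_head(token, head, feature):
    """True only if both token and head carry the feature with the same value."""
    mine = parse_feats(token["feats"]).get(feature)
    theirs = parse_feats(head["feats"]).get(feature)
    return mine is not None and mine == theirs


def rule_matches(token, head):
    """Rough equivalent of: @Feat:Gender=head(@Feat:Gender) and Upos:ADJ"""
    return token["upos"] == "ADJ" and agrees_with_head(token, head, "Gender")
```

Note that a missing Gender on either side yields False, so two words that both lack the feature do not count as agreeing.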

2 days to data freeze :-)

Stormur commented 1 year ago

Hi!

Probably I am missing it, but is there not a way to check if a node has or has not a given feature? Just an existential check, without necessarily knowing or listing all possible values.

For example, I might be interested in doing something to all nodes with a NumType of any kind, or maybe with those which do not bear any value for InflClass.

jheinecke commented 1 year ago

good point! Seems impossible right now, I'll have a look into it

jheinecke commented 1 year ago

Try V2.22.4: if you search for Feat:NumType: without any value, it will output all words with the feature NumType. In order to find words without a given feature, use not Feat:NumType:
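For comparison, the same existential check on a raw FEATS string is only a few lines of Python. This is a sketch of the idea with a hypothetical helper name, not the editor's implementation.

```python
def has_feature(feats, name):
    """True if the FEATS string contains the feature, whatever its value."""
    if feats == "_":
        return False
    return any(item.split("=", 1)[0] == name for item in feats.split("|"))
```

The negated search (not Feat:NumType:) then corresponds to not has_feature(feats, "NumType").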

Stormur commented 1 year ago

Great, thanks! I will put it immediately to work!