Proposal for mass editing

Stormur commented 3 years ago

Hi! As always, let me first compliment you on this application, which we use for annotation with ever greater success! Great work! :slightly_smiling_face:

Since there are already some advanced search functions but no corresponding item on the TODO list, I was wondering whether it would be possible to add a function for mass editing based on (regular expression) searches, and more generally to include the sentence text among the searchable fields.

Some dummy examples:

- substitute all Imp values of the Tense feature of some given VERB tokens with Tense=Past in the whole document
- substitute all commas at the end of a sentence with semicolons in the whole document (less trivial than it seems)
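
A minimal sketch of the first example in plain Python, outside ConlluEditor. It assumes standard tab-separated CoNLL-U (UPOS in the fourth column, FEATS in the sixth); the function name `replace_feature` and the sample sentence are made up for illustration.

```python
def replace_feature(conllu_text, upos, old_pair, new_pair):
    """Swap one Feature=Value pair for another on every token with the given UPOS."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Comments, blank lines and lines with fewer than 10 columns pass through unchanged.
        if line.startswith("#") or len(cols) < 10 or cols[3] != upos:
            out.append(line)
            continue
        feats = cols[5].split("|")
        cols[5] = "|".join(new_pair if f == old_pair else f for f in feats)
        out.append("\t".join(cols))
    return "\n".join(out)


# Illustrative one-token sentence (not taken from a real treebank).
sample = "\n".join([
    "# text = amabat",
    "\t".join(["1", "amabat", "amo", "VERB", "_", "Mood=Ind|Tense=Imp|VerbForm=Fin", "0", "root", "_", "_"]),
    "",
])
print(replace_feature(sample, "VERB", "Tense=Imp", "Tense=Past"))
```

Because only the value changes, the alphabetical order of the features is preserved; renaming a feature would require re-sorting the FEATS column.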

Having such functionality already integrated in a shell that reads and edits CoNLL-U+ files would be great! Or is it already there and I am missing it?

jheinecke commented 3 years ago

Hi! Thanks :-) It's nice to hear that people like and use my tool (I have no idea how many annotators work with it, drop me a note :-))

Mass editing is a good idea (for the time being I do it with more or less intelligent Python programs, but integrating it in the editor would be nice). I will think about it (it has to be rather foolproof to avoid major undetected data corruption). What other kinds of rules can you think of? Rules conditioned on single tokens do not look too difficult to implement, but rules which take the dependencies into account are trickier (change the feature F=V to F=V2 if the head of the word has a case dependent ...). Were you thinking of an interaction with the GUI, or of a script which reads the rules and the CoNLL-U file and outputs the result with the rules applied? If the latter, maybe LORIA's https://grew.fr/ would be a solution?

Concerning CoNLL-U+, there is partial support: the 10 standard CoNLL-U columns must be present, but the editor allows editing any additional column (without testing whether it makes sense; for instance, it does not check whether, in a B-I-O tag which spans multiple tokens, the first starts with B: and the subsequent ones with I:). If you want to edit CoNLL-U+ files which do not contain all 10 standard columns, there is a script, bin/conlluconvert.sh.

Stormur commented 3 years ago

Our team does, but the numbers vary!

First, the last part: I wasn't asking for a particular treatment of CoNLL-U+, but just referring to the capability of conllueditor to handle both CoNLL-U and CoNLL-U Plus files!

I too am using more or less standardised Python routines, but as the complexity of our annotation work increases, they become more and more intricate: even just reading and printing syntactic trees is not as simple as it seems!

What I am envisioning for now is just on the simpler level: this would already help greatly. We can think that more complex substitutions, involving for example multiword tokens, conditions on siblings, conditions on other relations and so on can be left to the user to solve in other ways for the moment. So I'd just stick to things such as:

- change the deprel of all tokens with a given UPOS, lemma and/or deprel, e.g. changing the deprel of all PARTs of the given forms with advmod to discourse
- change/delete a feature, if present, based on form/lemma/UPOS and on a combination of other features, e.g. as before change all Tense=Imp of all VERB tokens which also have VerbForm=Fin to Tense=Past, and delete it otherwise
- invert two given tokens in the text based on any possible feature-value combination, e.g. invert all neuter ADJs with a following NOUN at a distance of 1/2/... tokens (I don't know why one would do that, but why not)
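
The first item can be sketched in plain Python as a "condition, then changes" pair applied to one token line. This is only an illustration under stated assumptions: the names `apply_rule` and `is_advmod_particle`, the lower-case column keys and the forms `ne`/`num` are hypothetical and not part of ConlluEditor.

```python
COLUMNS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def apply_rule(line, condition, changes):
    """Overwrite the columns listed in `changes` when `condition(token)` holds."""
    cols = line.split("\t")
    if line.startswith("#") or len(cols) < len(COLUMNS):
        return line  # comment, blank line or malformed line: leave untouched
    token = dict(zip(COLUMNS, cols))
    if condition(token):
        token.update(changes)
    # Any CoNLL-U Plus columns beyond the first ten are carried over as they are.
    return "\t".join([token[c] for c in COLUMNS] + cols[len(COLUMNS):])


def is_advmod_particle(token):
    # Hypothetical condition: a PART with one of the given forms, attached as advmod.
    return token["upos"] == "PART" and token["form"] in {"ne", "num"} and token["deprel"] == "advmod"


line = "\t".join(["3", "ne", "ne", "PART", "_", "_", "5", "advmod", "_", "_"])
print(apply_rule(line, is_advmod_particle, {"deprel": "discourse"}))
```

Keeping the condition as an ordinary function makes it easy to combine UPOS, form and deprel tests without a dedicated rule language.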

Admittedly, I have not yet tried to use grew for graph rewriting, but I'll see what it can do!

Stormur commented 3 years ago

By the way, since you mentioned it: could you point me to a contact for a member of the Grew project? I have started trying it out and have questions on some issues, but I can't seem to find any contacts. Thanks in advance!

And just as a comment: from what I understand, with a tool like Grew it is not possible to just change e.g. the dependency relation of a node if there is not already an edge, as Grew only works with graphs. But such an edit is a routine for me and it is very useful as a sort of "pre-annotation", and I reiterate that conllueditor would be the perfect place for it! :slightly_smiling_face:

jheinecke commented 3 years ago

I'll see what I can do for the change function you propose. It has to be more foolproof than the rest, since you can easily make an error in the find condition and change things which should stay put. That said, there is always git to get back to the starting point.

When annotating new sentences from scratch (for UD_Welsh), I trained UDPipe on the existing sentences to make it pre-annotate the new stuff. Then I correct and validate (with some scripts encoding linguistic knowledge which check checkable things, e.g. unambiguous endings like -odd, always "3rd person singular past tense").

For a grew contact, check at the bottom of https://grew.fr/grew_match/help/

Stormur commented 3 years ago

> For a grew contact, check at the bottom of https://grew.fr/grew_match/help/

Thanks! ... I feel somewhat stupid, probably I was too hasty. I am editing part of my previous comment.

> When annotating new sentences from scratch (for UD_Welsh), I trained UDPipe on the existing sentences to make it pre-annotate the new stuff. Then I correct and validate (with some scripts encoding linguistic knowledge which check checkable things, e.g. unambiguous endings like -odd, always "3rd person singular past tense").

I also have a similar routine, but I noticed that quite often I discover systematic errors of the automatic annotation, or also evidence that prompts me to change already annotated sentences, only while working on the annotation itself! This also gives more control, because I find that "post-production" can often be too rough.

OK, I have bothered you enough, thanks again for all the past and upcoming updates! :slightly_smiling_face:

jheinecke commented 3 years ago

Hi @Stormur, I came across an idea which I think will suit you. I'm implementing a system of rules which, if evaluated as true for a given token, can change some (not all) CoNLL-U columns:

Upos:ADP and (Lemma:d.* or Lemma:a.*) > Xpos:MyPrep Feat:MyFeature=MyValue
Upos:NOUN and !Feat:Number=Plur > Xpos:NN

Changeable columns are Form, Lemma, Upos, Xpos and Deprel. In the columns Feat, Deps and Misc, new values can be added or existing values overwritten. Maybe I should add the possibility to delete a Key=Value from Feat and Misc.

Currently it takes the form of a program which reads your CoNLL-U file and a list of rules like these and makes the changes. But I could also integrate it in the GUI of ConlluEditor. What do you think? It is already working, but I have to create some more (unit) tests to avoid bad surprises...

jheinecke commented 3 years ago

Check version 2.12.0: there is a (first) way to change tokens in an entire file using conditions, like `child(Upos:VERB && Feat:VerbForm=Part) and child(Upos:DET) > Misc:MyKey=NewCal`. Check the doc and tell me what you think. If it is what you need, I will add a search & replace function based on these conditions.

Stormur commented 3 years ago

Hi! Thanks for all the work! We will probably try this feature in an upcoming annotation round; I will let you know!

Stormur commented 3 years ago

OK, I tried it on a CoNLL-U file, but I got this error:

Invalid option --cedit
Conll Error incorrect line: (line 1): Lemma:sum > Upos:AUX (line 1)

There are 2 lines in the correction file, with space-separated fields:

Lemma:sum > Upos:AUX
(Lemma:tuus or Lemma:meus or Lemma:suus or Lemma:noster or Lemma:uoster or Lemma:uester) > Upos:DET

jheinecke commented 3 years ago

Hi, sorry for the late reply (summer ...). Did you try

./bin/replace.sh rules.txt data.conllu

You also have to run mvn install after git pull to get the new code compiled

Stormur commented 2 years ago

Hi again! Yes, I understand summer very well! :smile:

It appears that I forgot to recompile everything. Now I briefly tried to run the mass corrections and I succeeded. Thanks again!

Update: is there a functionality to do that "in place", or did I miss it in the documentation?

Update: Why is the MISC field not available as a key in the expression?

jheinecke commented 2 years ago

I simply forgot the MISC field in the expression, but it is now available; I've just pushed the changes. For the time being only the replace.sh script uses this functionality, but I will add a search-and-replace function in the GUI.

Stormur commented 2 years ago

OK, thanks!

I have another small suggestion: the possibility to include external files in the definition of conditions for rules. For example, imagine that you have to resort to a list of words to decide to apply a given marking: then, one would want to do something like

Lemma:list.txt > Feat:XXX=xxx

We all know that unfortunately some kinds of annotations/corrections are of the idiosyncratic type rather than being definable in a neat way! :smile:

jheinecke commented 2 years ago

That's a good idea. I'll try to do this. I only have to choose a symbol which indicates that list.txt is not a lemma but a filename, something like LemmaList:list.txt or Lemma:<list.txt (hoping that no treebank will use lemmas starting with <).

jheinecke commented 2 years ago

Try it. The final syntax is Lemma:#filename.txt and Form:#anotherfilename.txt. I chose # as the symbol since there is no treebank yet which has a form or lemma starting with or containing #.
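
What a list-based condition checks can be sketched in plain Python. The file layout (one entry per line) is an assumption here, as are the function names; doc/mass_editing.md is the place to check what ConlluEditor actually expects.

```python
from pathlib import Path


def load_word_list(path):
    """Read a word list, assuming one entry per line; blank lines are skipped."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {entry.strip() for entry in lines if entry.strip()}


def lemma_in_list(line, words):
    """True for a token line whose LEMMA (third column) is in the word list."""
    cols = line.split("\t")
    return not line.startswith("#") and len(cols) >= 10 and cols[2] in words
```

Loading the list once into a set keeps the membership test cheap even for long lists.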

Since you had the idea for this: if I add a search-and-replace function to the GUI, would you like it to ask for confirmation at each change, or just change everything (like ./bin/replace.sh rules.txt data.conllu already does)?

Stormur commented 2 years ago

OK, I tried it and it works.

Just two remarks:

- Sometimes one wants to remove a feature: how does one do it? I tried Feat:feature=, but it writes exactly that.
- It seems that there is an error while reading a condition on a layered feature like Person[psor]: the problem is the brackets.

Probably both would be needed: a simple replace button which goes through each case (finds one, then replaces or not, and so on), and a "replace all" if one feels like that! Keeping the possibility to revert the change.

jheinecke commented 2 years ago

I had forgotten to add [] to the list of valid characters for feature names; it's now fixed. I also made an empty feature value delete the feature, so Feat:Number= will remove the Number feature from the current word.
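
The "empty value deletes the feature" behaviour can be illustrated with a small Python sketch. It is not ConlluEditor's code; `set_feature` is a hypothetical helper working on a FEATS string, and it assumes the usual UD conventions (`_` for no features, pairs sorted alphabetically ignoring case).

```python
def set_feature(feats, name, value):
    """Set name=value in a FEATS string; an empty value removes the feature."""
    pairs = {} if feats == "_" else dict(p.split("=", 1) for p in feats.split("|"))
    if value == "":
        pairs.pop(name, None)  # removing an absent feature is a no-op
    else:
        pairs[name] = value
    if not pairs:
        return "_"
    # UD sorts features alphabetically, ignoring case.
    return "|".join(f"{k}={pairs[k]}" for k in sorted(pairs, key=str.lower))
```

Layered names such as `Person[psor]` need no special handling here, since only the first `=` separates name and value.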

Stormur commented 2 years ago

Hi! Here I am again about the mass modifier. I am greatly enjoying this feature, thanks & congratulations again!

I am coming with two new remarks:

- Is it possible to make multiple changes based on the same token? The docs do not state this explicitly.
- Would it be possible to refer to existing values in the changes? For example, I want the lemma to be the same as the form, so I'll need something like ... > lemma:token.Form, also with the ability to modify it, like ... > lemma:token.Form+er, which would replace a form/lemma pair like thing unknown with thing thinger, and so on. Also, with the possibility to refer to specific fields of Feat and MISC.

Thanks again for everything!

jheinecke commented 2 years ago

Hi, thanks for the new ideas :-). I think something like > lemma:token.Form+er won't be difficult to do. However, I did not get the second thing: how do you want to refer to Feat or Misc? In order to change them, you just say > Lemma:... Feat:Number=Sing Misc:Key=Value, but this is probably not what you are thinking of...

Stormur commented 2 years ago

I meant that I might want to refer to specific values of e.g. Feat or Misc. For example (just a random one): if the token is an AUX, then I want its VerbType to be the same as its MISC value Modal, and its UPOS to be VERB. So I envision something like:

UPOS=AUX > upos=VERB and feat:VerbType=token.MISC:Modal

A situation in which this comes handy is when I want to substitute the name of a feature, for example:

!Empty > feat:InflClass=token.feat:NounClass and feat:NounClass=

(noting that if the token has no NounClass feature, the value is empty and in the end nothing happens)

Of course, this need not be limited to the token itself, but could allow retrieving values from the head, for example, or maybe even from a child? This last one is more difficult...

jheinecke commented 2 years ago

I'm playing around with something like:

conditions > Lemma:"prefix"+head(Lemma)+"suffix"  upos:"NOUN"

which would set the Lemma to the Lemma of the head prefixed by prefix and suffixed by suffix and change the Upos to NOUN

Your examples would translate as follows (N.B. no "and" is necessary on the right side, as is already the case in the current version):

Upos:AUX > upos:"VERB"   feat:"VerbType="+this(Misc_Modal)
!Empty   > feat:"InflClass="+this(Feat_NounClass)   feat:"NounClass="

So the syntax will change in that literals (like VERB) must be enclosed in quotes. Values of other columns can be retrieved with this(columnname). I think a substring() and a replace() function would be useful too; I'll think about that. I only need some time to get it done ...
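
How such a right-hand side could be evaluated is sketched below in plain Python. The representation is purely hypothetical (a list mixing literal strings with `(source, column)` lookups) and is not how ConlluEditor parses its rules; it only shows the concatenation of quoted literals with values taken from the current word or its head.

```python
def build_value(parts, this, head):
    """Join quoted literals (str) with column lookups given as (source, column) pairs."""
    pieces = []
    for part in parts:
        if isinstance(part, str):
            pieces.append(part)  # a literal such as "prefix"
        else:
            source, column = part  # e.g. ("head", "Lemma") or ("this", "Form")
            word = head if source == "head" else this
            pieces.append(word[column])
    return "".join(pieces)


# Illustrative words: the current token and its head.
this = {"Form": "thing", "Lemma": "unknown"}
head = {"Form": "is", "Lemma": "be"}
print(build_value(["prefix", ("head", "Lemma"), "suffix"], this, head))
print(build_value([("this", "Form"), "er"], this, head))
```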

jheinecke commented 2 years ago

Hi! I have just pushed something which will be of interest to you. N.B. the syntax on the right side of > has changed (see above, or better, doc/mass_editing.md).

Stormur commented 2 years ago

Hi! This closure actually comes right when I was about to write about some features and editing behaviour after using this wonderful function heavily for data base processing! :nerd_face:

So here it goes:

- It's not completely clear to me how I can simply clear a field, e.g. all of the morphology, or even a lemma, anything. I tried something like > Feats:, but it seems I get an error. I also tried something like > Xpos:"_", but that was again an error.
  - Related to this, sometimes I'd like to just remove a field for any token, so the condition could be that a node is simply not empty. But again, it seems I am getting errors with conditions such as !Empty.
- Would it be possible to also add all other fields as options for editing? HEAD and DEPS are missing both from conditions and from new values.
- Maybe I have not tried it, but is it possible to use logical operators inside the conditions as well? Something like Upos:(VERB || AUX) instead of Upos:VERB || Upos:AUX.
- I noticed that new values for DEPREL do not change the root value. This may be touching something fundamental, but it happened to me when I was trying to strip all nodes of their dependency relations: it succeeded, but root remained there. In general, paradoxical though it may sound, could a syntaxless mode be conceivable for CoNLL-U? (Knowing that UD treebanks require those fields at least, but the format might also be used outside of that context.)
- I had problems integrating the mass editing into another script, in particular a bash script (on Ubuntu). It seems that the process cannot be launched from outside the bin folder; it raises an error (even if I give complete paths for all variables).
- Am I missing something, or is it possible to launch mass editing on several files at once?

Thanks again for all the support!!! Ad maiora! :rocket:

jheinecke commented 2 years ago

Hi again,

thanks for these remarks. I'll try to address them from easy to difficult:

- You should be able to call replace.sh from anywhere on your filesystem. I tried it like path/to/conllueditor/bin/replace.sh rulefile conllufile and it worked without problems. Maybe on your Ubuntu box the package coreutils is not installed (it provides readlink, which is used in replace.sh). If readlink is available and it still does not work, please send me the error messages.
- Currently you can only call replace.sh with a single CoNLL-U input file, since the output is written to stdout.
- It is not possible to use logical operators inside the conditions (something like Upos:(VERB || AUX) instead of Upos:VERB || Upos:AUX). If I find the time, I will think about doing that :-) but it's not at the top of the priority list ...
- Emptying XPOS etc. works like Xpos:"_". I'll add a way to remove all features from the Feat column (and similarly all key=value pairs from Misc):
  - Feat:"_" and Misc:"_" will remove all key=value pairs
- The Empty / !Empty condition causes an error if something other than Form and Misc was to be changed for a multiword token. I'll change this to a warning (and will add a condition like isMWT).
- "I noticed that new values for DEPREL do not change the root value." Currently, when a sentence is serialized into CoNLL-U, if the Head is 0, its Deprel is automatically set to root, independently of what a rule may have changed. Since the CoNLL-U format requires this, I prefer to keep it to avoid producing invalid output. I'll think about a non-CoNLL-U mode.

Concerning your remark about adding HEAD and DEPS (enhanced dependencies) to the conditions and new values: this is feasible, but does it make sense? A rule which changes the head to another token will be so specific that it is easier to edit it manually (or to train a parser which will do it more or less, and then validate manually). Or I could make a condition like Head:-1, which means "if the head is the preceding token". So rules like

Head:-1 > Feat:"Name=Value"     # if the token has its preceding word as head, then add the feature `Name=Value`
Upos:DET > head:"1"             # if the token has the Upos `DET`, then make the following token its head

I still doubt the usefulness of this, but since it would be rather easy to implement, I can do it. What do you think?

Stormur commented 2 years ago

Here again!

- I already had an up-to-date coreutils, and now I have tried again calling the modifier, the rules and the file all from different folders, and it does indeed work. What I stumbled upon was probably a mixture of the other issues.
- Modifying HEAD and DEPS and everything: I think it does make sense! Or better: why not have it? Now, what I was trying to do this particular time was just to remove all values. But it has already happened to me that I identified some constructions which needed a small transformation of their syntactic subtrees, e.g. a generalised inversion of copulas (so with a condition like head(Upos:AUX) > ..., then detaching and reattaching nodes), or small movements of conjunctions, particles, and so on... Manual editing can be very time-consuming when such cases are repeated systematically in a corpus. Now, it is true that, if not done carefully, this can produce a non-tree graph or otherwise break the validity of the CoNLL-U. But isn't this checked anyway? The positional head argument might be useful. But in general one would want to retrieve the index of a token's head regardless of linear ordering. Anyway, if it's easy, let's do this! One never knows how one will want to mess with one's data! :smile:

Again, thanks a lot for your support, it's really appreciated. And merry Christmas! :christmas_tree:

jheinecke commented 2 years ago

Hi, try version 2.14.1 (git pull ; mvn install). In order to change the Deprel of a token with head 0, use the (new) option --strict with replace.sh. I hope the doc is comprehensible; if not, tell me :-)

Happy New Year

Stormur commented 2 years ago

Hi again!

I have been using the mass-editing tool again and enjoying the new features. Now, I got an apparently harmless error when I tried this rule:

!IsEmpty and !IsMWT > Lemma:"_" Upos:"_" Xpos:"_" Feat:"_"

that is, I want to strip all actual tokens of those properties. It seems that everything works, but (when calling replace.sh from a bash script) I get this (I am replacing the real paths with dummy ones):

.../conllueditor/bin/replace.sh: 22: [[: not found
.../conllueditor/bin/replace.sh: 29: [: .../file.conllu: unexpected operator
20727 lines (840 sentences) read

13034 changes for condition: !IsEmpty and !IsMWT  values:  Lemma:"_" Upos:"_" Xpos:"_" Feat:"_"
13034 changes

jheinecke commented 2 years ago

Hi! It looks as if you have used an undocumented mechanism for using older versions :-) bin/replace.sh interprets the first argument as a version number if it starts with digits. What is the name of your rule file? I have commented all of this out and pushed a simpler bin/replace.sh. git pull should make it work.

Stormur commented 2 years ago

The filename has no digits, it is Exutor.conllueditor (I am using this moot extension to better sort the files).

Update: I pulled, and now it gives only the error of the kind .../conllueditor/bin/replace.sh: 29: [: .../file.conllu: unexpected operator (but everything works as before).

jheinecke commented 2 years ago

Do you use Linux or Mac and what version of bash is installed? I do not know the ... syntax, it seems that it is stumbling over this. Can you give me the exact line how you call replace.sh ?

Stormur commented 2 years ago

Yes, so, I am using Linux, Ubuntu 20.04.4 LTS "Focal Fossa". I am using the sh shell and calling this line in a for loop from bash (again, I am adjusting the names for privacy and clarity):

for file in $(ls "$folder/subfolder/prefix"*)
do
    base=$(basename "$file")
    sh "$conllueditor/bin/replace.sh" "$folder/Exutor.conllueditor" "$folder/subfolder/$base" > "$folder/subfolder/newprefix_$base"
done

$conllueditor stores the path of the conllueditor's folder, I am calling from another one.

Just as an aside, I admit that I don't know whether using ls is such a wise choice, but I am not very experienced with bash scripts. I accept suggestions in general! :grimacing:

jheinecke commented 2 years ago

I see the problem: you put sh in front of $conllueditor/bin/replace.sh, so replace.sh no longer uses /bin/bash (indicated in the first line of replace.sh) but /bin/sh. At least on my Ubuntu machine, /bin/sh is a symbolic link to dash, and dash produces your error. So either remove sh from your loop or use bash instead. I'd prefer removing it, since shell scripts usually know best which interpreter they need.

jheinecke commented 2 years ago

Instead of for file in $(ls "$folder/subfolder/prefix"*) you can use for file in "$folder/subfolder/prefix"*. The ls is not necessary, but does no harm either.

Stormur commented 2 years ago

OK, so now everything works smoother! I adjusted the code as you suggested and it's better now. I don't know why I ended up with ls, since it's just an unnecessarily convoluted way to do a simple thing, but I had probably something else in mind before and just retained the snippet.

Thanks, till the next issue!

Stormur commented 1 year ago

As we are near the UD freeze, I am wildly tinkering with data. I have a couple of additional suggestions for mass editing:

- expanding regular expressions at least to deprels, too: sometimes one just wants to look for a subtype, or for a relation with any subtype;
- allowing the this function in conditions as well: some rules might e.g. depend on agreement of some feature, so something along the lines of head(Gender=this(Gender)) would be needed.

And something which I fear is much more complicated would be to implement a way to rearrange nodes as part of these rules. Probably some memory is needed, or at least a way to indicate the index of a child/head with given characteristics. Or maybe I am doing something wrong. Anyway: imagine something like A -> B -> C, and that I want to reattach C to A (feasible now), and also reattach B to C (not feasible, because I have already changed the head). Am I asking too much here?

jheinecke commented 1 year ago

I'll have a look at the regex expansion to deprels (it shouldn't be that difficult) as well as at the second point. Concerning the reattachment: probably more difficult to implement and test in one week :-) Did you try doing it in two passes? First reattach C to A (maybe leaving some information in the Misc column of A, B and C), and in a second go reattach B to C?
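
The two-pass idea can be illustrated in plain Python, independently of the rule language. Everything here is an assumption made for the sketch: tokens are dictionaries keyed by their ID, the Misc column is a dictionary, and `OldHead` is a made-up marker key that carries the lost information from the first pass to the second.

```python
def reattach_to_grandparent(tokens, selected):
    """Pass 1: move each selected token up to its head's head, noting the old head in Misc."""
    for tok in tokens.values():
        if selected(tok) and tok["head"] != 0:
            old_head = tok["head"]
            tok["misc"]["OldHead"] = str(old_head)  # hypothetical marker key
            tok["head"] = tokens[old_head]["head"]


def reattach_old_heads(tokens):
    """Pass 2: hang each remembered old head under the token that moved, then drop the marker."""
    for tok_id, tok in tokens.items():
        old_head = tok["misc"].pop("OldHead", None)
        if old_head is not None:
            tokens[int(old_head)]["head"] = tok_id


# A -> B -> C: A is the root, B depends on A, C depends on B.
tokens = {
    1: {"form": "A", "head": 0, "misc": {}},
    2: {"form": "B", "head": 1, "misc": {}},
    3: {"form": "C", "head": 2, "misc": {}},
}
reattach_to_grandparent(tokens, lambda tok: tok["form"] == "C")
reattach_old_heads(tokens)
print({tok["form"]: tok["head"] for tok in tokens.values()})
```

After the first pass C no longer points at B, so without the marker the second pass could not tell which token used to be C's head.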

Stormur commented 1 year ago

Thanks as always for the support! In any case, there is no hurry; these are suggestions for the future.

As for the second point, this is a very interesting way to do that which I did not think of... a little bit twisted, but probably effective and allowing me to not write annoying code just for this! :grimacing:

jheinecke commented 1 year ago

I had a look: the first bit is very easy, but the second, this() in conditions, is very difficult, since in the conditions part these functions return CoNLL-U words and do not check whether their argument is true. I'll have to change quite a bit, but things like head(Feat:Gender=this(Feat:Gender)) do indeed look very useful for checking agreement. So this will take some time ...

jheinecke commented 1 year ago

Hi, try the latest version (2.18.1). I had to change the syntax slightly (see the doc in doc/mass_editing.md). To check whether a column value is identical to a value in a head/child/prec/next, try

@Feat:Gender=head(@Feat:Gender) and Upos:ADJ

This is true if the current word is an adjective and its head has the same value for the feature Gender. If the feature is absent from the word or from the head, the expression evaluates to false.
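
The semantics described here (equal values, and false whenever the feature is missing on either side) can be mirrored in a short Python sketch. It is an illustration only, not ConlluEditor's evaluator; the sentence layout (a dictionary from token ID to token) and the function names are assumptions.

```python
def parse_feats(feats):
    """Turn a FEATS string such as "Case=Nom|Gender=Fem" into a dict ("_" means none)."""
    return {} if feats == "_" else dict(pair.split("=", 1) for pair in feats.split("|"))


def agrees_with_head(token, sentence, feature):
    """True only if token and head both carry `feature` with the same value."""
    head = sentence.get(token["head"])  # None for the root, whose head is 0
    if head is None:
        return False
    own = parse_feats(token["feats"]).get(feature)
    return own is not None and own == parse_feats(head["feats"]).get(feature)


# Token 1 agrees with its head in Gender; token 3 has no Gender at all.
sentence = {
    1: {"upos": "ADJ", "feats": "Case=Nom|Gender=Fem", "head": 2},
    2: {"upos": "NOUN", "feats": "Case=Nom|Gender=Fem", "head": 0},
    3: {"upos": "ADJ", "feats": "Case=Nom", "head": 2},
}
matches = [i for i, t in sentence.items() if t["upos"] == "ADJ" and agrees_with_head(t, sentence, "Gender")]
print(matches)
```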

2 days to data freeze :-)

Stormur commented 1 year ago

Hi!

Probably I am missing it, but is there not a way to check if a node has or has not a given feature? Just an existential check, without necessarily knowing or listing all possible values.

For example, I might be interested in doing something to all nodes with a NumType of any kind, or maybe with those which do not bear any value for InflClass.

jheinecke commented 1 year ago

Good point! It seems impossible right now; I'll look into it.

jheinecke commented 1 year ago

Try V2.22.4: if you search for Feat:NumType: without any value, it will output all words with the feature NumType. In order to find words without a given feature, use not Feat:NumType:
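
For comparison, the same existential check is easy to express in plain Python on a FEATS string. This is a hypothetical helper for illustration, not part of ConlluEditor.

```python
def has_feature(feats, name):
    """True if the FEATS string contains `name` with any value."""
    if feats == "_":
        return False
    # Compare whole feature names so that "Num" does not match "NumType".
    return any(pair.split("=", 1)[0] == name for pair in feats.split("|"))
```

Negating the result gives the counterpart of `not Feat:NumType:`, i.e. tokens that carry no value for the feature.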

Stormur commented 1 year ago

Great, thanks! I will put it immediately to work!

Orange-OpenSource / conllueditor

Proposal for mass editing #12