Annotald / annotald

A program for annotation in the Penn Treebank format
GNU General Public License v3.0

Sentence deleted in annotald #66

Closed diertani closed 9 years ago

diertani commented 9 years ago

I inadvertently deleted a root node while parsing, so that all of the constituents contained in that token became tokens of their own. Somewhat reflexively, I hit Undo, and instead of either undoing it or doing nothing (the latter being what I actually expected), it just... deleted the token entirely. So now, in the file /home/migration/other/MIDENG/PPCMBE/psd/stage-2/DIERTANI/erv-1881-acts.psd, there is a token 478 and a token 480, but no token 479. I have been unable to undo this.

diertani commented 9 years ago

Just an FYI: I had to download and then upload both the file and the .bak file to recover the sentence because of the limitations on emacs with the remote connection, but I have successfully restored the missing token in the file.

beatrice57 commented 9 years ago

youch - this looks reasonably serious.

(Replying by email to diertani's comment of Sep 18, 2014, 10:21.)

aecay commented 9 years ago

From what I can reproduce, it looks like the "missing" token would in fact have been moved to the end of the file -- can you see if that was the case? If so, then I will be confident that I have squashed this bug, and I'll make a new version of annotald and have it updated on babel.

beatrice57 commented 9 years ago

thanks for looking into this so quickly, aaron.

thank goodness the token wasn’t actually deleted!

(Replying by email to Aaron Ecay's comment of Sep 20, 2014, 2:42 PM.)

diertani commented 9 years ago

It doesn't appear to have happened in the file I was working on. I can only find one copy of token 479, and it's the one I pasted in from the .bak file; the last token is 1082.

I forgot -- there was one other thing about the bug, which I only noticed when I went in to perform transplant surgery. The token wasn't entirely deleted: the punctuation was still there, although everything else, including the token number, was gone (as seen in emacs).

aecay commented 9 years ago

Well I'm puzzled. I definitely fixed some bugs related to undo, but none of them should have had the symptoms that you observed. I've asked Vince to upgrade annotald and pull the new fixes onto babel. When that is done, I'll post another message here.

When that happens, could you try doing the thing you did to trigger the bug using a temporary copy of one of the corpus files, just to verify that everything behaves as expected without putting any data in danger?

There is also an opt-in feature in annotald that is supposed to verify the integrity of the text, and would hopefully catch any such inadvertent deletions in the future. It involves using a script to add a special cookie to the beginning of each file which contains a checksum of the file's text; annotald then verifies that the checksums match on every save (and complains if they don't). Would you like to try that? If so, I can give you directions to enable it.

diertani commented 9 years ago

Yeah, I can definitely do that.

The checksum cookie sounds worth enabling. It only checks the text?

aecay commented 9 years ago

The update to annotald has been applied on babel, so you can commence testing the fix whenever is convenient for you.

Yes, the checksum cookie only checks the text of the file. So you should be able to edit the structure in arbitrary ways without changing the hash. But if the text is changed, then you will get an error message when you save. (You can override the error and force save if that is the correct thing to do, just as with the "non matching invocations" error.) The algorithm is smart enough to ignore traces and empty categories when determining what is the "text" in the file.

In order to add the checksum cookie to a file, execute the following command:

annotald-aux hash-file file.psd

The file will be re-written in place with a line like the following at the beginning:

( (VERSION (HASH (MD5 305c68f63132f992dd20373adbfcc55e))))

The hash value will obviously be different for different files. When annotald sees such a line in your file, it will do the checksum comparison on save. If you get a mismatched checksum error, I'd advise you to use the command line to make a manual backup copy of the file and then force save. You can then compare the two copies of the file. (Let me know if you need help with this). Any such errors that result will point to either a data loss bug in annotald, or a less-serious bug in the computation of the hashes themselves.

If you ever want to de-cookie a file, you can use a text editor to delete the "VERSION" line and the following blank line from the file.
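
To illustrate why structural edits leave the checksum alone, here is a minimal Python sketch of a text-only hash. It is not annotald's actual code: the regular expression, the function names (tree_text, text_hash), and the rules for what counts as "not text" (leaves starting with "*", the leaf "0", and ID/CODE nodes) are all assumptions made for the example.

```python
import hashlib
import re

# Matches a leaf node of the form "(TAG word)".
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

# Assumption: leaves under these tags are metadata, not text.
# Annotald's real rules may differ.
SKIPPED_TAGS = {"ID", "CODE"}


def is_empty_category(word):
    """Assumed convention: traces and empty categories start with '*' or are '0'."""
    return word.startswith("*") or word == "0"


def tree_text(tree):
    """Return the words of one bracketed tree, ignoring structure and empty leaves."""
    words = [
        word
        for tag, word in LEAF.findall(tree)
        if tag not in SKIPPED_TAGS and not is_empty_category(word)
    ]
    return " ".join(words)


def text_hash(trees):
    """MD5 of the concatenated text of all trees (structure does not matter)."""
    text = " ".join(tree_text(t) for t in trees)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Same words, different bracketing and an added trace: the hash is unchanged.
flat = "( (IP-MAT (NP-SBJ (PRO He)) (VBD spake) (PUNC .)) (ID ACTS,1.479))"
nested = "( (IP-MAT (NP-SBJ (PRO He)) (VP (VBD spake) (NP-OB1 *T*-1)) (PUNC .)) (ID ACTS,1.479))"
# One word changed: the hash differs.
edited = "( (IP-MAT (NP-SBJ (PRO He)) (VBD spoke) (PUNC .)) (ID ACTS,1.479))"

print(text_hash([flat]) == text_hash([nested]))
print(text_hash([flat]) == text_hash([edited]))
```

In this sketch a deleted token would change the concatenated text and therefore the hash, which is the kind of mismatch the save-time check is meant to flag.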

diertani commented 9 years ago

I tried out the de-bugged version on a test file (/home/migration/other/MIDENG/PPCMBE/psd/stage-2/DIERTANI/erv-numbers-test.psd), and tried deleting root nodes and then undoing it. I did it three times, and each time, the node was resurrected at the beginning of the file. So sentences are not being deleted, but they're still ending up in funny places.

diertani commented 9 years ago

I ran annotald-aux hash-file on a test file (the imaginatively named TEST.psd) and got the following message:

Computing hash for TEST.psd.
This file has no version cookie; adding.
Traceback (most recent call last):
  File "/usr/local/bin/annotald-aux", line 176, in <module>
    args.func(args)
  File "/usr/local/bin/annotald-aux", line 32, in hash_file
    new_hash = annotald.util.hashTrees("\n\n".join(trees), vc)
  File "/home/diertani/.local/lib/python2.7/site-packages/annotald/util.py", line 413, in hashTrees
    text = " ".join(map(fn, trees))
  File "/home/diertani/.local/lib/python2.7/site-packages/annotald/util.py", line 403, in _getText
    l = reduce(_squashAt, l)
TypeError: reduce() of empty sequence with no initial value

I'm not seeing the hash line in the file. Did I do this wrong somehow?
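
For reference, the TypeError at the bottom of the traceback is what Python's reduce raises when it is handed an empty sequence and no initial value. One plausible trigger (an assumption, not confirmed in this thread) is a tree that contributes no words at all. The sketch below reproduces the error and shows one way to guard against it; join_words and join_words_safe are hypothetical names, not functions from annotald.

```python
from functools import reduce


def join_words(words):
    # With no initial value, reduce() raises TypeError on an empty list.
    return reduce(lambda a, b: a + " " + b, words)


def join_words_safe(words):
    # Handle the empty case explicitly before reducing.
    if not words:
        return ""
    return reduce(lambda a, b: a + " " + b, words)


try:
    join_words([])
except TypeError as err:
    print("unguarded:", err)

print("guarded:", repr(join_words_safe([])))
print("guarded:", repr(join_words_safe(["He", "spake", "."])))
```

An explicit empty check is used here rather than passing an initial value of "" to reduce, because an empty-string seed would add a leading space to every non-empty result.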

Question about this part:

When annotald sees such a line in your file, it will do the checksum comparison on save. If you get a mismatched checksum error, I'd advise you to use the command line to make a manual backup copy of the file and then force save. You can then compare the two copies of the file. (Let me know if you need help with this). 

I don't usually see the command line when I have annotald open and running (until I've exited the programme) because the terminal is full of annotald. Is the only way to make a backup copy to log in on a separate session?

aecay commented 9 years ago

Addressing your second comment: Indeed, there's a bug that causes the error you see. I don't want to bother Vince for another update of annotald so soon, so you'll have to use my copy of annotald-aux:

~ecay/.local/bin/annotald-aux hash-file foo.psd

As for your second question: yes, you would have to open a second ssh connection (no tunnel needed) to make the backup copy.
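
From that second session, the backup-and-compare step could be done with ordinary shell tools, or with a short Python script like the sketch below. The function names (backup, diff_files) and the ".manual-bak" suffix are made up for the example; nothing here is part of annotald.

```python
import difflib
import shutil
from pathlib import Path


def backup(path, suffix=".manual-bak"):
    """Copy the file next to itself, e.g. TEST.psd -> TEST.psd.manual-bak."""
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    shutil.copy2(src, dst)  # copy2 also preserves timestamps
    return dst


def diff_files(old, new):
    """Return a unified diff of two text files (empty string if identical)."""
    old_lines = Path(old).read_text(encoding="utf-8").splitlines(keepends=True)
    new_lines = Path(new).read_text(encoding="utf-8").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, fromfile=str(old), tofile=str(new))
    )


# Intended use (the file name is a placeholder):
#   saved = backup("file.psd")                 # before forcing the save
#   ... force save in annotald ...
#   print(diff_files(saved, "file.psd"))       # see what the save changed
```

Lines prefixed with "-" in the output exist only in the backup, so a token lost on save would show up there.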

I thought I had really fixed the undo bug...I'll take another look at it later this afternoon.