Open davidskalinder opened 4 years ago
@alexhanna, we just sent this to our coders, which we hope will both create a workaround and gather some data on how often we've lost data. You might want to do something similar since this is almost certainly causing at least some data loss for y'all as well.
We've recently noticed a bug in our web interface that might be causing some data loss, so we want to ask you how often you think it happens and also let you know how best to work around it.
The bug happens when duplicated items get onto the "Text Select" tab, like this:
This should only happen by accident: for example, if you double-clicked instead of single-clicked the "+" sign, or if you forgot to change the text selection when you meant to add text to a different field. The bug occurs when you click the "x" to delete one of the duplicates: the screen will show you that only one entry has been deleted, but in fact all of the entries with that text for that field in that event will be deleted from the database.
So first, we want to know how often you think you have accidentally added some text twice and have then removed one of the entries? For example, is this something that happens to you in almost every article, or perhaps once a day, or once a week, or once a month, or never? Please reply to this email and tell us as well as you can remember so that we can try to make sure that we can recover as much as possible of the great work you've done! If you can’t recall this even happening, that would also be helpful to know.
Second, to work around this problem from now on, please do not delete any duplicate entries you accidentally create on the Text Select tab. Just leave 'em in there: it's easy for us to ignore the duplicates later, and we can also use the duplicate entries to build a better idea of how often this bug has occurred in the past. This problem only happens if you have accidentally duplicated marked text and are trying to delete the duplicate but leave behind the original text. If you what you have marked is just wrong but not a duplicate, go ahead and delete it.
Here's where the rubber meets the road for this one:
Note that the value
field contains the position of the selection within the article, not the selected characters themselves. So to be clear, the problem only happens when the duplicate entries both came from the same place in the article and are coded for the same variable.
@olderwoman, @matthewsmorgan, and @alexhanna, I think I need your input on this. I think there are several ways we could fix this bug in the long term, and I'm not sure which is best.
The current behavior for a variable with multiple entries of text captured from the same position in an article is that when one entry is deleted, only that entry is deleted from the UI but all matching entries are deleted from the DB. So to fix this, we could do any of these:
Cost of implementation is probably about the same for all three.
It seems to me that there is no case for which we'd ever want to collect the same text in the same position multiple times for one variable, so my inclination is that option 1 is best? But maybe I'm missing some reason why we might occasionally want duplicate entries? Or overlooking another way to fix this?
Sorry for not addressing this in more detail, but option 1 makes the most sense to me. I can't think of a theoretical reason why you'd want the same text in the DB more than once.
The cost of implementation may not be more onerous for (1), but the computational cost could be because to do this operation in O(1) time you either have to put a unique constraint on the variable, value, and text columns, or you have to do an O(n) scan of the table first.
Isn't (3) the current behavior?
(2) would be the safest, since you can just key the delete operation to the primary key in the table. I think.
The cost of implementation may not be more onerous for (1), but the computational cost could be because to do this operation in O(1) time you either have to put a unique constraint on the variable, value, and text columns, or you have to do an O(n) scan of the table first.
Is O(n) what happens when we do something like len(execute("SELECT * FROM coder_event_creator WHERE event_id = [whatever] AND variable = [whatever] AND value = [whatever]")) > 0
? And if so, like, do we really care about O(n) here?
Though frankly, a unique constraint might be the most elegant solution -- I hadn't thought of that. That would hardly have any cost at all that I can see aside from a bit of storage space for the index? Once you've solved the DB migration problem of course. :)
- When an entry is deleted, delete all matching entries from the DB and the UI
Isn't (3) the current behavior?
Nope. Current behavior is:
...which is the bug.
(2) would be the safest, since you can just key the delete operation to the primary key in the table. I think.
Yep, agreed, that's the one that gives the coders the most control. But I think (1) is cleaner and follows the logic of the coding and analysis better? (If it doesn't cause unforeseen problems of course?)
And if so, like, do we really care about O(n) here?
Our n
right now is around 200,000, so yes.
Unique constraint would make most sense but yeah it'd require a DB migration.
- When an entry is deleted, delete only that item from the UI and all matching entries from the DB
Oh, that behavior is gross. That means there's a mismatch between the UI and the DB. (3) is kind of confusing, though.
(2) doesn't require a database migration, so there's that. But yeah it'd require reprogramming some AJAX, which I know is your favorite.
One question I forgot to ask. What is the behavior if text is marked first for one variable and then another and then delete one of them? E.g. first I mark a name as a named organization, then mark it as an actor, then delete named organization. Does that also produce the bug? Or only if the variable is also the same?
Assuming it occurs when the variable is also the same, it probably saves everybody the most work if the action is blocked with error message “this text already marked for this field.”
Here's where the rubber meets the road for this one:
Note that the value field contains the position of the selection within the article, not the selected characters themselves. So to be clear, the problem only happens when the duplicate entries both came from the same place in the article and are coded for the same variable.
Good. Same variable same place. Addresses the comment I made before seeig this.
And if so, like, do we really care about O(n) here?
Our
n
right now is around 200,000, so yes.Unique constraint would make most sense but yeah it'd require a DB migration.
I'm already desensitized to running ALTER TABLE
s on the back end, so this is looking better and better to me. But let's talk about DB migration on Wednesday. I'll also ask you some further now-even-more-academic-than-usual questions about computational time for simple SELECT
queries.
(3) is kind of confusing, though.
It could work with an error message maybe, but I agree that the other options are probably clearer and better.
(2) doesn't require a database migration, so there's that. But yeah it'd require reprogramming some AJAX, which I know is your favorite.
Meh, I was figuring that all of them would need some AJAX tweaks until you suggested the unique constraint. Although #85 is probably as good a lesson as any of us will need of why to stay paranoid...
Assuming it occurs when the variable is also the same, it probably saves everybody the most work if the action is blocked with error message “this text already marked for this field.”
I wonder whether it'd even need the error message? If the user reasonably believes that clicking the plus means to include the marked text in the field, then not changing anything when it's already in the list is the expected behavior? Although you're right that more communication rarely hurts...
I think option 1 is fine, assuming it's text within the same field (e.g., text marked for actor). I'd like students to still have the option to mark something as "actor" and "named organization," for example, if it makes sense in that case or if they're unsure of what to do. But it seems like that's not what is causing the bug.
If we move forward with the above option, I think we won't necessarily need the error message. It sounds like, from what students are reporting in their emails to us (which in several cases it sounds like this is happening at least once per week), they often click twice because the text doesn't immediately show after they hit the button. It sounds like this is an issue with slow internet connections, but I can say from the Pass 1 coders' perspective it's sometimes difficult to tell if the internet is slow or if you just didn't hit the button the first time because they're small. That's just to say, as long as they see that the text is there I don't think they'd click the button a second time, so the error message may be an annoyance.
On Mon, Jun 8, 2020 at 8:27 PM davidskalinder notifications@github.com wrote:
Assuming it occurs when the variable is also the same, it probably saves everybody the most work if the action is blocked with error message “this text already marked for this field.”
I wonder whether it'd even need the error message? If the user reasonably believes that clicking the plus means to include the marked text in the field, then not changing anything when it's already in the list is the expected behavior? Although you're right that more communication rarely hurts...
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/davidskalinder/mpeds-coder/issues/89#issuecomment-640973520, or unsubscribe https://github.com/notifications/unsubscribe-auth/AN4I3JVTKHGIRZ57KVSOK6LRVWFW7ANCNFSM4NQDEXXQ .
Okay thanks folks. Looking pretty good for option 1, but I'll hold off on implementation until @alexhanna and I catch up about DB migration strategy tomorrow.
Thanks, @davidskalinder. I merged the temporary fix into our instantiation.
All right, so now that #92 is just about wrapped up, I think we should start paying attention to this one again.
Adding a unique constraint to the DB seems like the most elegant solution by a good margin, but it's becoming increasingly clear to me that making this issue dependent on me figuring out a proper DB migration setup (#97) means that it almost certainly won't happen for a while (probably after my Aug. 3 prelim and recovery therefrom).
So @alexhanna, how do we feel about going ahead now and adding the constraint the old stupid way? Or would we rather keep the temporary solution from #92 in place until we get Alembic set up, even if that's a couple months away?
The answer from @alexhanna and I's conversation: yeah, let's do it. But even the usual complicated manual-DB-migration process will be extra complicated in this case since adding a unique constraint will require deleting the old duplicates (which should probably be marked somehow for further attention?)... So I'll try to come up with some as-tidy-as-possible SQLs to do all that as part of the migration.
CCing @alexhanna since this is likely to affect her team also.
To reproduce, select some text and add it to a text-select variable twice, then delete one of them. The one remaining will be displayed, but both entries will be deleted from the database. Reloading the page will show that all of the entries of the variables with that text value are gone.
This is hopefully fairly rare, but it's very likely to cause some data loss when it happens, so we should fix this.