Concurrency problem - Githubissues

GoogleCodeExporter commented 8 years ago

Currently Rosetta stores the entire .po file in a Django session. This means if 
two translators are looking at the same file, they will overwrite each other's 
changes, as the file is loaded only once, when the translator picks the file.

The attached patch reloads the file each time the home() view is loaded, if the 
.po file is editable. This should, hopefully, get translators to import each 
other's changes frequently enough to make concurrency less of an issue. Note 
that this patch does not resolve the concurrency problem. There are still no 
conflict detection/resolution code, nor any locks. In particular, if both 
translators happen to be editing the same page, overwriting can still happen.

Original issue reported on code.google.com by alass...@gmail.com on 15 Jun 2010 at 4:58

Attachments:

views.py.patch

GoogleCodeExporter commented 8 years ago

Thank you for this patch. Concurrency has been bothering me (and our clients) 
quite alot and I've been thinking about the best way to resolve the problem. 
You mentioned file locks...

There are related issues, like what'd happen if the original po file gets 
updated (through makemessages/xgettext) while someone is editing it.

I'll give your patch a try, though, and write a couple (much needed) test cases.

Original comment by mbonetti on 15 Jun 2010 at 6:30

GoogleCodeExporter commented 8 years ago

Thanks for the response. Specifically, the overwriting I mentioned happens if:

1. A string shows up on both translators' page;
2. Translator A makes a change to the string, and saves;
3. Translator B saves;
4. Translator A's change is lost, even if B did not changes that string.

This is something that might possibly be fixed by detecting if a field has been 
changed before updating the .po file. If B did change the field, it's 
reasonable to let him overwrite A's change.

Regarding to the .po file being updated elsewhere, I found a bug with my patch: 
Rosetta uses sequential index instead of msgids; if a string is removed or 
added, the .po file will have a different index from the one Rosetta knows, 
which might lead to submitted translations being put in wrong slots. This can 
be fixed by using POFile.find(msgid) to get the right entry, but I'm not sure 
what's the best way to pass around msgid. One idea is to use a hash of msgid 
instead of sequential id, then either iterate through the new .po file to 
compare hashes, or keep a hashtable in Rosetta code to get the msgid, or preach 
polib to add the hash as a possible lookup field. The other is to get it from 
the cached POFile in session, which would have the right order of entries. What 
do you think?

Original comment by alass...@gmail.com on 16 Jun 2010 at 10:23

GoogleCodeExporter commented 8 years ago

A colleague of mine mentioned the possibility of using Django's cache framework 
in the case of read-only .po files. Because Django cache is per-site, using it 
would make handling read-only .po files more similar to the normal case. 
Instead of storing the entire file in a session, we can just store a hashtable 
of translation entries that are on the user's current page.

Original comment by alass...@gmail.com on 16 Jun 2010 at 11:18

GoogleCodeExporter commented 8 years ago

A hashtable-based solution using  sha1 or md5 hashes of the msgid as keys is a 
good idea to avoid clashes, but as you mention, there is no easy way to update 
the underlying POFile object.

Using the cache framework is tempting, but we cannot relay on the fact that the 
user has enabled caching. Also, IIRC locmem is the default setting, which is 
thread- and process-safe (i.e. not site-wide), so it'd be equivalent to using 
the session anyway.

Original comment by mbonetti on 16 Jun 2010 at 11:54

GoogleCodeExporter commented 8 years ago

I had a go at hashing. Basically I changed the code attaching sequential id to 
the entry objects to use a md5 hash instead; the hash is then used to look up 
the entry in the reloaded POFile using find(). This is not a proper hashtable 
solution - it relies on md5 hashes being unique. In our case, since we're 
dealing with natural languages, I think the chance of collision is negligible.

Original comment by alass...@gmail.com on 30 Jun 2010 at 1:35

Attachments:

views.py.patch

GoogleCodeExporter commented 8 years ago

Thank you for the patch, I'll give it thorough look as soon as I can get some 
free time.

Original comment by mbonetti on 2 Jul 2010 at 1:40

GoogleCodeExporter commented 8 years ago

A quick update to let you know I applied both patches and all the unit tests 
run fine. I'll write another couple tests to simulate concurrency related 
issues and I'll commit this. Thank you for your help!

Original comment by mbonetti on 3 Jul 2010 at 1:12

Changed state: Started

GoogleCodeExporter commented 8 years ago

Cheers, this is fixed as of r98. 

I slightly changed your approach in that I use md5(msgid) + md5(msgstr) as the 
hash for the individual entries. 

This has the advantage that if user B submits a translation for a string which 
was translated by user A after it was last loaded by B, the hash won't match 
(because md5(msgstr) has changed) and we can warn user B by displaying a scary 
warning note.  I added a couple unit tests to cover these cases.

Thanks again for your support on this.

Original comment by mbonetti on 4 Jul 2010 at 11:27

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Thanks, would you be doing a new release soon?

Original comment by alass...@gmail.com on 6 Jul 2010 at 2:07

GoogleCodeExporter commented 8 years ago

definitely, I was just hoping that I'd get some feedback if anything was wrong 
with the trunk...

Original comment by mbonetti on 6 Jul 2010 at 2:17

carlxaeron / django-rosetta

Concurrency problem #77