TheRosettaFoundation / SOLAS-Match

Self-managed translation project interface
www.TheRosettaFoundation.org
GNU Lesser General Public License v3.0

Automatic Word Count Generation #1276

Closed alanbarrett closed 7 years ago

alanbarrett commented 7 years ago

@mari-na requested Automatic Word Count Generation. We will provide this with API calls to a MateCat instance (provided by @mirko33 ).

When a new Project or Task is created, I will generate records in the database indicating that the user-entered word count needs to be updated (overwritten). Later we will add wording to indicate for which file types the project submitter does not need to enter a word count.

On "cron", every minute we will run a web page (a PHP script using Slim) which will submit outstanding projects to MateCat and, a minute later (on the subsequent cron trigger), request the word count and update the projects and tasks.
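
A minimal sketch of the two-step cron flow described above, written in Python purely for illustration (the actual implementation is a PHP script). The state names, the `WordCountRequest` record, and the `submit` and `fetch_count` callables are all hypothetical stand-ins, not the real SOLAS-Match or MateCat code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical states for a word-count request record (names are illustrative).
PENDING, SUBMITTED, DONE = "pending", "submitted", "done"


@dataclass
class WordCountRequest:
    project_id: int
    state: str = PENDING
    matecat_id: Optional[int] = None
    matecat_pass: Optional[str] = None
    word_count: Optional[int] = None


def cron_tick(
    requests: List[WordCountRequest],
    submit: Callable[[int], Tuple[int, str]],
    fetch_count: Callable[[int, str], int],
) -> None:
    """One cron run: poll requests submitted on an earlier run, submit new ones."""
    for req in requests:
        if req.state == SUBMITTED:
            # Submitted on a previous tick, so ask for the word count now.
            req.word_count = fetch_count(req.matecat_id, req.matecat_pass)
            req.state = DONE
        elif req.state == PENDING:
            # New request: submit it and wait for the next tick to poll.
            req.matecat_id, req.matecat_pass = submit(req.project_id)
            req.state = SUBMITTED
```

Because each request moves at most one state per run, a project submitted on one cron trigger is only polled on the next one, which gives the one-minute gap described above.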

Any comments/suggestions?

Alan.
alanbarrett commented 7 years ago

I have now completed the code for this. It has not been tested (or even checked for syntax errors). Here are the changes: https://github.com/TheRosettaFoundation/SOLAS-Match/compare/word-count

I have some questions...

1) I have done preliminary API tests with www.matecat.com. This seems to return good word counts! I am slightly worried that the clean copy of MateCat that @mirko33 is planning to install might not have all the word count functionality. Does it use Microsoft Office or proprietary software that might not be freely available to us? That thought just occurred to me last night.

2) If the initial call to MateCat fails, unless it was a comms failure, I give up and never try that file again. Is that reasonable? It could be that MateCat does not support that file and I don't want to try forever.

3) If the status call to MateCat fails one minute later, I give up on that file (my assumption is that the comms should not have gone down within a minute).

4) I am dealing with the word count for a new project and its created tasks. I am currently not dealing with creating an additional task or with creating (de)segmentation tasks.

5) MateCat deals with a restricted set of source languages. If I get a language not currently covered by MateCat, I pretend the file language is 'en-US' and hope the word count works with that.

6) I always set the target language to 'es-ES' as I do not want MateCat to return errors based on the target language.
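
Points 2, 3, 5 and 6 above can be sketched roughly as follows. This is an illustration in Python, not the PHP code in the branch. `SUPPORTED_SOURCE_LANGUAGES` is a made-up subset (the real MateCat list is not given in this thread), and the function names and the `"new"`/`"status"` phase labels are assumptions.

```python
from typing import Tuple

# Hypothetical subset; the real list of MateCat source languages is not in this thread.
SUPPORTED_SOURCE_LANGUAGES = {"en-US", "en-GB", "fr-FR", "it-IT", "pt-BR"}

FALLBACK_SOURCE = "en-US"  # point 5: pretend unsupported sources are en-US
FIXED_TARGET = "es-ES"     # point 6: always send the same target language


def matecat_languages(source: str) -> Tuple[str, str]:
    """Return the (source, target) pair to send for a word-count-only project."""
    if source not in SUPPORTED_SOURCE_LANGUAGES:
        source = FALLBACK_SOURCE
    return source, FIXED_TARGET


def should_retry(phase: str, comms_failure: bool) -> bool:
    """Points 2 and 3: retry only a comms failure on the initial call."""
    return phase == "new" and comms_failure
```

For example, `matecat_languages("ga-IE")` falls back to `("en-US", "es-ES")` under the assumed language set, and `should_retry("status", True)` is `False` because a failed status call is never retried.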

@mirko33 when do you think you will have MateCat installed?

Alan.
mirko33 commented 7 years ago
  1. The MateCat filter capability is based on open source technology which in turn is made available in a public (but paid-for) API which TWB is subscribed to. So we will be using the exact same filters involved in word counting as does Matecat.com
  2. I agree
  3. This may not work for very large documents, but then I presume they're rare and would need some manual intervention anyway.
  4. Fair enough, although results may be unexpected for certain scripts
  5. Makes sense.
alanbarrett commented 7 years ago

I have finished coding this and testing it on the dev server.

See commit 0068a6cac79766ca82c3c658431d8132de57b819

@mirko33 , I need to know when you will be ready for me to test with Kató and what URL to use.

Alan.
mirko33 commented 7 years ago

Great, thanks. I expect the new MateCat instance to be available by end of day Monday (June 19). We're setting up a new dedicated one finally, rather than using an existing dev instance.

alanbarrett commented 7 years ago

@mirko33 , @mari-na ,

I've been looking at matching up Trommons languages with MateCat Languages so we can pick target languages in MateCat that can pretty accurately reflect the target language/country pairs in Trommons (Trommons allows any pair while MateCat has a more restricted set).

The reason to do that would be to validly set up the targets in MateCat rather than pick "es-ES" as my code currently does. Is there any value in this? Would we use these target language pairs? Obviously this would be in order to do some analysis of potential translations or to do actual translation in MateCat. Is there any point thinking about this at present?

Also, what will the New and Status URLs be in Kató? For matecat.com I am using https://www.matecat.com/api/new and https://www.matecat.com/api/status

Alan.
mirko33 commented 7 years ago

Yes, I think that's a good forward-looking idea. We keep running into problems with our "Workspace" because it uses an undocumented set of proprietary language codes, and we have to maintain a table in Kató to make them match. For Kató, the URLs are https://kato.translatorswb.org/... We've not entirely completed the move to that URL yet, but that only affects the integration with the Workspace API (which still expects http://ts.translatorswithoutborder.org).

alanbarrett commented 7 years ago

@mirko33

I did a test with https://kato.translatorswb.org/api/new and subsequent https://kato.translatorswb.org/api/status?id_project=1442&project_pass=cb3fdd5362e1 I have to divide by the number of language pairs to get the correct word count, but it seems to work otherwise.
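
The division mentioned above might look like the sketch below (Python for illustration only). It assumes the status call reports one total across all language pairs and that each pair contributes the same count. The function name and the sample total of 1616 are hypothetical, not values taken from the API.

```python
def per_pair_word_count(reported_total: int, target_languages: str) -> int:
    """Divide the total reported for a project by its number of language pairs."""
    # target_languages is a comma-separated list such as "it-IT,de-DE".
    pairs = [t for t in target_languages.split(",") if t.strip()]
    if not pairs:
        raise ValueError("at least one target language is required")
    return round(reported_total / len(pairs))


# Illustrative total for a project with four target languages.
print(per_pair_word_count(1616, "it-IT,hr-HR,ru-RU,de-DE"))  # prints 404
```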

When I (manually) go to https://kato.translatorswb.org/analyze/project-1/1442-cb3fdd5362e1 I am brought to https://kato.translatorswb.org/login and cannot proceed. Can I be given access?

Alan.
mirko33 commented 7 years ago

I've now given alanabarrett0@gmail.com access. That access is limited to your own projects and projects created by TWB staff, as only users with a @ translatorswithoutborder.org email address have full access (i.e. access to projects created by other organizations).

Mirko

alanbarrett commented 7 years ago

@mirko33

Great! How does it know what "my own projects" are? I do not submit anything identifying in the post (it is anonymous). But I can now see the Analyse URLs.

Alan.
mirko33 commented 7 years ago

Good point :-) The API uses a default user name, translated_user. Please use the owner_email parameter when creating the project. I suppose we should at least be refusing anonymous calls and those not made in the name of a registered user. As to the value of the owner_email parameter, I suppose we could use/create a generic email account for Rosetta/Trommons.
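
As a rough illustration of passing `owner_email` when creating a project, the sketch below builds the form fields for such a request (Python for illustration only). Only `owner_email` is confirmed by this thread. The other field names (`project_name`, `source_lang`, `target_lang`), the helper name, and the example address are assumptions.

```python
from typing import Dict, List


def new_project_fields(
    project_name: str,
    source_lang: str,
    target_langs: List[str],
    owner_email: str,
) -> Dict[str, str]:
    """Build form fields for a project-creation call, refusing anonymous requests."""
    if "@" not in owner_email:
        # Without an owner the project is stored under the default API user.
        raise ValueError("owner_email must be set to a registered user's address")
    return {
        "project_name": project_name,            # assumed field name
        "source_lang": source_lang,              # assumed field name
        "target_lang": ",".join(target_langs),   # assumed field name
        "owner_email": owner_email,              # parameter named in this thread
    }


fields = new_project_fields("proj-1", "en-US", ["it-IT", "de-DE"], "projects@example.org")
```

Checking for an owner on the client side does not replace the server-side refusal of anonymous calls suggested above; it only avoids sending such calls in the first place.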

alanbarrett commented 7 years ago

I have just now (commit 151340cdefbfe128ca5249edf8f1fc67eb79ff81) set the owner_email to "info@trommons.org". Will this be accepted and will I be able to see the Projects in Kató?

I am not yet updating the Trommons word count, I want to wait and see if all goes well.

Alan.
alanbarrett commented 7 years ago

Here are the projects so far today...

+------------+--------------------+-------------------------+-----------------+-------------------------+-----------------+--------------------+-------+
| project_id | matecat_id_project | matecat_id_project_pass | source_language | target_languages        | user_word_count | matecat_word_count | state |
+------------+--------------------+-------------------------+-----------------+-------------------------+-----------------+--------------------+-------+
|       7038 |               1460 | a98252f4285d            | pt-BR           | en-IE                   |            1406 |               1426 |     2 |
|       7039 |               1461 | 3b8eb8454db4            | it-IT           | en-US                   |             833 |                835 |     2 |
|       7040 |               1462 | 36e5dc2859be            | en-US           | it-IT,hr-HR,ru-RU,de-DE |             401 |                404 |     2 |
|       7041 |               1463 | 51781c9cd484            | en-US           | it-IT                   |             540 |                551 |     2 |
|       7042 |               1464 | a05b0dd42111            | en-US           | it-IT                   |             538 |                543 |     2 |
|       7043 |               1465 | 5943806ad6d6            | en-US           | it-IT                   |             450 |                432 |     2 |
|       7044 |               1466 | 7c3b3b052436            | en-US           | it-IT                   |             478 |                481 |     2 |
|       7045 |               1468 | 3b6ec2ee5a4b            | fr-FR           | en-US                   |             193 |                204 |     2 |
|       7046 |               1469 | b75c78d8152d            | en-GB           | fr-FR,de-DE,es-ES       |             630 |                631 |     2 |
|       7047 |               1474 | 87aa3b871098            | ca-ES           | en-GB                   |            1301 |               1364 |     2 |
|       7048 |               1477 | d9d414c9b495            | ca-ES           | en-GB                   |            1176 |               1250 |     2 |
+------------+--------------------+-------------------------+-----------------+-------------------------+-----------------+--------------------+-------+

See e.g. https://trommons.org/project/7040/view/ and https://kato.translatorswb.org/analyze/proj-7040/1462-36e5dc2859be

The user_word_count and matecat_word_count are broadly compatible (so the organisations were careful with their estimates in these cases).
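
To put a number on "broadly compatible", the relative difference between the two counts can be computed for each project. This is a small Python sketch using the eleven rows listed above; the variable and function names are illustrative.

```python
# project_id: (user_word_count, matecat_word_count), copied from the listing above.
COUNTS = {
    7038: (1406, 1426),
    7039: (833, 835),
    7040: (401, 404),
    7041: (540, 551),
    7042: (538, 543),
    7043: (450, 432),
    7044: (478, 481),
    7045: (193, 204),
    7046: (630, 631),
    7047: (1301, 1364),
    7048: (1176, 1250),
}


def relative_difference(user_count: int, matecat_count: int) -> float:
    """Signed difference of the MateCat count relative to the user's estimate."""
    return (matecat_count - user_count) / user_count


# Project whose MateCat count deviates most from the user's estimate.
worst_id = max(COUNTS, key=lambda pid: abs(relative_difference(*COUNTS[pid])))

for pid, (user, matecat) in COUNTS.items():
    print(f"{pid}: {relative_difference(user, matecat):+.1%}")
```

On these rows the largest deviation is project 7048, at a little over 6%, and only 7043 has a MateCat count below the user's estimate.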

Remember I am not yet overwriting the user word count in the project and its tasks with the word count from MateCat. I want to see how things go and maybe discuss adding wording to Trommons.

Alan.
mirko33 commented 7 years ago

"info@trommons.org" should be fine. I don't see any project created with that owner yet; so far they've all been stored as created by 'translated_user'. I don't think that setting "info@trommons.org" as owner will have an impact on your ability to access the project page, but I'm not entirely sure. We've not been using the API much yet.

alanbarrett commented 7 years ago

4 new projects have now successfully used 'owner_email' => 'info@trommons.org' e.g. https://kato.translatorswb.org/analyze/proj-7049/1481-125517f3feaa (and I can access)

Alan.
mirko33 commented 7 years ago

Excellent, thanks! On the Kató end, they still show as 'translated_user' (and while they do show in the database, they're not visible in the MateCat management panel), but that's a) not much of a problem for now, and b) entirely a MateCat issue (possibly undocumented API configuration rather than a bug), which we will look at at some point.

alanbarrett commented 7 years ago

Here is what I will do (following on from our discussions)

These will not all be done at once.

Alan.
alanbarrett commented 7 years ago

Just to note we will not be able to do this test anymore... "If word count is greater than 5000, and segmentation is not selected, display warning message to user."

Alan.
mirko33 commented 7 years ago

German translation of "This will be calculated ...": "Die Wörter werden automatisch gezählt. Die Wortzahl wird nach zwei Minuten auf der nächsten Seite angezeigt." (I simplified "the page you reach after submitting this form" to "the next page", mainly because a correct German translation would probably only end up confusing most users)

alanbarrett commented 7 years ago

I believe this is now complete. Projects and their Tasks should now be updating with word counts from Kató.

See commit: 9c555fe2cac81568e5e79100fabf92ff592b5345

I will watch to see that everything goes well. Let me know if there are issues.

Alan.