NRGI / resourcecontracts.org

Resource Contracts
http://resourcecontracts.org
GNU General Public License v2.0

Develop mechanical turk workflow for correction of OCR-ed PDFs #73

Closed anderspeders closed 8 years ago

anderspeders commented 9 years ago

Potentially for sprint 3

anderspeders commented 9 years ago

The purpose of this task is to take OCR text whose quality is not good enough and route it through Mechanical Turk to improve it.

This will, for example, take care of spelling errors and other errors that occurred during the OCR process.

This will be deployed as part of the workflow in the step after the human review of the ABBYY OCR.

If the OCR is considered yellow or red, the text will be sent via the AWS API to Mechanical Turk, where it will be worked on as a task.
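For illustration, the routing rule above could be sketched as follows. This is a minimal sketch, not the project's code: the `OcrQuality` enum and the function name are hypothetical, and the `GREEN` rating is an assumption (the thread only names yellow and red).

```python
from enum import Enum


class OcrQuality(Enum):
    """Rating given during the human review of the ABBYY OCR (names assumed)."""
    GREEN = "green"    # assumed label for "good enough"; not named in the thread
    YELLOW = "yellow"
    RED = "red"


def needs_mturk_correction(quality: OcrQuality) -> bool:
    """Return True when the reviewed OCR text should be routed to Mechanical Turk."""
    return quality in (OcrQuality.YELLOW, OcrQuality.RED)
```

Only texts rated yellow or red would then be queued for a Mechanical Turk task; everything else skips this step.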

As an uploader, I want to receive an alert or update once the text has completed the Mechanical Turk human review of the OCR.

anderspeders commented 9 years ago

Suggest we put this issue forward during sprint 3.

anderspeders commented 9 years ago

Some examples of descriptions of workflow here: http://journal.code4lib.org/articles/6004

and here: https://tonywebster.com/2013/09/mechanical-turk-ocr-paper-reports/

@anjesh If you need directions for this, I suggest we schedule a time to connect in slack. Does backlog mean that this is not targeted under sprint 3?

anjesh commented 9 years ago

No, Backlog doesn't mean that. We are exploring this, and when we have sufficient info to go ahead, we will move it to the ready stage. Thanks for those links.

anjesh commented 9 years ago

Can you create a requester account at https://requester.mturk.com/developer/sandbox? It says that it can be registered from the US only. We can do the testing on the sandbox to see how it works. I am also trying to see how http://www.cloudfactory.com could be helpful; they are based in Nepal. Do you think it makes sense to approach them, or shall we just try MTurk first?

cc @byndcivilization

anjesh commented 9 years ago

Proposed workflow for sending manual transcription tasks to MTurk. Once the PDF is tagged as "Human Transcription required", it is sent to the "HIT Tasks Management System", which is separate from the "ResourceContracts system". Each page is created as a task: one copy is sent to MTurk as a HIT and the other is listed in the System. When the worker does the task, we could use Amazon's notification system to find out that the task is done for the given page. The researcher will go through the transcribed text and take action (either approve or reject) based on the result. Approval will then result in payment to the worker.

image

When the PDF is sent to the "HIT Tasks Management system", it will be sent to MTurk and also listed in the system. On that page we can see the number of tasks (which equals the number of pages), how many of them are completed, and how many are approved or rejected. If all the tasks are completed, there will be an option to send the transcribed text to the resource contracts system. If it has already been sent, the user can still send it again, which will overwrite any existing text in the system.

image
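The per-contract counts described above could be computed roughly like this. It is only a sketch: the status labels and the `summarise` function are hypothetical names, and it assumes a page counts as "completed" as soon as the worker has submitted it, whether or not it was later approved or rejected. The thread does not say whether rejected pages should block sending.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical status labels for one page task.
PENDING, COMPLETED, APPROVED, REJECTED = "pending", "completed", "approved", "rejected"


def summarise(statuses: Iterable[str]) -> Dict[str, object]:
    """Summarise the page tasks of one contract for the listing page."""
    counts = Counter(statuses)
    total = sum(counts.values())
    # Assumption: reviewed pages (approved or rejected) were completed first.
    completed = counts[COMPLETED] + counts[APPROVED] + counts[REJECTED]
    return {
        "total": total,
        "completed": completed,
        "approved": counts[APPROVED],
        "rejected": counts[REJECTED],
        # The "send to resource contracts" option appears once no page is pending.
        "can_send": total > 0 and completed == total,
    }
```

For a four-page contract with one page still pending, `can_send` stays `False` until that last page is submitted.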

The user clicks on the contract and sees the status of the individual pages. If a task is completed, there will be an option to review, approve or reject the work.

image

The user reviews the individual task (the transcribed text) and either approves or rejects it. A rejection will require giving the worker the reasons for it.

image
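To make the proposal concrete, the page-level lifecycle (one task per page, completion by the worker, then approval or rejection with a reason) might be modelled as below. This is a sketch under assumptions: `PageTask`, its fields and the status strings are hypothetical and not taken from the actual system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageTask:
    """One page of a contract, tracked locally and mirrored as a HIT on MTurk."""
    contract_id: str
    page_number: int
    status: str = "pending"            # pending -> completed -> approved/rejected
    text: Optional[str] = None         # transcription submitted by the worker
    rejection_reason: Optional[str] = None


def create_page_tasks(contract_id: str, page_count: int) -> List[PageTask]:
    """Create one task per page, numbered from 1."""
    return [PageTask(contract_id, n) for n in range(1, page_count + 1)]


def complete(task: PageTask, text: str) -> None:
    """Record the worker's transcription for a pending page."""
    if task.status != "pending":
        raise ValueError("only a pending task can be completed")
    task.status, task.text = "completed", text


def approve(task: PageTask) -> None:
    """Approve a completed page; in the proposal this triggers payment."""
    if task.status != "completed":
        raise ValueError("only a completed task can be approved")
    task.status = "approved"


def reject(task: PageTask, reason: str) -> None:
    """Reject a completed page; a non-empty reason for the worker is required."""
    if task.status != "completed":
        raise ValueError("only a completed task can be rejected")
    if not reason.strip():
        raise ValueError("a rejection needs a reason for the worker")
    task.status, task.rejection_reason = "rejected", reason
```

Keeping the transitions in small functions means the review screen can only offer approve or reject once a page has actually been completed.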

Thoughts? @anderspeders

anjesh commented 9 years ago

Moving to Sprint 4. Some research has been done: HIT creation, approving, rejecting, and getting assignments via the API have been tested. This still needs some discussion.
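For readers who want to try the same four operations, here is a hedged Python sketch against the MTurk requester sandbox. It assumes the present-day `boto3` client, which is an assumption on my part: the thread does not say which library or API version was used for the tests. The helper names, the HIT title, the reward and the durations are all hypothetical placeholders.

```python
from xml.sax.saxutils import escape

SANDBOX_ENDPOINT = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"


def sandbox_client():
    """Build a boto3 MTurk client that talks to the sandbox (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers below also work with a stub client
    return boto3.client("mturk", region_name="us-east-1", endpoint_url=SANDBOX_ENDPOINT)


def external_question(url: str, frame_height: int = 800) -> str:
    """Wrap a transcription page URL in MTurk's ExternalQuestion XML."""
    return (
        '<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">'
        f"<ExternalURL>{escape(url)}</ExternalURL>"
        f"<FrameHeight>{frame_height}</FrameHeight>"
        "</ExternalQuestion>"
    )


def create_page_hit(client, page_url: str, reward_usd: str = "0.10") -> str:
    """Create one HIT for one page and return its HITId (values are placeholders)."""
    response = client.create_hit(
        Title="Correct OCR text for one contract page",
        Description="Compare the page image with the OCR text and fix any errors.",
        Reward=reward_usd,
        MaxAssignments=1,
        LifetimeInSeconds=7 * 24 * 3600,
        AssignmentDurationInSeconds=3600,
        Question=external_question(page_url),
    )
    return response["HIT"]["HITId"]


def submitted_assignments(client, hit_id: str) -> list:
    """Return the assignments that workers have submitted for a HIT."""
    response = client.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted"]
    )
    return response["Assignments"]


def review(client, assignment_id: str, approve: bool, feedback: str = "") -> None:
    """Approve (which pays the worker) or reject an assignment with feedback."""
    if approve:
        kwargs = {"AssignmentId": assignment_id}
        if feedback:
            kwargs["RequesterFeedback"] = feedback
        client.approve_assignment(**kwargs)
    else:
        if not feedback.strip():
            raise ValueError("a rejection needs a reason for the worker")
        client.reject_assignment(AssignmentId=assignment_id, RequesterFeedback=feedback)
```

Passing the client in as an argument keeps the helpers testable without network access; `sandbox_client()` would be swapped for a production client only once the workflow is agreed.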

anderspeders commented 9 years ago

Hi @anjesh,

Thanks for this documentation. It is really, really helpful to see it described in this way.

As I understand it, HITs will be on a page-by-page basis, correct?

As for testing contracts I suggest that we start with OLC - for example this contract: https://www.dropbox.com/sh/cvsgi110hm7pxlq/AACmldG6qiIQpZHMYUpiIrbIa/Sierra%20Leone/Goldtree%20Contract.pdf?dl=0

Full repository here: https://www.dropbox.com/sh/cvsgi110hm7pxlq/AADEho80LdFFT7kQOqRrXvbOa?dl=0

anderspeders commented 8 years ago

Unfortunately we are still unclear about these processes and contributors are being caught in dead workflows.

Please:

We have lost several weeks of testing and transcription opportunities and absolutely need to get an operating model going as soon as possible:

This ticket will remain open at priority level until we complete the transcription of at least three contracts.

anderspeders commented 8 years ago

Confirmed today that @anjesh will develop a notification email system for requesters (contract uploaders).

anjesh commented 8 years ago

Closing. MTurk is working fine now. The notification system for MTurk has a new issue: https://github.com/NRGI/resourcecontracts.org/issues/209