OpenITI / corpusbuilder

Corpus Build OCR platform
GNU Affero General Public License v3.0

Improve the result of preprocessing of the PDF files. #1

Open kamilc opened 5 years ago

kamilc commented 5 years ago

Currently, the individual pages that come out of the processing pipeline are much lower in quality than they appear in the original source PDF files.

kamilc commented 5 years ago

@MatthewThomasMiller To answer your question:

> How difficult would it be to enable CB to do the extraction of PDFs automatically?

It does the automatic extraction right now. It's just that it does it suboptimally. I'd need to do a little research into other solutions and then implement one. It feels like a couple of hours' worth of work.

MatthewThomasMiller commented 5 years ago

Please proceed with researching a better solution to the automatic extraction and then let us know the number of hours it would take. Most scholars will upload PDFs, so figuring out how to ensure optimal high quality in this extraction process is very important.

maximromanov commented 5 years ago

Kamil, have you checked the tool for processing PDFs? That one definitely preserves the quality.

kamilc commented 5 years ago

@maximromanov which tool do you have in mind? (I think the comment got cut a bit)

maximromanov commented 5 years ago

I wrote in one of the emails before. pdfimages seems to produce very good results, see: https://askubuntu.com/questions/117143/command-line-tool-to-bulk-extract-images-from-a-pdf

Command example: pdfimages -png PathToPdf.pdf subfolder/filesPrefix


Dr. Maxim Romanov, PhD in Near Eastern Studies (2013, U Michigan) W: https://maximromanov.github.io/ https://maximromanov.github.io/ | E: romanov.maxim@gmail.com • Universitätassistent für Digital Humanities, Institut für Geschichte Universität Wien | Universitätsring 1 | 1010 Wien | E: maxim.romanov@univie.ac.at W: http://ifg.univie.ac.at/en/about-us/staff/digital-humanities/maxim-romanov/; • Senior Research Fellow at “Knowledge, Information Technology, and the Arabic Book” (KITAB, an ERC-Project), led by Prof. Sarah Savant | Aga Khan University, ISMC (London)
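
A minimal sketch of how the `pdfimages` command suggested above could be wrapped in a script. It only assumes that `pdfimages` (from poppler-utils) is on the PATH and that it names its output `<prefix>-NNN.png`. The function names, the default prefix `page`, and the directory layout are hypothetical and are not taken from the corpusbuilder code.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_dir, prefix="page"):
    """Build the argument list for: pdfimages -png <pdf> <output_dir>/<prefix>."""
    return ["pdfimages", "-png", str(pdf_path), str(Path(output_dir) / prefix)]


def extract_images(pdf_path, output_dir, prefix="page"):
    """Run pdfimages and return the extracted PNG files in sorted order."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) was not found on PATH")
    output_dir = Path(output_dir)
    # pdfimages does not create the target directory, so create it first.
    output_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfimages exits with an error.
    subprocess.run(build_pdfimages_command(pdf_path, output_dir, prefix), check=True)
    return sorted(output_dir.glob(f"{prefix}-*.png"))
```

Calling `extract_images("PathToPdf.pdf", "subfolder", "filesPrefix")` would run the same command as in the comment above. Because `pdfimages` writes out the images embedded in the PDF rather than re-rendering the pages, it should avoid the quality loss described at the top of this issue for scanned documents, although that is worth confirming on the actual files.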

kamilc commented 5 years ago

Thanks @maximromanov, I've just confirmed that it's going to work pretty well for us. I'm now working on the integration. It should take between one and two hours.

maximromanov commented 5 years ago

Great!


MatthewThomasMiller commented 5 years ago

Great, please go ahead and do that!

Thanks, Matt

kamilc commented 5 years ago

@MatthewThomasMiller @maximromanov it took me longer than I expected because of how it fits in the whole pipeline. I needed to do a lot of testing (also added unit tests to make sure my code changes don't break anything). I'm doing the final tests in camp1 and will let you know once this is ready to be reviewed.

kamilc commented 5 years ago

@MatthewThomasMiller @maximromanov it turned out the problem was two-fold:

I've addressed both and it can be reviewed in https://camp1.openiti.org/documents/16 (you have to be logged in)

MatthewThomasMiller commented 5 years ago

Great!

@demahoney, can you rerun the document that was being compressed and distorted and see whether the issue is fixed now?

kamilc commented 5 years ago

@demahoney the better pdf processing is only present in camp1 for now: https://camp1.openiti.org You'd need to test over there.

demahoney commented 5 years ago

Dear Matt and Kamil,

I tried to run it on the camp1. But after uploading the document (bar and flashing circles), which was very fast, it did not bring me to a new page where I can select which language to OCR and then have it run. It just stayed on the upload page. I tried this 3 times (test 2-4), but wasn't sure how to find that next step in the process. "Save and continue" and "Save and edit" did not help. I also checked against the original trainer video to make sure I was taking the correct steps and not missing any other inputs.

Please let me know what to do.

Thanks, Dan

kamilc commented 5 years ago

@demahoney Could you please share the file you tried to upload with me so that I could reproduce the issue? I just tried and all went well (meaning - I wasn't able to reproduce).

Here's the screencast of me OCR'ing the testing (Syriac) PDF document: https://transfer.sh/YTrMj/test-camp1.mov

Here's the resulting document: https://camp1.openiti.org/documents/23

demahoney commented 5 years ago

Before sending the document to you, I wanted to try using the camp1 site again to see if it would work. This is what I encountered:

1) When I clicked on the documents tab, this error message popped up:

DataTables warning: table id=document-datatable - Ajax error. For more information about this error, please see http://datatables.net/tn/7

2) Although nothing showed up listed in the Documents section, I was still able to click on "New Document". I filled out the form again, this time with the name ~"April5Test_al-JanadiPart1". It uploaded okay, but it still remained on the same page, i.e. it did not load the new page where I would make the selections for the OCR to run.

3) As an experiment, I clicked on the "Save and continue" button, and then it took me to this error:

RSolr::Error::ConnectionRefused at /admin/documents Connection refused - {:data=>"[{\"id\":\"Document 24\",\"type\":[\"Document\",\"ActiveRecord::Base\"],\"class_name\":\"Document\",\"name_s\":\"April5Test1_al-Janadi_Part1\",\"language_id_i\":\"1\",\"id_im\":\"24\",\"user_id_i\":\"5\",\"contributor_ids_im\":\"5\",\"document_type_id_i\":\"1\",\"sort_author_s\":\"Mahoney\",\"gregorian_date_d\":\"2019-04-05T11:57:12Z\",\"lunar_hijri_date_d\":\"1440-08-01T11:57:12Z\",\"published_b\":\"false\",\"title_text\":\"April5Test1_al-Janadi_Part1\",\"source_name_text\":\"\",\"publisher_text\":\"\",\"summary_text\":\"\",\"body_text_text\":\"\",\"language_text\":\"Arabic\",\"contributors_text\":\"Daniel Mahoney\",\"document_type_text\":\"Book\"}]", :headers=>{"Content-Type"=>"application/json"}, :method=>:post, :params=>{:wt=>:json}, :query=>"wt=json", :path=>"update", :uri=>#<URI::HTTP http://localhost:8982/solr/default/update?wt=json>}

This appeared above a set of application frames opening up on the left and right, containing admin notes.

So, in any case, I've attached the pdf document here.

I feel like these errors are not related to the PDF. But of course I could be wrong. Please let me know if I'm taking the correct steps. I'm just confused now.

Cheers, Dan

Attachment: al-Janadi_Kitab al-suluk fi tabaqat al-'ulama wa al-muluk_Part 1.pdf

kamilc commented 5 years ago

@demahoney I was rolling out the latest changes to the production instance. Could you try again now?
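
The `RSolr::Error::ConnectionRefused` error quoted earlier shows that the application could not connect to Solr at `localhost:8982`, which would be consistent with the search service being down or restarting during a rollout. A minimal sketch of a reachability check, assuming a plain TCP probe is enough to tell "refused" from "listening". The helper name `port_is_open` is made up for illustration and is not part of corpusbuilder.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port are taken from the URI in the error message above.
    print("Solr reachable:", port_is_open("localhost", 8982))
```

This only confirms that something is listening on the port. It does not verify that Solr itself is healthy or that the `default` core exists.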

demahoney commented 5 years ago

@kamilc Yes, give me a couple of minutes.

demahoney commented 5 years ago

@kamilc

  1. The same "table" error is coming up in the Documents tab.

  2. When I drop the PDF file into the upload box, the browser opens the PDF for viewing instead of uploading it.

demahoney commented 5 years ago

@kamilc

I also just tried with a different PDF (with a different file name). It also just opens the PDF for viewing instead of uploading it.

kamilc commented 5 years ago

@demahoney Thank you. I'll need to reproduce it, and since it works just fine in my browser, would you be able to record your screen while you try to upload the file? If you're on a Mac, you could use the built-in QuickTime app for that. Please let me know if you have trouble with it. Thank you again.

kamilc commented 5 years ago

@demahoney oh please wait - I can see something

demahoney commented 5 years ago

@kamilc I've now learned how to record the screen on my Windows machine so let me know.

kamilc commented 5 years ago

@demahoney Good :) I think though that I might have just fixed what I broke during the production rollout (as an off-shoot of the rollout I mean). Could you please try once again?

demahoney commented 5 years ago

@kamilc Yes, the other errors have gone away now. :) But the OCR page still does not automatically load after the uploading is complete. It just remains on the upload page with the thumbs up emoji. I will record this for you now.

kamilc commented 5 years ago

@demahoney Thank you - along with the video, could you also tell me which browser you're using?

demahoney commented 5 years ago

@kamilc GitHub wouldn't let me upload the mp4 file type, and it was too big for email, so here is a Dropbox link:

https://www.dropbox.com/sh/zw7op5m2adljty9/AAAxPKECf_r9hZxM3h2Lv40Ja?dl=0

demahoney commented 5 years ago

@kamilc It's Firefox. The video capture only records my primary application, so you don't see the Windows folder from which the file comes. But all that you missed is me dropping the file from the folder into corpusbuilder.

kamilc commented 5 years ago

@demahoney I'll be out of the office traveling for the next couple of hours. Could you give Chrome a try? I'll need to try Firefox myself too and potentially fix whatever's not working over there. Thank you again for all the help!

demahoney commented 5 years ago

@kamilc I tried it in Chrome, and it had the same (non-)result. file: april5_ChromeTest1_al-JanadiPart1

demahoney commented 5 years ago

@kamilc I also tried in Chrome with a different pdf (part 2 of al-Janadi) file: April5_ChromeTest2_al-JanadiPart2

Same result.

kamilc commented 5 years ago

@demahoney This is really puzzling... I was able to use Firefox and OCR the document you've shared: https://www.openiti.org/documents/26

Is it possible that there are some browser extensions you're using that get in the way? Could you try using the incognito mode?

demahoney commented 5 years ago

@kamilc @MatthewThomasMiller

(1) I tried al-Janadi_Part1 in Firefox Privacy Mode with the same result of the OCR page not loading after the document was uploaded with a thumbs-up emoji.

(2) I tried al-Janadi Part 1 in Chrome Incognito Mode with the same result.

(3) I tried al-Janadi Part 2 in Chrome Incognito Mode with the same result.

(4) I tried al-Razi (new document) in Chrome Incognito Mode with same result.

I don't think it is the pdf that is the issue. Nor the browser. I'm not sure what it could be.

(1) This specific issue did not happen on the main OpenITI site when I uploaded documents (obviously, because I was OCRing on that site). Is there something different about the camp1 site compared with the main OpenITI site?

(2) Have other people tried it successfully other than you? What is the testing situation like right now -- are other people using this right now? On the camp1 or main OpenITI site?

kamilc commented 5 years ago

@MatthewThomasMiller were you able to reproduce the issue Dan's reporting? I haven't been able to at any point :(

@demahoney I've already used lots of your time and I'm very thankful you're devoting it to help here. Would it be possible for you to try again, but this time with DevTools open in Chrome? I'd be interested in what the "Console" tab looks like when that issue happens, and the "Network" tab as well.

Again – Many Thanks for your help!

kamilc commented 5 years ago

You can open the DevTools by right-clicking the page and choosing "Inspect" from the context menu

demahoney commented 5 years ago

@kamilc @MatthewThomasMiller

https://www.dropbox.com/sh/zw7op5m2adljty9/AAAxPKECf_r9hZxM3h2Lv40Ja?dl=0

You can get my video (April 16) from the dropbox link.

In the video I forgot to select a contributor for the upload. But essentially the same thing happened. I don't know if anything showed up on devtools.

demahoney commented 5 years ago

@kamilc @MatthewThomasMiller

I also just put another video in the dropbox folder that has the devtools open and I selected a contributor for the upload this time. So I don't think I made any mistakes. Please let me know and I'll try again.

kamilc commented 5 years ago

@demahoney what I needed to see in the DevTools was the "Console" tab and the "Network" tab. So after opening it up, you'd need to switch away from the "Elements" tab by clicking on "Console".

demahoney commented 5 years ago

@kamilc No problem with asking me to do this. This process is very interesting to me. It's not just about wanting to be able to OCR my texts. :) I'm also learning some computer programming, and at the moment I've recently finished learning HTML, CSS, some PHP and now getting into JavaScript. (I've been working with Python for some time.) I'm only beginning, but being part of this project connects me to what real developers do.

demahoney commented 5 years ago

@kamilc okay. I will do that now.

kamilc commented 5 years ago

@demahoney Thank you. If you ever have any questions while learning to code, please feel free to reach out to me any way you'd like :)

demahoney commented 5 years ago

@kamilc

![Screenshot (2)](https://user-images.githubusercontent.com/36761697/56213084-a949fe80-605b-11e9-98b8-a30f46ed3764.png)

Here is a screenshot of that section, taken before doing another upload and recording it. If it doesn't show up on here, I'll put it in the Dropbox folder.

demahoney commented 5 years ago

@kamilc New video has uploaded. I'll hang out here for about the next 20 minutes, if you want me to do anything else.

demahoney commented 5 years ago

@kamilc

2nd screencapture

Screenshot (3)

kamilc commented 5 years ago

@demahoney this tells me something. Thank you again. I'll troubleshoot it very soon

demahoney commented 5 years ago

@kamilc

screencapture of network tab from most recent upload

Screenshot (4)