pgalinanes closed this 1 year ago
Thank you for the heads up. I've started working on this on the blank_pages branch.
The problem had to do with the MediaBox (you can think of it as the position of the "camera" rendering the PDF). When we clean a PDF, we need to move the MediaBox back on top of the document; otherwise we just get a blank page.
Here is a simple diagram explaining the bug:
Excellent, thank you for your explanation. How do we know where the content is on each page?
There are at least two ways of doing it:
The easiest one is to use the MediaBox of the original document. If it's a valid, usable document, the camera will always be directly over the content, and we can use that information to move the new MediaBox accordingly.
If for whatever reason we have lost the original MediaBox, we can always analyse the page's /Contents stream (which contains the commands that draw the content of each page) and see where the content is being drawn.
This second method is only really useful for fixing broken PDFs or something similar. In every other case, using the original MediaBox is preferred.
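The two approaches above can be sketched roughly as follows. This is only an illustration of the idea, not the code gulagcleaner uses: the function names `move_box_over` and `content_bbox` are hypothetical, boxes are plain `(llx, lly, urx, ury)` tuples, and the content stream is assumed to be already decompressed text. The stream scan is deliberately naive: it splits on whitespace, looks only at the `m`, `l` and `re` path operators, and ignores `cm` transforms, text, images and curves.

```python
from typing import List, Optional, Tuple

# (llx, lly, urx, ury), the same order a PDF /MediaBox uses
Box = Tuple[float, float, float, float]


def move_box_over(original: Box, new: Box) -> Box:
    """Method 1: keep the size of `new`, anchor it at the lower-left corner of `original`."""
    width = new[2] - new[0]
    height = new[3] - new[1]
    return (original[0], original[1], original[0] + width, original[1] + height)


def content_bbox(stream: str) -> Optional[Box]:
    """Method 2: rough bounding box of the points used by `m`, `l` and `re`.

    Assumption: whitespace-separated tokens and no coordinate transforms,
    so the result is only an estimate of where the page draws.
    """
    xs: List[float] = []
    ys: List[float] = []
    operands: List[float] = []
    for token in stream.split():
        try:
            operands.append(float(token))
            continue  # numbers are operands; wait for the operator
        except ValueError:
            pass
        if token in ("m", "l") and len(operands) >= 2:
            x, y = operands[-2:]
            xs.append(x)
            ys.append(y)
        elif token == "re" and len(operands) >= 4:
            x, y, w, h = operands[-4:]
            xs += [x, x + w]
            ys += [y, y + h]
        operands = []  # any non-numeric token resets the pending operands
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical page whose content sits one page-height above the origin.
original_box = (0.0, 842.0, 595.0, 1684.0)
default_box = (0.0, 0.0, 595.0, 842.0)
moved = move_box_over(original_box, default_box)

sample_stream = "q 50 900 200 100 re f 60 1000 m 300 1500 l S Q"
estimate = content_bbox(sample_stream)
```

With the hypothetical values above, `moved` lands back on the original box, and `estimate` falls inside it, which is the consistency check one would want before trusting the second method.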
After installing it manually (copying the contents of the gulagcleaner folder into the Python lib location), I'm getting the same output: a blank PDF. I've checked the version with the -v parameter and it reports 0.8.0, so I think I'm on the correct version. If I use the web version, I can clean the docs successfully, but when using the CLI I get the same result with different PDFs. How do you test the CLI version? Any advice?
When using the web version hosted on https://angeloyo.github.io/gulagcleaner.com/ I'm getting the same result, a blank PDF with multiple pages, but when using https://gulagcleaner.com/ the PDFs are cleaned correctly.
I just updated the version on PyPI and everything should be working.
Try installing the latest version (0.8.1 right now) and tell me if it gives you any errors.
pip install gulagcleaner --upgrade
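To double-check which version Python actually sees after the upgrade, something like the sketch below may help. The helpers `installed_version` and `is_at_least` are made up for this example; only `importlib.metadata` is standard library. It assumes plain numeric versions such as 0.8.1. Note that a package copied by hand into the lib folder usually has no install metadata, so it would show up here as not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the version pip recorded for `package`, or None if there is no metadata."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def as_tuple(v: str) -> Tuple[int, ...]:
    """Turn '0.8.1' into (0, 8, 1). Assumes purely numeric, dot-separated parts."""
    return tuple(int(part) for part in v.split("."))


def is_at_least(found: str, wanted: str) -> bool:
    """True if `found` is the same as or newer than `wanted`."""
    return as_tuple(found) >= as_tuple(wanted)


found = installed_version("gulagcleaner")
if found is None:
    print("gulagcleaner is not installed through pip (or has no metadata)")
elif is_at_least(found, "0.8.1"):
    print(f"gulagcleaner {found} is recent enough")
else:
    print(f"gulagcleaner {found} is older than 0.8.1, try upgrading")
```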
Right now I test the CLI manually, because I still haven't thought of any way to check whether a PDF is "clean" apart from checking by hand. Please let me know if you find a better way, because it would be very nice to have automated testing.
Works flawlessly! I've thought of a way to test the CLI version in a GitHub Actions repo. Give me some time and I'll come back with a testing repo.
I've built a Python script that checks whether the cleaner works. First of all, there are three folders: og_cleaned, new_cleaned and new_dirty.
If the documents being compared have the same number of words, images and pages, the script assumes they are the same and returns 1 as output. If they are not the same, it returns 0 and a short message explaining which document is different.
The PDFs located in new_dirty and new_cleaned are always the same. The script only modifies the files located in new_cleaned.
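The comparison described above could look roughly like this. It is a sketch and not the actual Gulag-Tester code: `DocStats` and `compare` are hypothetical names, the counts are assumed to have been extracted beforehand with whatever PDF library is at hand, and it assumes og_cleaned holds the reference output while new_cleaned holds the freshly cleaned file.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DocStats:
    """Counts extracted from one PDF (extraction itself is not shown here)."""
    name: str
    pages: int
    words: int
    images: int


def compare(reference: DocStats, candidate: DocStats) -> Tuple[int, str]:
    """Return (1, "") if all counts match, else (0, message naming the differences)."""
    diffs = [
        f"{field}: {getattr(reference, field)} != {getattr(candidate, field)}"
        for field in ("pages", "words", "images")
        if getattr(reference, field) != getattr(candidate, field)
    ]
    if diffs:
        return 0, f"{candidate.name} differs ({', '.join(diffs)})"
    return 1, ""


# Made-up numbers: a blank output keeps the page count but loses words and images.
reference = DocStats("og_cleaned/notes.pdf", pages=3, words=1200, images=4)
good = DocStats("new_cleaned/notes.pdf", pages=3, words=1200, images=4)
blank = DocStats("new_cleaned/notes.pdf", pages=3, words=0, images=0)

ok_result = compare(reference, good)
blank_result = compare(reference, blank)
```

Matching counts do not prove two PDFs are identical, but a blank output like the one reported in this issue would fail the word and image checks right away.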
https://github.com/pgalinanes/Gulag-Tester
To run it you'll need the folders with PDFs. I don't think I can upload them here, for copyright reasons.
When using gulagcleaner via the CLI, I get a blank PDF as a result. I tried on a freshly installed Linux and got the same thing. Any idea why?