YM162 / gulagcleaner

Ad-removal tool for PDFs in Python, JavaScript and Rust.
http://gulagcleaner.com
GNU General Public License v3.0
98 stars, 9 forks

Blank pages #7

Closed: pgalinanes closed this issue 1 year ago

pgalinanes commented 1 year ago

When using gulagcleaner via the CLI, I get a blank PDF as a result. I tried on a freshly installed Linux system and got the same thing. Any idea why?

YM162 commented 1 year ago

Thank you for the heads up. I've started working on this on the blank_pages branch.

The problem has to do with the MediaBox (you can think of it as the position of the "camera" rendering the PDF). When we clean a PDF, we need to move the MediaBox back on top of the document; otherwise we just get a blank page.
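To make the camera analogy concrete, here is a minimal sketch in plain Python. It is not gulagcleaner's actual code: it only assumes that a box is a list of four numbers, [llx, lly, urx, ury], as in a PDF MediaBox array. The function names and all coordinates are hypothetical.

```python
from typing import List

Box = List[float]  # [lower-left x, lower-left y, upper-right x, upper-right y]


def overlaps(media_box: Box, content_box: Box) -> bool:
    """Return True if any part of the content falls inside the MediaBox."""
    m_llx, m_lly, m_urx, m_ury = media_box
    c_llx, c_lly, c_urx, c_ury = content_box
    return m_llx < c_urx and c_llx < m_urx and m_lly < c_ury and c_lly < m_ury


def restore_media_box(original_media_box: Box) -> Box:
    """Hypothetical fix: reuse a copy of the original page's MediaBox."""
    return list(original_media_box)


# Hypothetical numbers, chosen only to illustrate the idea.
original_box = [0.0, 0.0, 595.0, 842.0]        # MediaBox of the original page
content_box = [50.0, 50.0, 545.0, 792.0]       # where the page content sits
misplaced_box = [700.0, 900.0, 1295.0, 1742.0]  # "camera" pointing at empty space

print(overlaps(misplaced_box, content_box))                    # False: blank page
print(overlaps(restore_media_box(original_box), content_box))  # True: content visible
```

If the MediaBox and the content do not overlap, the viewer renders an empty region, which is the blank page described above.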

Here is a simple diagram explaining the bug: blank_pages

pgalinanes commented 1 year ago

Excellent, thank you for your explanation. How do we know where the content is on each page?

YM162 commented 1 year ago

There are at least two ways of doing it:

But this second method would only be useful for fixing broken PDFs or something similar. In every other case, using the information from the original MediaBox is preferred.

pgalinanes commented 1 year ago

After installing it manually (copying the contents of the gulagcleaner folder into the Python lib location), I'm getting the same output: a blank PDF. I've checked the version with the -v parameter and it reports 0.8.0, so I think I'm on the correct version. If I use the web version, I can clean the docs successfully, but when using the CLI I get the same result with different PDFs. How do you test the CLI version? Any advice?

When using the web version hosted on https://angeloyo.github.io/gulagcleaner.com/ I'm getting the same result, a blank PDF with multiple pages, but when using https://gulagcleaner.com/ the PDFs are cleaned correctly.

YM162 commented 1 year ago

I just updated the version on PyPI and everything should be working now.

Try installing the latest version (0.8.1 right now) and tell me if it gives you any errors.

pip install gulagcleaner --upgrade
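After upgrading, one way to confirm which version Python actually sees is the standard library's importlib.metadata. This is only a sketch: the helper names are hypothetical, and the parser assumes a purely numeric dotted version string such as 0.8.1.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.8.1' into (0, 8, 1); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_at_least(package: str, minimum: str) -> bool:
    """Return True if the package is installed at `minimum` or newer."""
    found = installed_version(package)
    return found is not None and parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    print(installed_version("gulagcleaner"), is_at_least("gulagcleaner", "0.8.1"))
```

This helps rule out the case where a manually copied folder shadows the copy installed by pip.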

Right now I test the CLI manually, because I still haven't thought of any way to check whether a PDF is "clean" apart from checking it by hand. Please let me know if you find a better way, because it would be very nice to have automated testing.

pgalinanes commented 1 year ago

Works flawlessly! I've thought of a way to test the CLI version in a GitHub Actions repo. Give me some time and I'll come back with a testing repo.

pgalinanes commented 1 year ago

I've built a Python script that checks whether the cleaner works. First of all, there are three folders: og_cleaned, new_cleaned and new_dirty.

If two corresponding documents have the same number of words, images and pages, the script assumes they are the same and returns 1 as output. If they are not the same, it returns 0 and a short message explaining which document is different.

The PDFs located in new_dirty and og_cleaned are always the same. The script only modifies the files located in new_cleaned.
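The comparison rule described above could be sketched roughly like this. It is not the code from the Gulag-Tester repo: it assumes the page, word and image counts have already been extracted by some PDF library, and that og_cleaned is compared against new_cleaned. PdfStats, compare and compare_folders are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PdfStats:
    """Counts extracted from one PDF (the extraction itself is not shown)."""
    pages: int
    words: int
    images: int


def compare(name: str, reference: PdfStats, candidate: PdfStats) -> Tuple[int, str]:
    """Return (1, '') if the counts match, else (0, explanation)."""
    if reference == candidate:
        return 1, ""
    return 0, f"{name} differs: expected {reference}, got {candidate}"


def compare_folders(
    og_cleaned: Dict[str, PdfStats], new_cleaned: Dict[str, PdfStats]
) -> Tuple[int, str]:
    """Compare every reference document against its freshly cleaned counterpart."""
    for name, reference in sorted(og_cleaned.items()):
        candidate = new_cleaned.get(name)
        if candidate is None:
            return 0, f"{name} is missing from new_cleaned"
        result, message = compare(name, reference, candidate)
        if result == 0:
            return result, message
    return 1, ""


# Hypothetical data standing in for real extracted counts.
og = {"notes.pdf": PdfStats(pages=3, words=1200, images=2)}
new = {"notes.pdf": PdfStats(pages=3, words=0, images=0)}  # e.g. blank output
print(compare_folders(og, og))   # (1, '')
print(compare_folders(og, new))  # (0, 'notes.pdf differs: ...')
```

Matching counts do not prove two PDFs are identical, but a blank output like the one in this issue would fail the word and image checks.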

https://github.com/pgalinanes/Gulag-Tester

To run it you'll need the folders with PDFs. I don't think I can upload them here, for copyright reasons.