YM162 / gulagcleaner

Ad-removal tool for PDFs in Python, JavaScript and Rust.
http://gulagcleaner.com
GNU General Public License v3.0
85 stars 8 forks

StuCleaner removes hyperlinks #24

Open SamuelYLay opened 1 month ago

SamuelYLay commented 1 month ago

When I upload a file to StuCleaner/gulagcleaner, it nicely removes all the watermarks, but it also removes all the hyperlinks, especially the ones in the table of contents, which are useful :( Here are both files, if you need them: math2988-lecture-notes.pdf math2988-lecture-notes-stucleaner.pdf

YM162 commented 2 days ago

Hi! Thanks for the heads up! It should be an easy fix.

Right now we remove all annotations, which are used for all the clickable elements (including the links in the watermarks), by setting them to an empty array in gulagcleaner_rs/src/models/method.rs, line 115.

There is probably a good way of filtering the "bad" annotations from the "good" ones, instead of removing all of them. This could be done either by checking the type of the annotation (URL, intra-document?) or by using some regex for the Studocu/Wuolah URLs.
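A rough sketch of how that filtering could look, written in Python for brevity rather than in the Rust of gulagcleaner_rs. It is not the project's code: annotations are modelled as plain dicts using the standard PDF keys (`Subtype`, `A`, `S`, `URI`, `Dest`), the helper names `is_ad_annotation` and `filter_annotations` are hypothetical, and the domain list in the regex is an assumption that would need checking against real watermarked files.

```python
import re

# Assumed ad domains (hypothetical list, not taken from the project).
AD_URL_PATTERN = re.compile(
    r"https?://([\w-]+\.)*(studocu|wuolah)\.com\b", re.IGNORECASE
)


def is_ad_annotation(annot: dict) -> bool:
    """Return True only for link annotations whose URI action matches an ad domain."""
    if annot.get("Subtype") != "Link":
        return False
    action = annot.get("A") or {}
    if action.get("S") != "URI":
        # GoTo actions and plain /Dest entries are intra-document links
        # (e.g. table-of-contents entries), so they are never treated as ads.
        return False
    return bool(AD_URL_PATTERN.search(action.get("URI", "")))


def filter_annotations(annots: list) -> list:
    """Keep every annotation except the ones flagged as ad links."""
    return [a for a in annots if not is_ad_annotation(a)]
```

With this approach, a table-of-contents entry that uses `Dest` or a `GoTo` action survives, as does an external link to an unrelated site; only URI links matching the pattern are dropped. Combining the type check with the regex is the more conservative of the two options, since a pure type check would also strip legitimate external links from the original document.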

I'll try to fix it in a couple of weeks if I have the time, but if someone else wants to give it a go before then, go ahead :)