KBlixt / subcleaner

removes ads from subtitle files cleanly.

hello from similar project - opensubtitles_adblocker.py #57

Closed milahu closed 7 months ago

milahu commented 7 months ago

hey : )

i started my opensubtitles_adblocker.py about a month ago as part of my opensubtitles-scraper project, since i could not find any adblocker for subtitles... not sure why i did not find this project earlier, i just found it via reddit

one difference: my adblocker works on raw bytes, because that is faster, and because sub files can have broken encoding: for example, utf8 and latin1 can appear in one file
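
for illustration, a minimal sketch of what byte-level filtering could look like (the file name and patterns are just made-up examples, and a real tool would also renumber the remaining blocks):

```python
# hypothetical sketch: match a byte-pattern blocklist against raw srt blocks,
# never decoding the file, so mixed utf8/latin1 content cannot break anything.
# file name and patterns are made-up examples.
import re

BAD_PATTERNS = [
    re.compile(rb"opensubtitles\.org", re.IGNORECASE),
    re.compile(rb"www\.[a-z0-9-]+\.(?:com|org)", re.IGNORECASE),
]

def is_ad_block(block: bytes) -> bool:
    """True if any blocklist pattern matches the raw subtitle block."""
    return any(p.search(block) for p in BAD_PATTERNS)

with open("movie.srt", "rb") as f:  # read bytes, never decode
    raw = f.read()

# srt blocks are separated by blank lines; this also works on bytes
blocks = re.split(rb"\r?\n\r?\n", raw)
kept = [b for b in blocks if not is_ad_block(b)]
print(f"removed {len(blocks) - len(kept)} of {len(blocks)} blocks")
```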

for my opensubtitles_adblocker_add.py i have forked pysubs2 to pysubs2bytes so i can parse subtitle files into raw bytestrings

lets collaborate? we could build a test corpus for ads in subtitles, based on the 6 million subs from opensubtitles.org

i would use one large srt file to store all the unwanted subs, and from that srt file we could derive other subtitle formats. that srt file would have a mix of many different encodings, so it would not be possible to read it as a text file

> i would use one large srt file to store all the unwanted subs

why srt? srt is the most popular format = 90% of all subs. frequency of subtitle formats in opensubtitles.org:

```
$ sqlite3 subtitles_all.txt.gz.db "select SubFormat, count(1) FROM subz_metadata GROUP BY SubFormat" | sort -n -k2 -t'|' -r

srt|5774302
sub|317934
ssa|179597
mpl|40845
tmp|17883
smi|7547
txt|6545
vtt|1998
```
KBlixt commented 7 months ago

Hi milahu 🙂

I'm not sure I fully grasp what your project is trying to achieve.

From what I understand, you have a subtitle repository that fetches and stores subtitles from opensubtitles, which you could then use as a privately hosted subtitle repository.

And you would like to build an ad-removal plugin for the subtitle repository. Is the idea to remove ads in the scraping process or on request?

My script is highly procedural since it runs a lot of regex and also does several passes checking for contextual clues that could be used to identify ad-blocks. This makes it quite robust against adjustments, since the ad detection is part regex and part logic.
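
As a rough outline only (not my actual code), combining a regex pass with a contextual pass could be sketched like this, with made-up patterns and thresholds:

```python
# Rough outline only, not subcleaner's actual code: a regex pass flags blocks
# that match known ad phrases, then a logic pass adds contextual suspects
# (e.g. neighbours of flagged blocks near the start or end of the file).
# Patterns and thresholds are made-up examples.
import re

AD_REGEXES = [
    re.compile(r"subtitles? (?:by|downloaded from)", re.I),
    re.compile(r"www\.[a-z0-9-]+\.(?:com|org)", re.I),
]

def regex_pass(blocks):
    """First pass: flag blocks that match a known ad phrase."""
    return {i for i, b in enumerate(blocks) if any(r.search(b) for r in AD_REGEXES)}

def context_pass(blocks, flagged, edge=3):
    """Second pass: neighbours of flagged blocks near the file edges are suspicious."""
    extra = set()
    for i in flagged:
        for j in (i - 1, i + 1):
            if 0 <= j < len(blocks) and (j < edge or j >= len(blocks) - edge):
                extra.add(j)
    return flagged | extra

def find_ads(blocks):
    return context_pass(blocks, regex_pass(blocks))
```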

However, my script doesn't look beyond the content of the subtitle file it's working on.

I think that if you did ad detection across a lot of subtitles (some subtitles would even be for the same movie), you could get really accurate at finding ads purely from pattern recognition and statistical analysis.
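
For example (purely hypothetical, not part of either project), counting identical lines across many subtitle files would surface likely ads, since normal dialogue rarely repeats verbatim across unrelated releases:

```python
# Purely hypothetical sketch of cross-subtitle statistics: lines shared
# verbatim by many files are likely ads. Directory name and threshold
# are made-up examples.
from collections import Counter
from pathlib import Path

file_counts = Counter()
for path in Path("subs").glob("*.srt"):
    lines = set()
    for line in path.read_text(errors="replace").splitlines():
        line = line.strip()
        if line and not line.isdigit() and "-->" not in line:
            lines.add(line)                 # skip index and timestamp lines
    file_counts.update(lines)               # count files, not occurrences

candidates = [l for l, n in file_counts.most_common() if n >= 50]
print("\n".join(candidates[:20]))           # top candidate ad phrases
```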

So the methodology would be entirely different from what my project does. You might be able to use the regex as phrases that we expect to appear as detected ads, in order to find false negatives.

But that would be an entirely different project from what I currently have. If done well it would outperform my project, since it makes decisions based on more data and would also rely less on manually crafted regex.

I'm not sure how much I'd be able to help though, since that would probably be a lot of data exploration, trying to figure out good statistical signals and making sure the performance isn't tanking. I guess one could use my script as a validator and look at where the two disagree.

Unfortunately I have a very different amount of available time for projects like these compared to a few years back. At the moment I mostly maintain this project: some regex changes, maybe some logic tweaking here and there.

Improving the performance of my tool would most likely mean a rewrite into something more performant. Java is the one I have experience with. But that would also require quite a bit of time.

I'd happily collaborate on something like this, but my time is very limited, so I'd expect that you'd have to do the vast majority of the work. I might be able to dedicate more time during the winter, but for me this is a bad time to begin a new project. I'll happily provide feedback and help out if you get stuck, though. But calling that a collaboration is a bit of a stretch 😅

If you want, I could add a flag that you could set that would export the data gathered during cleaning in a more API-friendly manner. But you'd have to specify what data you'd like to have and how it should be formatted.

I'd also prefer to have this discussion under the discussions section and not as an issue. And maybe you could clarify there exactly what you envision the ad-detection solution in your project would look like.

/ KBlixt

milahu commented 7 months ago

wall of text! ^^

im really just here to propose a shared test corpus

ideally, different subcleaners should pass all tests in that test corpus, and users should contribute new test cases to that shared corpus (users want clean subs, we just provide the tools)

> what your project is trying to achieve

scraping opensubtitles, to give free access to anyone

> Is the idea to remove ads in the scraping process or on request?

removing ads is done on the client side, after downloading the subs. this is done to reduce cpu load on the server, and also because the adblocker is the most unstable part, because the blocklist will always change

> My script is highly procedural since it runs a lot of regex

i suspect that my solution is faster, because i use one large regex. but performance has low priority when running this on 10...100 sub files. 6 million subs would be different, but no one will do that, it's too unstable

> You could get really accurate about finding ads purely from pattern recognition and statistical analysis.

please no "machine learning magic" ... some ads are simple to find: they are the very first or very last lines in the sub file, and they have an above-average duration of 5 seconds or longer, while normal sub lines have only 2 or 3 seconds. also, patterns like `www.` or `.com` or `.org` are suspicious. i use these `bad_words` in [opensubtitles_adblocker_add.py](https://github.com/milahu/opensubtitles-scraper/raw/main/opensubtitles_adblocker_add.py) to suggest new patterns for the blocklist, but autodetection is too risky, so my adblocker has only regex patterns
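
for illustration, those heuristics (position, above-average duration, url-like patterns) could be sketched roughly like this; the file name, thresholds and word list are just examples, not the actual blocklist logic:

```python
# rough sketch of the heuristics above, not the actual blocklist logic:
# flag events near the start/end of the file with above-average duration
# and url-like text. file name, thresholds and word list are just examples.
import re
import pysubs2

BAD_WORDS = re.compile(r"www\.|\.com\b|\.org\b|opensubtitles", re.I)

subs = pysubs2.load("movie.srt")
for i, event in enumerate(subs):
    near_edge = i < 3 or i >= len(subs) - 3     # very first or very last lines
    long_duration = event.duration >= 5000      # 5 s or longer, vs the usual 2-3 s
    suspicious = bool(BAD_WORDS.search(event.plaintext))
    if near_edge and long_duration and suspicious:
        print(f"suspected ad at #{i}: {event.plaintext!r}")
```
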
KBlixt commented 7 months ago

Ok, the regex I have produced for my tool would be a good start for you as well then, especially the purge sections, since those are generally dead-giveaway phrases. If you find any other regex, feel free to let me know, but I need to keep the format that already exists in my language profiles.

If you haven't looked at the regex yet, they are found under /language_profiles/default and are designed to be replaced/complemented if the user wants to do so. For me it was a key design aspect that they weren't hardcoded or anything.