Closed milahu closed 7 months ago
Hi milahu 🙂
I'm not sure I fully grasp what your project is trying to achieve.
From what I understand, you have a subtitle repository that fetches and stores subtitles from opensubtitles, which you could then use as a privately hosted subtitle repository.
And you would like to build an ad-removal plugin for the subtitle repository. Is the idea to remove ads in the scraping process or on request?
My script is highly procedural since it runs a lot of regex and also does several passes checking for contextual clues that could be used to identify ad-blocks. This makes it quite robust against adjustments since the ad detection is part regex and part logic.
However, my script only looks at the content of the subtitle that it's working on.
I think that if you did ad detection across a lot of subtitles, some of which would even be for the same movie, you could get really accurate at finding ads purely from pattern recognition and statistical analysis.
So the methodology would be entirely different from what my project does. You might be able to use the regex as phrases that we expect to appear as detected ads in order to find false negatives.
But that would be an entirely different project from what I currently have. If done well it would outperform my project since it makes decisions based on more data and would also rely less on manually crafted regex.
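To make the cross-subtitle idea a bit more concrete, here is a minimal sketch of one possible statistical signal: a line that shows up in many unrelated subtitle files is more likely an ad than dialogue. The function names, the normalization, and the `min_share` threshold are all assumptions made for illustration; none of this is taken from either project.

```python
from collections import Counter


def normalize(line: str) -> str:
    # Lowercase and collapse whitespace so trivial variations count as the same line.
    return " ".join(line.lower().split())


def suspicious_lines(subtitles: dict[str, list[str]], min_share: float = 0.5) -> set[str]:
    """Return normalized lines that occur in at least min_share of all subtitle files."""
    doc_freq = Counter()
    for lines in subtitles.values():
        # Count each line once per file (document frequency, not raw count).
        doc_freq.update({normalize(line) for line in lines if line.strip()})
    threshold = min_share * len(subtitles)
    return {line for line, count in doc_freq.items() if count >= threshold}
```

Common short dialogue lines ("Yes.", "What?") would also score high with a signal this crude, so a real version would probably need a length filter or a comparison against the existing regex before trusting it.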
I'm not sure how much I'd be able to help though, since that would probably involve a lot of data exploration, trying to figure out good statistical signals, and making sure the performance isn't tanking. I guess one could use my script as a validator and look at the cases where the two don't agree.
Unfortunately I have much less time available for projects like these compared to a few years back. At the moment I mostly maintain this project. Some regex changes, maybe some logic tweaking here and there.
Improving the performance of my tool would most likely mean a rewrite into something more performant. Java is the one I have experience with. But that would also require quite a bit of time.
I'd happily collaborate on something like this, but my time is very limited so I'd expect that you'd have to do the vast majority of the work. I might be able to dedicate more time during the winter. But for me this is a bad time to begin a new project. I'll happily provide feedback and help out if you get stuck though. But calling that a collaboration is a bit of a stretch 😅
If you want, I could add a flag that you could set that would export the data gathered during cleaning in a more API-friendly manner. But you'd have to request what data you'd like to have and how it should be formatted.
I'd also prefer to have this discussion under the discussion section and not as an issue. And maybe you could clarify there exactly what you envision the solution to ad-detection in your project would be.
/ KBlixt
wall of text! ^^
im really just here to propose a shared test corpus
ideally, different subcleaners should pass all tests in that test corpus and users should contribute new test cases to that shared test corpus (users want clean subs, we just provide the tools)
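a rough sketch of what running different subcleaners against such a shared corpus could look like. the `Case` layout, the two sample entries and `toy_detector` are hypothetical and only here for illustration, the real corpus format is still to be decided:

```python
import re
from typing import Callable, Iterable, NamedTuple


class Case(NamedTuple):
    text: str    # text of one subtitle block
    is_ad: bool  # expected verdict


# Hypothetical corpus entries, for illustration only.
CORPUS = [
    Case("Subtitles downloaded from www.example.org", True),
    Case("I'll meet you at the station.", False),
]


def failures(detector: Callable[[str], bool], corpus: Iterable[Case]) -> list[Case]:
    """Return the cases where the detector disagrees with the expected verdict."""
    return [case for case in corpus if detector(case.text) != case.is_ad]


def toy_detector(text: str) -> bool:
    # Toy stand-in for a real subcleaner; not the logic of any existing tool.
    return re.search(r"www\.|downloaded from", text, re.IGNORECASE) is not None
```

a subcleaner passes when `failures(detector, CORPUS)` comes back empty, and a new user-contributed test case is just one more `Case` entry.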
Ok, the regex I have produced for my tool would be a good start for you as well then, especially the purge sections since those are generally dead give-away phrases. If you find any other regex feel free to let me know, but I need to keep the format that already exists in my language profiles.
If you haven't looked at the regex yet, they are found under /language_profiles/default and are designed to be replaced/complemented if the user wants to do so. For me it was a key design aspect that they weren't hardcoded or anything.
hey : )
i started my opensubtitles_adblocker.py about a month ago as a part of my opensubtitles-scraper project since i could not find any adblocker for subtitles... not sure why i did not find this project earlier, i just found it via reddit
one difference: my adblocker works on raw bytes, because that is faster and because sub files can have broken encoding, for example utf8 and latin1 can appear in one file
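a small sketch of why matching on raw bytes helps here, using a bytes pattern from python's `re` module. the pattern and the `is_ad` helper are made up for illustration and are not the actual code from opensubtitles_adblocker.py:

```python
import re

# Hypothetical ad pattern, compiled as a bytes pattern so no decoding is needed.
# With bytes patterns, IGNORECASE only affects ASCII letters.
AD_PATTERN = re.compile(
    rb"opensubtitles\.org|www\.[a-z0-9-]+\.(?:com|org|net)",
    re.IGNORECASE,
)


def is_ad(block: bytes) -> bool:
    """Return True if the raw subtitle block matches the ad pattern."""
    return AD_PATTERN.search(block) is not None


# One "file" mixing utf8 and latin1: decoding it as utf8 would raise,
# but the bytes pattern still matches the ASCII ad text.
mixed = "Café".encode("utf-8") + b"\n" + "Café".encode("latin-1") + b"\nwww.example.com"
```

`is_ad(mixed)` works without ever choosing an encoding, while `mixed.decode("utf-8")` fails on the latin1 part.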
for my opensubtitles_adblocker_add.py i have forked pysubs2 to pysubs2bytes so i can parse subtitle files into raw bytestrings
lets collaborate? we could build a test corpus for ads in subtitles based on the 6 million subs from opensubtitles.org

i would use one large srt file to store all the unwanted subs, and from that srt file we could derive other subtitle formats

that srt would have a mix of many different encodings, so it would not be possible to read it as a text file

why srt? srt is the most popular format = 90% of all subs (see: frequency of subtitle formats in opensubtitles.org)