C0D3D3V / Moodle-DL

Moodle-DL downloads course content fast from Moodle (eg. lecture pdfs)
GNU General Public License v3.0
442 stars 63 forks source link

Add Regex filtering for black and whitelists #183

Open Udinanon opened 1 year ago

Udinanon commented 1 year ago

Description of the problem

I want to filter files being downloaded based on the URLs in detail, to avoid downloading unwanted content such as videos My university hosts a lot of content by itself, so I can't just do a domain filtering, i need to filter the URLs

Solution

Regex would be a very flexible tool to do it and would expand the capabilities of the tool while using a very common technology

I want to add a download_domains_blacklist_regex to the options, i think that's all in config.py Then a small edit to the function is_filtered_external_domain in task.py, checking the full URLs against this Regex list, and then filtering by hostname using the old version of the blacklist I might have to also add the option properties in the types.py dataclass

This should be all the needed steps to add the option correctly, although some of this is based on notes from a previous version of the library I'll likely start working on a PR in the next few days, please tell me if anything here is incorrect or inaccurate

Alternatives

Perhaps the new filtering could cover the previous domain method, but that would break backwards compatibility unnecessarily Other text schemes are possible, but Regex is extremely popular and well implemented in Python

C0D3D3V commented 1 year ago

The idea sounds fine to me, and was requested multiple times before. I think a lot of people will like it. And you are welcome to implement it :)

But please tell me more what kind of content you want to filter and how it is listed in your moodle? E.g. how does the URL pattern looks like you want to filter?

This filter is definitely good, but I'm not sure if it will do what you want :D for example: https://github.com/C0D3D3V/Moodle-DL/blob/0bfc7596c74aaf7923e82d7c9685c74a3aa81a5c/moodle_dl/moodle/result_builder.py#L87

There is no filter for LTI Modules, like kalvidres, helixmedia, opencast ... these will not be filtered by these patterns because they are "cookie mods" (maybe someday someone implements them as normal mod) and handled here: https://github.com/C0D3D3V/Moodle-DL/blob/0bfc7596c74aaf7923e82d7c9685c74a3aa81a5c/moodle_dl/downloader/task.py#L723

But anyway this regex filter is a good idea, and more filters for other parts of moodle-dl will come eventually.

Udinanon commented 1 year ago

Well, what i wanted to filer was exactly kalvidres content , so this is a bit of a bummer Perhaps we could also filter requests going inside external_download_url directly, applying the same blacklist/whitelist?

C0D3D3V commented 1 year ago

We could also add a regex filter to the end of filter_courses in moodle_service.py I do not really like the idea but we can do it. You can do that instead of adding it to the external filter only. So all urls will be filtered.

C0D3D3V commented 1 year ago

We will also somday add filters for modules, so that kalvidres module can also be filtered

C0D3D3V commented 1 year ago

If you add a filter to filter_courses in moodle_service.py you can filter for r'https://your.moodle.com/mod/kalvidres/view.php\?.*' that should do the trick. Adding a filter option only for kalvidres module will take some more time, because I want to thing about a good solution, that we can apply to all mods

C0D3D3V commented 1 year ago

In principle related to #135