LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0
13.29k stars 882 forks

Add CleanURL Rules as a Submodule to be used in sanitizing URLs. #4905

Closed The-RedWizard closed 2 months ago

The-RedWizard commented 4 months ago

Requirements

Is your proposal related to a problem?

Currently, Lemmy attempts to clean URLs based on its own rules. Instead, I think it would be great if we could adopt the crowdsourced rules created for the ClearURLs extension. Considering Lemmy has such a large user base with a vested interest in scrubbing URLs from their respective platforms, we could contribute back to the ClearURLs ruleset in a large way.
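To make the idea of rule-driven cleaning concrete, here is a minimal sketch of how a provider-based ruleset could be applied to a URL. The `RULES` dictionary, its keys, and the `clean_url` function are hypothetical names invented for illustration; the entries are only loosely modeled on the provider/rules layout of the ClearURLs data and are not copied from it, and this is not how Lemmy currently implements its cleaning.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rule set (assumption): each provider has a regex that selects
# the URLs it applies to, plus regexes for query-parameter names to drop.
RULES = {
    "generic": {
        "url_pattern": r"^https?://",
        "params": [r"utm_[a-z]+", r"fbclid", r"gclid"],
    },
    "youtube": {
        "url_pattern": r"^https?://(?:www\.)?(?:youtube\.com|youtu\.be)/",
        "params": [r"si", r"feature"],
    },
}


def clean_url(url: str, rules: dict = RULES) -> str:
    """Drop query parameters whose names fully match a rule of a matching provider."""
    # Collect the parameter patterns of every provider whose URL pattern matches.
    patterns = [
        re.compile(param, re.IGNORECASE)
        for provider in rules.values()
        if re.search(provider["url_pattern"], url, re.IGNORECASE)
        for param in provider["params"]
    ]
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not any(pattern.fullmatch(name) for pattern in patterns)
    ]
    # Rebuild the URL with only the surviving parameters; path and fragment are untouched.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(clean_url("https://example.com/a?utm_source=x&id=7"))
print(clean_url("https://youtu.be/abc?si=track"))
```

One caveat of this sketch: `urlencode` re-encodes the parameters that are kept, so their percent-encoding may change slightly (for example, `%20` becomes `+`). A shared, crowdsourced ruleset would replace the hard-coded `RULES` above.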

Describe the solution you'd like.

Per this thread conversation (https://hexbear.net/comment/5136579), I've created a pull request that adds the Rules repo under Modules\Rules as a submodule to the Lemmy repo. This could be implemented either as default behavior or as an optional feature for Lemmy admins. Ultimately, I think it makes good sense not to reinvent the wheel when it comes to URL sanitization.

Describe alternatives you've considered.

Initially, I thought of solving this issue via a bot that either DMs the OP of a post or comments on the post with the sanitized URL, but since there is already some level of sanitization happening, it feels right to put this directly within the backend.

Additional context

This is my first real pull request, so please let me know if I'm not following proper procedure, or if I misunderstood the conversation within that thread.

dessalines commented 4 months ago

This would be a great opportunity for someone to build a proper Rust crate to do this, one that could be used by many projects.

I created a lemmy post about this: https://lemmy.ml/post/18162485

jendrikw commented 3 months ago

I wrote something in the last couple days.

https://crates.io/crates/clearurls

Let me know if it fits your needs or if you need anything else. Issues and PRs welcome.

TheKindMrUlyanov commented 3 months ago

One issue that the ClearURLs rules may not cover is links that exist in a rainbow-table and are obfuscated by default, such as reddit.com/r/sub/s/gibberish and vm|vt.tiktok.com/gibberish links. The easiest fix is obviously to just open the URL first and see where it redirects to. The issue then becomes: who will be opening the URL? Is it the instance? If it is, there would be a very clear pattern and signal to these companies that a network of users exists there, because one consistent IP or group of IPs would be deobfuscating every single rainbow-table link on Lemmy.

Just want to clarify that a 90%-there solution is better than no solution. It would be acceptable even if the aforementioned problem still exists.

The solution I used for the bot (from the thread this was linked from) is to just open the URLs, but the bot is hosted from the IP range of a major VPN provider, so I hope that the organic traffic from the VPN users will disrupt any graph that companies would build.
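For reference, "opening the URL to see where it redirects" can be sketched roughly as below. This is an illustration under assumptions, not the bot's actual code: `fetch_location` and `resolve` are hypothetical names, the sketch uses HEAD requests (which some servers reject), and it does nothing about the question of whose IP makes the request.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def fetch_location(url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the Location header if `url` answers with a redirect, else None."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout):
            return None  # a non-redirect response: this is the final URL
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        return None


def resolve(
    url: str,
    fetch: Callable[[str], Optional[str]] = fetch_location,
    max_hops: int = 5,
) -> str:
    """Follow at most `max_hops` redirects and return the last URL reached."""
    seen = {url}
    for _ in range(max_hops):
        location = fetch(url)
        if location is None:
            return url
        next_url = urljoin(url, location)  # Location may be relative
        if next_url in seen:
            return url  # redirect loop, stop here
        seen.add(next_url)
        url = next_url
    return url
```

The `fetch` parameter is injectable so the hop logic can be exercised without network access, for example `resolve("https://short.example/s/abc", fetch=table.get)` with a dictionary of known redirects. The resolved URL would still need to go through the rule-based cleaning afterwards, since share links often redirect to URLs that carry tracking parameters.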

dessalines commented 2 months ago

@jendrikw We'd be able to use the crate, but are dependent on https://github.com/jendrikw/clearurls/issues/3, since our comments often contain links that also need to be stripped of tracking.
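The requirement in the last comment, cleaning links embedded in comment text rather than a single URL field, could look roughly like the sketch below. It is only an illustration: `clean_links_in_text` and `strip_utm` are hypothetical names, `strip_utm` is a stand-in for a real rule-based cleaner, and the regex is a simplification that does not parse markdown properly (for example, URLs that themselves contain parentheses would be cut short).

```python
import re
from typing import Callable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Simplified URL matcher (assumption): stops at whitespace, angle brackets,
# parentheses and square brackets so markdown link syntax stays intact.
URL_IN_TEXT = re.compile(r"https?://[^\s<>()\[\]]+")
TRAILING_PUNCTUATION = ".,;:!?"


def strip_utm(url: str) -> str:
    """Stand-in cleaner: drops only query parameters starting with 'utm_'."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.startswith("utm_")
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_links_in_text(text: str, clean_url: Callable[[str], str] = strip_utm) -> str:
    """Apply `clean_url` to every URL found in `text`, leaving other text untouched."""

    def replace(match: re.Match) -> str:
        raw = match.group(0)
        # Sentence punctuation right after a URL is not part of the link.
        url = raw.rstrip(TRAILING_PUNCTUATION)
        return clean_url(url) + raw[len(url):]

    return URL_IN_TEXT.sub(replace, text)


comment = "See [docs](https://example.com/a?utm_source=x&id=1) and https://example.com/b?utm_medium=y."
print(clean_links_in_text(comment))
```

A more robust version would walk the parsed markdown and rewrite link destinations there, instead of matching URLs with a regex.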