LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

more flexible searching by post URL, partial URL searching #4748

Open Die4Ever opened 3 months ago

Die4Ever commented 3 months ago

Is your proposal related to a problem?

Currently, if I want to check whether something has already been posted, I have to match the URL exactly, but URLs can vary in small ways.

YouTube example:

https://www.youtube.com/watch?v=pd5iofvLrIU

https://youtu.be/pd5iofvLrIU

https://youtu.be/pd5iofvLrIU?si=QO2iK1Zw0Z7NDRHo

Describe the solution you'd like.

If I could search by a part of the URL, like just pd5iofvLrIU, then it would find them all. I guess it would be a text index that splits words on characters like :, /, ?, =, and &. For known domains, this could probably even happen automatically, with the matches shown as suggested or lower-ranked search results. This could also be used to search by domain name.

Also, when making a post, Lemmy currently searches by title, which helps reduce reposts or gets people to crosspost instead of repost. It would be cool if it also did that automatic search by URL.
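To make the idea concrete, here is a minimal Python sketch of splitting URLs on the characters listed above and matching a query against the resulting tokens. It is only an illustration, not how Lemmy's search works: the names url_tokens and matches_partial are hypothetical, and a real implementation would more likely live in a database text index than in application code.

```python
import re

# Separator characters taken from the proposal: ':', '/', '?', '=' and '&'.
URL_SEPARATORS = re.compile(r"[:/?=&]+")


def url_tokens(url: str) -> set[str]:
    """Split a URL into the tokens a partial-URL search could index."""
    return {tok for tok in URL_SEPARATORS.split(url) if tok}


def matches_partial(url: str, query: str) -> bool:
    """Return True if the query equals one of the URL's tokens."""
    return query in url_tokens(url)


# The three YouTube URLs from the example above.
urls = [
    "https://www.youtube.com/watch?v=pd5iofvLrIU",
    "https://youtu.be/pd5iofvLrIU",
    "https://youtu.be/pd5iofvLrIU?si=QO2iK1Zw0Z7NDRHo",
]

# Searching by the video ID alone matches every variant.
hits = [u for u in urls if matches_partial(u, "pd5iofvLrIU")]
```

Because the host name survives as a single token, the same lookup also covers searching by domain, for example matches_partial(url, "youtu.be").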

Describe alternatives you've considered.

This could maybe be a plugin?

Additional context

No response

Die4Ever commented 3 months ago

Some related issues:

https://github.com/LemmyNet/lemmy-ui/issues/1922

https://github.com/LemmyNet/lemmy-ui/issues/154

https://github.com/LemmyNet/lemmy/issues/2542

dessalines commented 3 months ago

I don't see how we'd be able to do this, since there is an unlimited number of variations of these URLs, and there are different definitions of what constitutes a match and what doesn't.

Die4Ever commented 1 month ago

It might also be good to normalize YouTube URLs, since there are a lot of different formats they can end up in and that breaks crosspost detection. Probably the only query params that need to be respected are the video ID, playlist, index, and timestamp? The si query parameter can be thrown away; I think it's just a tracking ID for sharing.
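A rough Python sketch of that normalization, for illustration only. The function name normalize_youtube_url is hypothetical, and the parameter names v, list, index, and t are assumed to correspond to the video ID, playlist, index, and timestamp mentioned above. Only youtu.be short links and /watch URLs are handled; anything else is returned unchanged.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed parameter names for video ID, playlist, index and timestamp.
KEPT_PARAMS = ("v", "list", "index", "t")
# Assumed set of hosts that serve the long /watch form.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def normalize_youtube_url(url: str) -> str:
    """Rewrite recognized YouTube video URLs into one canonical form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    query = parse_qs(parts.query)

    if host == "youtu.be":
        # Short links carry the video ID in the path.
        video_id = parts.path.lstrip("/")
        if video_id:
            query["v"] = [video_id]
    elif host not in YOUTUBE_HOSTS or parts.path != "/watch":
        return url  # Not a URL shape this sketch understands.

    if "v" not in query:
        return url  # No video ID found, so leave the URL alone.

    # Keep only the allowed parameters, in a fixed order, dropping e.g. "si".
    kept = [(name, query[name][0]) for name in KEPT_PARAMS if name in query]
    return "https://www.youtube.com/watch?" + urlencode(kept)


canonical = normalize_youtube_url("https://youtu.be/pd5iofvLrIU?si=QO2iK1Zw0Z7NDRHo")
```

Emitting the kept parameters in a fixed order means two URLs that differ only in parameter order or tracking parameters compare equal after normalization.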

Nutomic commented 2 weeks ago

We could handle this by resolving the post URL and following redirects, e.g. curl -Ls -o /dev/null -w %{url_effective} "https://youtu.be/pd5iofvLrIU?si=QO2iK1Zw0Z7NDRHo". This gives https://www.youtube.com/watch?si=QO2iK1Zw0Z7NDRHo&v=pd5iofvLrIU&feature=youtu.be, which isn't fully normalized, but the advantage is that it works for all websites.

Otherwise we would need a Rust library to normalize YouTube URLs, but I couldn't find one on crates.io.

dessalines commented 1 week ago

That one would be handled when we eventually add the clearurls crate, as discussed in #4905