RSS-Bridge / rss-bridge

The RSS feed for websites missing it
https://rss-bridge.org/bridge01/
The Unlicense

Google Search bridge: don't return already found URLs #2263

Open siccovansas opened 2 years ago

siccovansas commented 2 years ago

The Google Search bridge is awesome, but I often get results that I've already seen. It would be great if there were an option that keeps a list of already returned URLs so they aren't shown a second time.

dvikan commented 2 years ago

For some reason I'm only getting 3 results.

yamanq commented 2 years ago

This could likely be solved by setting the URL as the uid. Then, feed clients can keep track of them internally (as many do).
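As a rough illustration of what such a client does internally, here is a minimal sketch. The `filter_unseen` helper and the dict-based item shape are hypothetical; they are not taken from RSS-Bridge or from any particular feed reader.

```python
def filter_unseen(items, seen_guids):
    """Return only items whose GUID is new, recording each new GUID.

    `items` is a list of dicts with a "guid" key (assumed shape),
    `seen_guids` is a set the client keeps between feed refreshes.
    """
    fresh = []
    for item in items:
        guid = item["guid"]
        if guid in seen_guids:
            continue  # already shown on an earlier refresh
        seen_guids.add(guid)
        fresh.append(item)
    return fresh
```

This only works as long as the same result keeps the same GUID across refreshes, which is why using the result URL as the uid matters.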

Bockiii commented 2 years ago

I think the "problem" here is Google and its AMP service. This is the result from the current bridge in RSS:

    <item>
        <title>rss-bridge - bytemeta</title>
        <link>https://bytemeta.vip/repo/RSS-Bridge/rss-bridge/issues?after=Y3Vyc29yOnYyOpK5MjAyMi0wMi0xM1QwMTo1MDozNCswODowMM5DnPsU&amp;before=&amp;page=11</link>
        <guid isPermaLink="true">https://bytemeta.vip/repo/RSS-Bridge/rss-bridge/issues?after=Y3Vyc29yOnYyOpK5MjAyMi0wMi0xM1QwMTo1MDozNCswODowMM5DnPsU&amp;before=&amp;page=11</guid>
        <pubDate>Mon, 09 May 2022 00:00:00 +0000</pubDate>
        <description> rss-bridge repo issues. ... Bridge request for CBC Editor's Blog. doowruc. doowruc CLOSED · Updated 2 months ago · NASA APOD Bridge failed with error 0. </description>

    </item>

As you can see, the GUID is already the link, which should give exactly the behavior we want.

Other entries look fine:

    <item>
        <title>faq | bubbletea.dev</title>
        <link>https://bubbletea.dev/faq/</link>
        <guid isPermaLink="true">https://bubbletea.dev/faq/</guid>
        <pubDate>Wed, 09 Mar 2022 00:00:00 +0000</pubDate>
        <description> archivebox; booru; cannery; gitea; ipfs; irc; mealie; minecraft; misskey; mumble; neko; nextcloud; peertube; revolt; rss; rss-bridge; searx; start; wakapi ... </description>

    </item>

This should only come up once as the GUID shouldn't change.

So I think this is already implemented; it just sometimes doesn't work because the Google link changes. I don't think there is an easy solution for this.

I would close this. Do you agree, @siccovansas?

yamanq commented 2 years ago

Would it work to strip anything after ? in the URL? I'm not aware of search results that also need query parameters.

Bockiii commented 2 years ago

You can see a query parameter in my first example. That's the first item when you search for rss-bridge.

I would close this, as the requested behavior is basically already implemented. If the links sometimes change because of how Google does things, I don't think it makes sense to fight that never-ending battle just for a single "mark as read" click every once in a while :)

yamanq commented 2 years ago

Oops, I responded in too much of a hurry, so my previous response came out unclear. I haven't seen any examples where the search result actually needs the query parameters, as they seem to be injected by Google itself. Because of that, I think this issue can be resolved by adding a checkbox parameter that removes query parameters from the URL, which would keep the URLs stable. The help text for the checkbox would explain its intended purpose.
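To make the idea concrete, here is a minimal sketch of the normalization such a checkbox could apply before a URL is used as the item's link and uid. It is written in Python for brevity rather than in the bridge's own code, `strip_query` is a hypothetical name, and the cursor values in the usage lines are made up.

```python
from urllib.parse import urlsplit


def strip_query(url):
    """Return `url` with its query string removed.

    Scheme, host, path and fragment are left untouched.
    """
    return urlsplit(url)._replace(query="").geturl()


# Two variants of the same result that differ only in the query string
# (cursor values invented for illustration):
a = "https://bytemeta.vip/repo/RSS-Bridge/rss-bridge/issues?after=abc&before=&page=11"
b = "https://bytemeta.vip/repo/RSS-Bridge/rss-bridge/issues?after=xyz&before=&page=12"
same = strip_query(a) == strip_query(b)  # both collapse to the bare issues URL
```

The trade-off is that pages which really are distinguished only by their query string would collapse into a single item, which is a reason to keep this opt-in and off by default.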