derat / nitter-rss-proxy

Moved to https://codeberg.org/derat/nitter-rss-proxy
BSD 3-Clause "New" or "Revised" License

Some Nitter URLs in tweets aren't rewritten #13

Closed (cxplay closed this issue 1 year ago)

cxplay commented 1 year ago

Sorry to bother you again. I hope that Nitter links in the tweet text, such as username links and links to other tweets, can also be rewritten to the original Twitter links.

derat commented 1 year ago

Sorry, I don't understand -- Nitter links should already be rewritten by default. Can you provide an example?

cxplay commented 1 year ago

Oh yes. You can check the Twitter user AnimalsHolbox, then search for the keyword "nitter". In the tweet with ID https://twitter.com/AnimalsHolbox/status/1641392056848330754, some hashtags are not rewritten:

  <entry>
    <title>These cruel and useless experiments were done by #UWMadison to receive more tha…</title>
    <updated>2023-03-30T10:48:56Z</updated>
    <id>https://twitter.com/AnimalsHolbox/status/1641392056848330754</id>
    <content type="html">&lt;p&gt;These cruel and useless experiments were done by &lt;a href=&#34;https://nitter.privacydev.net/search?q=%23UWMadison&#34;&gt;#UWMadison&lt;/a&gt; to receive more than $3 million in tax money through the &lt;a href=&#34;https://nitter.privacydev.net/search?q=%23NIH&#34;&gt;#NIH&lt;/a&gt; : 💰 💰 💰 💷 💳&lt;/p&gt;&lt;br&gt;&lt;img src=&#34;https://pbs.twimg.com/ext_tw_video_thumb/928220573418717184/pu/img/hi68jyf983uMDpwe.jpg&#34; style=&#34;max-width:250px;&#34; /&gt;</content>
    <link href="https://twitter.com/AnimalsHolbox/status/1641392056848330754" rel="alternate"></link>
    <summary type="html">&lt;p&gt;These cruel and useless experiments were done by &lt;a href=&#34;https://nitter.privacydev.net/search?q=%23UWMadison&#34;&gt;#UWMadison&lt;/a&gt; to receive more than $3 million in tax money through the &lt;a href=&#34;https://nitter.privacydev.net/search?q=%23NIH&#34;&gt;#NIH&lt;/a&gt; : 💰 💰 💰 💷 💳&lt;/p&gt;&lt;br&gt;&lt;img src=&#34;https://pbs.twimg.com/ext_tw_video_thumb/928220573418717184/pu/img/hi68jyf983uMDpwe.jpg&#34; style=&#34;max-width:250px;&#34; /&gt;</summary>
    <author>
      <name>@AnimalsHolbox</name>
    </author>
  </entry>
cxplay commented 1 year ago

I also noticed that this might be a problem with the instance, because unprocessed Nitter links don't show up every time. I'm not sure... Sometimes it's a picture, sometimes it's a username or a link to another tweet referenced within a tweet, which is very strange.

derat commented 1 year ago

Thanks for the details.

In the account that you gave, it looks like search URLs like https://nitter.privacydev.net/search?q=%23UWMadison aren't being rewritten. The proxy doesn't currently rewrite /search URLs (you can see the list of rewrites in rewritePatterns). I'm worried that it might be hard to rewrite these in a generic way without also breaking some non-Nitter links, since /search?q=... seems like a common pattern. Maybe /search?q=#... (note the #) would be fairly safe, though.

Can you provide some examples of non-search URLs that also aren't being rewritten? Those might be safer to fix.
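
A minimal sketch of the narrower rewrite suggested above, which only touches /search URLs whose query starts with an encoded # (%23). The name rewriteHashtagSearch and the regular expression are hypothetical and are not taken from the proxy's rewritePatterns; mapping to https://twitter.com/search?q=%23... is likewise an assumption about the desired target.

```go
package main

import (
	"fmt"
	"regexp"
)

// hashtagSearchRe is a hypothetical pattern (not from the proxy's source).
// It matches any host followed by /search?q=%23..., i.e. a hashtag search.
// The query value stops at '&', '"' or whitespace so that it also works on
// HTML-escaped content such as href=&#34;...&#34;.
var hashtagSearchRe = regexp.MustCompile(`https?://[^/"\s]+/search\?q=(%23[^&"\s]+)`)

// rewriteHashtagSearch replaces hashtag search URLs with twitter.com ones
// and leaves every other URL untouched.
func rewriteHashtagSearch(s string) string {
	return hashtagSearchRe.ReplaceAllString(s, "https://twitter.com/search?q=$1")
}

func main() {
	fmt.Println(rewriteHashtagSearch("https://nitter.privacydev.net/search?q=%23UWMadison"))
	fmt.Println(rewriteHashtagSearch("https://example.org/search?q=cats"))
}
```

Because the pattern requires %23 right after q=, an ordinary search link such as /search?q=cats is left alone, which is the safety property discussed above.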

cxplay commented 1 year ago

Sure, but I think it uses a random instance by default, so I can only find examples this way: by using an account with a lot of valid inline tweet links, like nasa. As I said earlier, the problem doesn't always occur on a fixed instance, so it takes several refreshes to find one that isn't handled. This way, I found several instances (nitter.kylrth.com, nitter.poast.org, nitter.fdn.fr) where almost none of the links were processed correctly.

derat commented 1 year ago

Note that you can pass e.g. -instances https://nitter.kylrth.com to use a specific instance. Maybe some of the instances are formatting links in such a way that they aren't matched by the proxy's regular expressions.

cxplay commented 1 year ago

This is indeed a solution, but it loses the support of twiiit.com, which is designed to avoid single points of failure. Does the -instances parameter support specifying multiple instances to rotate through?

derat commented 1 year ago

Yes, you can supply multiple comma-separated instances via -instances, e.g. -instances https://n1.example.org,https://n2.example.org.

But just to be clear, I was just suggesting -instances so you could use it to provide more examples of URLs that aren't being rewritten correctly.

cxplay commented 1 year ago

Okay, I get it.

cxplay commented 1 year ago

Well, I've set up a new test instance. Besides setting the timeout parameter to 20, I passed 83 valid instances, filtered from the instance list, via the instances parameter. You can check the links here: https://nitter-rss-proxy-mod-v2.fly.dev/nasa. In my own testing, almost none of the Nitter links are handled correctly.

Instances in use:

https://nitter.lacontrevoie.fr,https://nitter.1d4.us,https://nitter.kavin.rocks,https://nitter.unixfox.eu,https://birdsite.xanny.family,https://nitter.moomoo.me,https://twitter.censors.us,https://nitter.grimneko.de,https://twitter.076.ne.jp,https://nitter.fly.dev,https://notabird.site,https://nitter.weiler.rocks,https://nitter.sethforprivacy.com,https://nitter.cutelab.space,https://nitter.nl,https://nitter.mint.lgbt,https://nitter.bus-hit.me,https://nitter.esmailelbob.xyz,https://tw.artemislena.eu,https://nitter.tiekoetter.com,https://nitter.spaceint.fr,https://nitter.privacy.com.de,https://nitter.poast.org,https://nitter.bird.froth.zone,https://nitter.dcs0.hu,https://twitter.dr460nf1r3.org,https://nitter.garudalinux.org,https://twitter.femboy.hu,https://nitter.privacydev.net,https://nitter.kylrth.com,https://nitter.foss.wtf,https://unofficialbird.com,https://nitter.projectsegfau.lt,https://nitter.eu.projectsegfau.lt,https://singapore.unofficialbird.com,https://canada.unofficialbird.com,https://india.unofficialbird.com,https://nederland.unofficialbird.com,https://uk.unofficialbird.com,https://nitter.qwik.space,https://read.whatever.social,https://nitter.rawbit.ninja,https://nitter.privacytools.io,https://nitter.sneed.network,https://n.sneed.network,https://nitter.smnz.de,https://nitter.twei.space,https://nitter.inpt.fr,https://nitter.d420.de,https://nitter.caioalonso.com,https://nitter.at,https://nitter.pw,https://nitter.nicfab.eu,https://bird.habedieeh.re,https://nitter.hostux.net,https://nitter.adminforge.de,https://nitter.platypush.tech,https://nitter.pufe.org,https://nitter.us.projectsegfau.lt,https://nitter.arcticfoxes.net,https://t.com.sb,https://nitter.kling.gg,https://nitter.ktachibana.party,https://nitter.riverside.rocks,https://ntr.odyssey346.dev,https://nitter.lunar.icu,https://twitter.moe.ngo,https://nitter.freedit.eu,https://ntr.frail.duckdns.org,https://nitter.librenode.org,https://n.opnxng.com,https://nitter.plus.st,https://nitter.in.projectsegfau.lt,
https://nitter.tux.pizza,https://t.floss.media,https://twit.hell.rodeo,https://nitter.edist.ro,https://twt.funami.tech,https://nitter.nachtalb.io,https://n.quadtr.ee,https://nitter.altgr.xyz,https://jote.lile.cl,https://nitter.one
derat commented 1 year ago

Almost all of the non-rewritten URLs that I see have paths of the forms /<username> or /search?q=#...:

% nitter-rss-proxy -format json -instances https://nitter.kylrth.com -user nasa 2>/dev/null | \
    grep -oP 'https?://nitter\.kylrth\.com/[^\\]*'
http://nitter.kylrth.com/chandraxray
http://nitter.kylrth.com/NASA
http://nitter.kylrth.com/BoeingSpace
http://nitter.kylrth.com/search?q=%23Starliner
http://nitter.kylrth.com/NASA_Astronauts
http://nitter.kylrth.com/Space_Station
http://nitter.kylrth.com/BoeingSpace
http://nitter.kylrth.com/search?q=%23Starliner
http://nitter.kylrth.com/Space_Station
http://nitter.kylrth.com/NASA
http://nitter.kylrth.com/search?q=%23Artemis
http://nitter.kylrth.com/NASASTEM
http://nitter.kylrth.com/search?q=%23YourPlaceInSpace
http://nitter.kylrth.com/BoeingSpace
http://nitter.kylrth.com/search?q=%23Starliner
http://nitter.kylrth.com/Space_Station
http://nitter.kylrth.com/Space_Station
http://nitter.kylrth.com/search?q=%23Artemis
http://nitter.kylrth.com/NASA_Orion
http://nitter.kylrth.com/DoNASAScience
http://nitter.kylrth.com/nasa_eyes
http://nitter.kylrth.com/pic/card_img%2F1639869985400193025%2FeVRqQkMJ%3Fformat%3Djpg%26name%3D420x420_2
http://nitter.kylrth.com/search?q=%23Artemis
http://nitter.kylrth.com/NASAArtemis
http://nitter.kylrth.com/POTUS
http://nitter.kylrth.com/csa_asc
http://nitter.kylrth.com/search?q=%23Artemis
http://nitter.kylrth.com/CNES
http://nitter.kylrth.com/JHUAPL
http://nitter.kylrth.com/search?q=%23Dragonfly
http://nitter.kylrth.com/search?q=%23AskAstrobio

It's tricky for the proxy to rewrite most of these URLs when using the https://twiiit.com redirector, since the URLs will have an arbitrary hostname belonging to the underlying Nitter instance. However, it looks like twiiit.com issues a redirect, so the proxy can probably use that to figure out which hostname it needs to look for.

derat commented 1 year ago

Please let me know if you still see non-rewritten URLs in a binary that includes 6d66c12533c68cd22a4bae67ada9cbd41806b6a1 and 4825e4deb2f292e3e0e540ab94e36761b9dbe310.

cxplay commented 1 year ago

OK!

cxplay commented 1 year ago

Today, after testing the new version of the binary, I picked out some instances that still have problems. They all come from the same list of instances as before; everything else has been working fine so far!

Next, I'd like to describe the instances that still have problems. Apart from those with server errors, most of them reverse-proxy or redirect to other Nitter instances, for example:

notabird.site >> nitter.fly.dev
twitter.dr460nf1r3.org >> nitter.garudalinux.org
nitter.privacytools.io >> nitter.net
n.sneed.network >> nitter.sneed.network

There is also a group of instances, apparently run by the same organization, whose links cannot be rewritten directly:

nitter.in.projectsegfau.lt
nitter.projectsegfau.lt
nitter.eu.projectsegfau.lt

Therefore, I recommend replacing the default twiiit.com with the official Nitter instance (nitter.net) in the binary. My testing shows that not all instances from the list are suitable for use by the proxy: some have broken RSS generation even though they work fine as Twitter front-ends (probably due to instance server issues). The official instance, which has been tested with rewrites, is a safer default than a list of uncertain quality.

derat commented 1 year ago

Thanks for the additional testing!

I've changed the default back to nitter.net as you recommended. That was the original default, but I moved away from it in aab0616f73f0f13fa88b2a693772ebb128623570, apparently due to it sometimes returning bogus "user not found" errors. Hopefully whatever the problem was has been fixed -- it worked the few times that I tried it just now.

One other option that I thought about was changing the proxy's code to rewrite all URLs with "nitter" appearing in the hostname. That seems like it would handle all of the instances that you listed, but I'm not sure that it's a good idea.

cxplay commented 1 year ago

If I understand correctly, some Nitter instances don't use domains containing "nitter", so there may be mismatches if the proxy relies on a single keyword match. My suggestion is to provide a list of keywords to check and replace, like the -instances parameter, and possibly to define one or more keywords per instance (since there may be multiple redirects).

derat commented 1 year ago

Hmm, I'd rather not add more configuration and code to handle weird instances. If a particular instance causes problems due to a strange setup, I think it's straightforward enough to just not pass it via the -instances flag.

cxplay commented 1 year ago

Indeed. Also, the idea of "rewriting everything with nitter keywords" is worth considering; perhaps open a new branch to test its feasibility? The current rewrite method already works very well, but new matching rules are also worth trying.

cxplay commented 1 year ago

Also, regarding the "user not found" error: I've seen it on other instances too, so maybe when the proxy detects this output it could move directly on to the next configured instance?

derat commented 1 year ago

Many of the instances listed at https://github.com/zedeus/nitter/wiki/Instances don't have nitter in their hostnames, so I don't think there's much value in adding a rewrite pattern that will only work sometimes.

Regarding "user not found" errors, it probably makes more sense to report this as a Nitter bug (if it's not already reported). I don't want the proxy to send a bunch of extra unnecessary requests whenever someone mistypes a username or a Twitter account is deleted. (If there's some way to distinguish between bogus errors and real ones caused by nonexistent users, I'm happy to add code to move on to the next instance, though.)
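
To make the hostname point above concrete, here is a small sketch of the keyword-based check that was considered. hasNitterHost is a hypothetical helper, not code from the proxy; the sample hosts come from the instance list posted earlier in this thread.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hasNitterHost (hypothetical) reports whether the hostname of rawURL
// contains the keyword "nitter", ignoring case.
func hasNitterHost(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	return strings.Contains(strings.ToLower(u.Hostname()), "nitter")
}

func main() {
	for _, u := range []string{
		"https://nitter.kylrth.com/NASA",
		"https://unofficialbird.com/NASA",
		"https://t.com.sb/NASA",
	} {
		fmt.Println(u, hasNitterHost(u))
	}
}
```

Only the first URL passes the check, so a keyword rule of this kind would leave links from the other two instances unrewritten.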

cxplay commented 1 year ago

Well, so far the problem has been solved, congratulations!

derat commented 1 year ago

Thanks again for reporting this and testing it!