dchang0 / torrentwatch-xa

Resurrection of TorrentWatch-X automatic RSS/Atom torrent episode downloader (broadcatcher) with the extra capability of handling anime torrents
GNU General Public License v2.0

Source web page acquisition timeout too short #17

Closed JohnDarkhorse closed 3 years ago

JohnDarkhorse commented 3 years ago

I've noticed (using the code from 28 Jun 2021) that I'm getting a lot of "source website is not available" messages, without any other content.

I've found that, while using a normal web browser, accessing Cloudflare-hosted websites can sometimes give a "website not available" error (HTTP 503, for example) but then go on to load the content within a second or two.

I'd like to suggest you increase the timeout on whatever mechanism is pulling the target site's content to give it a chance to overcome this Cloudflare boondoggle.

JohnDarkhorse commented 3 years ago

Also, while I'm here.

Have you considered keeping the configuration file in ~/.config/torrentwatch-xa.config ?

This would remove it from being part of the install sequence & allow the single user to edit as needed via text editor.

Alternatively, TWXA could look in both places for a valid config, with the newer (ostensibly in ~/.config) superseding the default.

( I bring this up because having the web-based "favorites" window close with each change made isn't nice )

dchang0 commented 3 years ago

Cloudflare and DDOS-Guard are really difficult to get around. Basically, they are designed to specifically target software like torrentwatch-xa, which they rightly consider to be a "bot."

As for your question, that might be outside of our control. The JavaScript that Cloudflare sends to your browser is what is responsible for reloading the page; we can't change the timeout settings within their JavaScript.

Best we can do is not trigger their JavaScript (in other words, trick them into thinking torrentwatch-xa is not a bot, even though it is). I haven't figured out how to do that yet. If I attempt to do it, I'll use a 3rd party library designed to combat Cloudflare. But it becomes a wild goose chase--as the Cloudflare defeaters get better, Cloudflare gets better, which breaks the Cloudflare defeaters, and so on.

The best way is to ask the operator of the feed to stop putting their RSS feed behind Cloudflare. This is what happened with Nyaa.si. The community complained and the operators of Nyaa moved their RSS feed to a server not behind Cloudflare.

The second-best way is to feed their RSS feed through some other aggregator or to find an aggregator. For instance, AnimeTosho.org is an aggregator for Nyaa, Anidex, and TokyoTosho.info, and since Anidex is behind DDOS-Guard, AnimeTosho.org is the way that I get to Anidex.

dchang0 commented 3 years ago

I have actually been thinking about making it so that the Favorites window can be pinned open or has an Apply button. That's in the long-term TODO list.

As for looking for a custom config, that's kinda a big change too (I would have to read files if found in both places and merge the contents into one).
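To give a rough idea of what "read both files and merge" could involve, here is a minimal sketch. It assumes an INI-style `key = "value"` file, which may not be the format torrentwatch-xa actually uses; `loadMergedConfig()` and the default path are hypothetical names, and only `~/.config/torrentwatch-xa.config` comes from this thread.

```php
<?php
// Hypothetical sketch: read a default config and an optional per-user
// override, then merge them so that user values win. The INI format and
// the default path are assumptions, not torrentwatch-xa's actual layout.

/**
 * Load every readable file in $paths (lowest priority first) and merge
 * the key/value pairs; files later in the list override earlier ones.
 */
function loadMergedConfig(array $paths): array
{
    $merged = [];
    foreach ($paths as $path) {
        if (!is_readable($path)) {
            continue; // a missing file is skipped, not treated as an error
        }
        $values = parse_ini_file($path);
        if ($values === false) {
            continue; // unparsable file: keep what has been loaded so far
        }
        $merged = array_replace($merged, $values);
    }
    return $merged;
}

$config = loadMergedConfig([
    '/path/to/default/torrentwatch-xa.config',           // placeholder path
    getenv('HOME') . '/.config/torrentwatch-xa.config',  // path suggested above
]);
```

Listing the user file last is what makes it supersede the default; a key that appears only in the default file is kept as-is.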

Most likely, I'll do the Favorites with Apply button.

JohnDarkhorse commented 3 years ago

cloudflare

I have been running a brute force torrent grabber (written in bash, pulls the rss feed anew for every single show on the list) and have never encountered any issues connecting to the source site.

Not sure how TWXA is grabbing the rss feed, but is adding a modern user agent to the request something that could be done?

dchang0 commented 3 years ago

Cloudflare's too smart for spoofed User-Agents. TorrentWatch-X spoofs it; torrentwatch-xa inherited that behavior, but when I realized it didn't help at all, I set the User-Agent to correctly call itself by name.

You can overwrite the User-Agent if you like by changing one line in the source code. But I doubt it will work. Or if it does work, it won't work for long, because Cloudflare will deduce that torrentwatch-xa is a bot by its behavior (contacting every 15 minutes from the same IP address is a dead giveaway).

JohnDarkhorse commented 3 years ago

I'm not familiar with git at all.

Do I just clone this stack and make my changes to it and poke you?

 

For example:

Example 1: The Glob Filter "zombie*land" will match:

Zombieland Saga

Zombie Land Saga

This is incorrect, as "zombie*land" will not match "Zombieland Saga". The asterisk stands for any single character or combination of characters (letters or numerals), and there is no space (a space counts as a character) in "zombieland".

Now "zombi*land" would match both "zombieland" and "zombie land", as the asterisk would fill in the "e" and the space after it.

dchang0 commented 3 years ago

Yep. I don't really use GitHub or git properly either. For now just post your code changes to this issue.

JohnDarkhorse commented 3 years ago

For the glob example:

Example 4: The glob filter "Greyman*[Group1 || Group2]" will match & download the first "Greyman" that comes along, from either group.

This can also work for resolutions: "Greyman*[1080p || 720p || webrip]" where the first "Greyman" that came along matching one of those resolutions would be matched & downloaded.

dchang0 commented 3 years ago

Interesting--have you tested the resolutions one? It shouldn't work (the resolution is stripped from the title in the matching process).

JohnDarkhorse commented 3 years ago

> Interesting--have you tested the resolutions one? It shouldn't work (the resolution is stripped from the title in the matching process).

We're discussing the "Filters" field, are we not?

 

Unrelated to the above, but is adding support for qbittorrent just a matter of finding & replacing the transmission entries?

dchang0 commented 3 years ago

Correct.

Filter: should only operate on the title (after the title has been stripped of the resolutions and qualities). Quality: should operate only on the resolutions and qualities.

If it turns out that Filter also works on resolutions and qualities, then something is wrong in my code or my design of the Filter: and Quality: logic.

JohnDarkhorse commented 3 years ago

Well, my intent was to point out the [item1 || item2] "or" array that can be used with the "glob" setting.

You write it how it works (I know the group names example works [at least it works for me])

dchang0 commented 3 years ago

No worries. You may have just discovered a new bug--that's actually a good thing. I'll have to figure out why it works when it shouldn't work.

I'll roll your example into the Glob documentation. Thanks!!

dchang0 commented 3 years ago

I've confirmed that the [item1||item2] works--nice find! (It's not mentioned in any of the documentation for the glob pattern language that I have found so far, but it works.)

I will definitely include it as Example 4.

dchang0 commented 3 years ago

Okay, I've added your example 4 to USAGE.md.

Thanks for your contribution!

I will test out the resolutions bug later.

JohnDarkhorse commented 3 years ago

You've removed the spaces from the "or" operator (I've no idea if it'll work that way or not); see my earlier comment (https://github.com/dchang0/torrentwatch-xa/issues/17#issuecomment-871794805). It should be OPENBRACKET ITEM1 SPACE PIPE PIPE SPACE ITEM2 CLOSEBRACKET.

 

Also, glob example #1 in USAGE.md is still wrong; see https://github.com/dchang0/torrentwatch-xa/issues/17#issuecomment-871790228

dchang0 commented 3 years ago

Re: Glob example 1: it is correct, because * can match zero characters.

See https://www.php.net/manual/en/function.glob.php, which documents * as matching zero or more characters.

So, Zombie*land will match Zombieland.

You can test it with this simple PHP script that uses the same fnmatch() function that torrentwatch-xa uses.

globtest.php:

    <?php
    $title = "Zombieland Saga";
    $pattern = "Zombie*land Saga";
    if (fnmatch($pattern, $title)) {
        print "hit!\n";
    } else {
        print "miss!\n";
    }

This will print "hit!" when run.

Similarly, regarding the removal of the spaces around the pipe signs:

    <?php
    $title = "Grayman Erai-raws";
    $pattern = "Grayman*[Erai-raws||SSA]";
    if (fnmatch($pattern, $title)) {
        print "hit!\n";
    } else {
        print "miss!\n";
    }

This prints "hit!"

And so does this:

    <?php
    $title = "Grayman Erai-raws";
    $pattern = "Grayman*[Erai-raws || SSA]";
    if (fnmatch($pattern, $title)) {
        print "hit!\n";
    } else {
        print "miss!\n";
    }

And so does this:

    <?php
    $title = "Grayman SSA";
    $pattern = "Grayman*[Erai-raws || SSA]";
    if (fnmatch($pattern, $title)) {
        print "hit!\n";
    } else {
        print "miss!\n";
    }

So the spaces around the pipes are optional, apparently.

dchang0 commented 3 years ago

BTW, these are the exact lines of code used to do the Glob matching.

It starts at line 98 of twxa_feed.php.

    case 'glob':
        $hit = (($item['Filter'] !== '' && fnmatch(strtolower($item['Filter']), $ti)) &&
                ($item['Not'] === '' OR!fnmatch(strtolower($item['Not']), $ti)) &&
                (strtolower($item['Quality']) == 'all' OR $item['Quality'] === '' OR strpos($ti, strtolower($item['Quality'])) !== false));
        break;

Note that we use strtolower() so that all uppercase characters are replaced with lowercase (this effectively makes the matching case-insensitive). $ti (the title) is already converted to lowercase with strtolower() earlier. Also note that Quality is separate from Filter (but there could be a bug, so I need to investigate this by submitting various test titles and watching the variables).
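For trying out test titles, the excerpt can be wrapped in a small standalone function. This is only a sketch: `globHit()` is a hypothetical name, the `$item` array just mirrors the keys visible in the excerpt (`Filter`, `Not`, `Quality`), and the titles are made up. It lowercases the title itself because the excerpt relies on `$ti` having been lowercased earlier.

```php
<?php
// Sketch only: the 'glob' case from the excerpt wrapped in a function so
// that test titles can be tried from the command line. globHit() is a
// hypothetical name; $item mirrors the keys used in the excerpt.

function globHit(array $item, string $title): bool
{
    $ti = strtolower($title); // the excerpt expects $ti to be lowercase already

    return ($item['Filter'] !== '' && fnmatch(strtolower($item['Filter']), $ti))
        && ($item['Not'] === '' || !fnmatch(strtolower($item['Not']), $ti))
        && (strtolower($item['Quality']) === 'all'
            || $item['Quality'] === ''
            || strpos($ti, strtolower($item['Quality'])) !== false);
}

// Made-up favorite and titles, for illustration only.
$favorite = ['Filter' => 'Zombie*land Saga*', 'Not' => '*batch*', 'Quality' => '720p'];

var_dump(globHit($favorite, 'Zombieland Saga 05 720p'));    // Filter and Quality both match
var_dump(globHit($favorite, 'Zombie Land Saga 05 1080p'));  // "720p" is not in the title
var_dump(globHit($favorite, 'Zombieland Saga batch 720p')); // excluded by the Not pattern
```

Note that in the excerpt the Filter pattern and the Quality substring are both tested against the same `$ti`, which may be worth keeping in mind when investigating the resolutions question.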

dchang0 commented 3 years ago

For faster testing and design of Glob patterns, this website should work.

https://www.digitalocean.com/community/tools/glob?comments=true&glob=Zombie%2Aland%20Saga&matches=false&tests=Zombieland%20Saga

However, I don't know if they use PHP's version of the Glob pattern language. Note that there are subtle differences between everyone's implementation of the Glob pattern language, so it's safer to stick with PHP's fnmatch() function since that's exactly what we use.

But maybe the DigitalOcean tool is useful anyway, since the subtle differences are rare.

JohnDarkhorse commented 3 years ago

Well, I started digging up "glob" and found it is officially called "Parameter Expansion"

Such research revealed the two pipes "or" usage.

Guess I will go looking at "fnmatch()" . . .

dchang0 commented 3 years ago

Wow, that's pretty advanced then. That would explain why none of the glob documentation I've seen so far uses double-pipes or square brackets to match entire words (square brackets are supposed to only match a single character, same as in PCRE language).

Well, hey, if it works, it works. What matters to us is if you can deterministically get the torrents you want.

Sadly, the official PHP documentation on both fnmatch() and glob() is much too sparse. I wish they'd at least go into detail since they're the authority on how they implemented the Glob pattern language in the PHP language. No one else will know as much about the design compromises they made as they do.
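Since the documentation is sparse, one way to check whether the bracket part matches whole words or only a single character (as square brackets are supposed to) is to probe fnmatch() with a few extra titles. The sketch below reuses the pattern from the earlier tests; the two additional titles are made up for illustration.

```php
<?php
// Sketch: probe whether "[Erai-raws||SSA]" behaves as an OR of two words
// or as a bracket expression that matches exactly one character.
// The last two titles are invented test cases.

$pattern = "Grayman*[Erai-raws||SSA]";

$titles = [
    "Grayman Erai-raws",    // ends in "s"
    "Grayman SSA",          // ends in "A"
    "Grayman OtherGroups",  // a different group name that also ends in "s"
    "Grayman Erai-raws v2", // contains a listed group name but ends in "2"
];

foreach ($titles as $title) {
    printf("%-22s %s\n", $title, fnmatch($pattern, $title) ? "hit!" : "miss!");
}
```

If the brackets were a word-level "or", the third title would miss and the fourth would hit. With the single-character reading, the `*` absorbs everything up to the last character and the brackets only test that final character, so the third title should hit and the fourth should miss. Results can be compared against whichever PHP build torrentwatch-xa runs on.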
