XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0
278 stars 93 forks source link

tv_grab_na_dtv #82

Closed ejonesnospam closed 4 years ago

ejonesnospam commented 4 years ago

Summary: Add timeout, agent to command line options. Improve error control

What type of Pull Request is this?

Does this PR close any currently open issues?

tv_grab_na_dtv stopped working #80

Please explain what this PR does

Adds timeout and browser agent string to command line options. Improves error control for any malformed or incomplete JSON response. Previously, the code would fault, and not save any output. Now, the code continues after providing an appropriate log.

Any other information?

Where have you tested these changes?

Operating System: debian 10

Perl Version: … perl 5, version 24, subversion 1 (v5.24.1)

ejonesnospam commented 4 years ago

Personally, I don't care what the default User-Agent is. The user can change the user-agent via the command line with this code pull.

During my testing, I found DirecTV was periodically limiting data requests, sometimes appearing to limit based off User-Agent and not source IP address.

knowledgejunkie commented 4 years ago

@ejonesnospam Thank you for your contribution.

I'm working through the PR to try and understand what is needed and what the issue is upstream. I just retested the unpatched grabber with 4 channels enabled (different to those in @MaximVol config in #80) and it ran without problems. With the config in #80 it failed.

Some initial comments:

UserAgent

We wouldn't add an arbitrary default UserAgent string to lib/Options.pm. The standard XMLTV UserAgent string for the current stable release is "xmltv/0.6.1", and is set automatically when using XMLTV::Get_nice.

If a remote site explicitly blocks this UA, it can be a sign that XMLTV is not welcome (or that the remote site is not configured to accept connections from non-browser agents it doesn't understand...).

Grabbers creating their own LWP::UserAgent should use the standard XMLTV UserAgent string.

Timeout

The default LWP::UserAgent timeout is 180s. Most grabbers work fine with this. If a specific grabber requires more than 3 minutes for a response to a request, I take the view it should be grabber-specific.

Adding the timeout and agent support to the "baseline" listref in lib/Options.pm would require that all grabbers support them.

Refs: #80

ejonesnospam commented 4 years ago

First, I hope everyone is being safe and staying sane. I'm good, just occasionally going stir-crazy.. under lock down with my family and farting, diabetic cat.. ;)

Nick, I'm not sure any web scrapper/grabber is ever "welcome" by a site administrator. [More like how you tolerate in-laws, right?]

At any given moment, a web service can implement numerous thwarting techniques to block XMLTV from success. As outsiders, we can only only really guess or infer what roadblocks are active at any given time. Constraining a user to a specific option value [User Agent] is befuddling to me. Why would providing a user with more runtime options not be preferred?

I can only speak from my usage and testing experience... that these options allowed me to continue to utilize the hard work from countless XMLTV developers, and retrieve the guide data I needed with a near 0% failure rate.

If the options were specific only for 'tv_grab_na_dtv', would that be more acceptable by code owners?

Cheers

On Sun, Mar 29, 2020 at 12:16 PM Nick Morrott notifications@github.com wrote:

@ejonesnospam https://github.com/ejonesnospam Thank you for your contribution.

I'm working through the PR to try and understand what is needed and what the issue is upstream. I just retested the unpatched grabber with 4 channels enabled (different to those in @MaximVol https://github.com/MaximVol config in #80 https://github.com/XMLTV/xmltv/issues/80) and it ran without problems. With the config in #80 https://github.com/XMLTV/xmltv/issues/80 it failed.

Some initial comments:

UserAgent

We wouldn't add an arbitrary default UserAgent string to lib/Options.pm. The standard XMLTV UserAgent string for the current stable release is "xmltv/0.6.1", and is set automatically when using XMLTV::Get_nice.

If a remote site explicitly blocks this UA, it can be a sign that XMLTV is not welcome (or that the remote site is not configured to accept connections from non-browser agents it doesn't understand...).

Grabbers creating their own LWP::UserAgent should use the standard XMLTV UserAgent string.

Timeout

The default LWP::UserAgent timeout is 180s. Most grabbers work fine with this. If a specific grabber requires more than 3 minutes for a response to a request, I take the view it should be grabber-specific.

Adding the timeout and agent support to the "baseline" listref in lib/Options.pm would require that all grabbers support them.

Refs: #80 https://github.com/XMLTV/xmltv/issues/80

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/XMLTV/xmltv/pull/82#issuecomment-605685315, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHQG6QKMZX3NDU5UQWMJYTRJ6NAJANCNFSM4KGWCSYA .

rmeden commented 4 years ago

Generally, if scraping is ignored and not specifically against a websites terms of service, the grabber is allowed.

If a grabber is explicitly against the sites terms of service or there is evidence it is actively blocked, then the grabber is removed. No one wants to get involved in a legal mess.

knowledgejunkie commented 4 years ago

@ejonesnospam @MaximVol @dimitry-ishenko

Please test the updates to the grabber I've just pushed, maintaining ejones's authorship. I've amended the PR to keep the timeout and agent string overrides local to the grabber, and kept the raised default timeout at 240s.

ejonesnospam commented 4 years ago

tv_grab_na_dtv is working great for me with 0.6.1/master branch. Download sizes are exactly what I expect.

Thanks for the effort, and nice work!

On Sun, Mar 29, 2020 at 5:30 PM Nick Morrott notifications@github.com wrote:

@ejonesnospam https://github.com/ejonesnospam @MaximVol https://github.com/MaximVol @dimitry-ishenko https://github.com/dimitry-ishenko

Please test the updates to the grabber I've just pushed, maintaining ejones's authorship. I've amended the PR to keep the timeout and agent string overrides local to the grabber, and kept the raised default timeout at 240s.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/XMLTV/xmltv/pull/82#issuecomment-605726718, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABHQG6SC5NXIN2536CRSKYLRJ7R2XANCNFSM4KGWCSYA .