Swader / diffbot-php-client

[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
MIT License
53 stars 20 forks source link

PHP Notices generated on Article\getDate() if no date has been identified #54

Closed jonathantullett closed 7 years ago

jonathantullett commented 7 years ago

When using getDate() on an Article entity, if no date has been identified, the following error is generated: PHP Notice: Undefined index: date in /path/to/lib/vendor/swader/diffbot-php-client/src/Entity/Article.php on line 65

This is:

    public function getDate()
    {
        return (class_exists('\Carbon\Carbon')) ?
            new \Carbon\Carbon($this->data['date'], 'GMT') :
            $this->data['date'];
    }

Refactoring to this resolves it:

    public function getDate()
    {
        $date = isset($this->data['date']) ? $this->data['date'] : null;
        return (class_exists('\Carbon\Carbon')) ?
            new \Carbon\Carbon($date, 'GMT') : $date;
    }

Can you apply if it's in the style you want, else do similar?

Thanks,

Swader commented 7 years ago

The refactoring would return the current date, which is not something that would be useful data-wise. Can you give me the link of the article that failed to produce a date so I can see why? It's likely something needs to be fixed on Diffbot's end about that.

The modification that would fix this would be to just return null if no date exists, rather than a Carbon instance of current time, but I'm also looking at making a "data scanner" which would detect these anomalies and report them to me and Diffbot HQ (articles without dates, products without prices, etc) for better handling.

jonathantullett commented 7 years ago

Apologies for the delay on getting back to you on this.

I have created a list of ~3000 articles which I crawled but no date was extracted from. Could you take a look at see the best way to proceed on this? http://data.jonathantullett.com/articles_with_no_date_extracted.list

I agree that defaulting to 'now' isn't the right thing to do, but we need a higher hit rate on date extraction.

(not all those urls are articles so we need to tune our parameters, but enough are that should be detected by diffbot). Thanks for your help!

Swader commented 7 years ago

Thanks for the list - on posts that simply have no date data anywhere, extraction won't be possible. We'll do our best with all the others, though, and add them to our training set. I'll close this issue for now.

jonathantullett commented 7 years ago

I'm not sure about closing this just yet - a PHP Notice shouldn't be generated during normal execution. Can it be tweaked to return null if there's no date? (I have it locally, but I'd prefer not to have custom patches to code I'm installing through composer).

Thanks!

Swader commented 7 years ago

Good point, I definitely shouldn't assume date exists on all entities. Let me make a quick fix.

Swader commented 7 years ago

Published a new release, 2.0.5.