cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

Retrieve date, created_on and updated_on #69

Closed pranav7 closed 10 years ago

pranav7 commented 10 years ago

How can I extract the dates on which the retrieved page was 'created' and 'updated'? I tried calling a 'date_published' method, named after the JSON element exposed by the Readability Parser API, but of course that did not work.

I am not exactly sure if there is already a way to do it, but if there isn't, it would be great if we can have a method that does this. However, if there is, this is not exactly an Issue.

cantino commented 10 years ago

HTML documents don't have a standard "published at" or "created at" element, at least not one that is reliably present.

pranav7 commented 10 years ago

Agreed. Is there some way I could access the 'date_published' JSON field that is exposed by the Parser API?

https://www.readability.com/developers/api/parser

ghost commented 10 years ago

No, ruby-readability is a port of the older, open-source JavaScript Readability library. The newer features of the hosted Readability service are not available here, and I don't know how they determine the date_published metadata. Either they have a list of common tags to look for, or they're referring to the date that they fetched the data. You could try adding something like that to this library, or look into using their API.
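
For anyone who wants to try the "list of common tags" idea, here is a minimal sketch. It is not part of ruby-readability and is not how the Parser API computes date_published (that is unknown). The method names (`meta_tags`, `first_date`, `guess_dates`) and the key lists are hypothetical, chosen only for illustration, and the regex scan is used just to keep the example free of dependencies.

```ruby
require "time"

# Hypothetical lists of <meta> keys that often carry dates (lowercased).
# These are assumptions for illustration, not an exhaustive or official set.
PUBLISHED_KEYS = %w[article:published_time datepublished date dc.date].freeze
MODIFIED_KEYS  = %w[article:modified_time datemodified last-modified].freeze

# Collect <meta> tags into a hash of { lowercased key => content }.
# A regex scan keeps this sketch dependency-free; a real patch would more
# likely query the parsed document instead.
def meta_tags(html)
  html.scan(/<meta\b[^>]*>/i).each_with_object({}) do |tag, found|
    attrs = tag.scan(/([\w:-]+)\s*=\s*["']([^"']*)["']/)
               .map { |name, value| [name.downcase, value] }
               .to_h
    key = attrs["property"] || attrs["name"] || attrs["itemprop"]
    found[key.downcase] ||= attrs["content"] if key && attrs["content"]
  end
end

# Return the first candidate that parses as a Time, or nil if none does.
def first_date(tags, keys)
  keys.each do |key|
    value = tags[key]
    next unless value
    begin
      return Time.parse(value)
    rescue ArgumentError
      next # unparseable value, try the next candidate key
    end
  end
  nil
end

# Best-effort guess; either entry may be nil because many pages
# simply do not expose this metadata.
def guess_dates(html)
  tags = meta_tags(html)
  { published: first_date(tags, PUBLISHED_KEYS),
    updated:   first_date(tags, MODIFIED_KEYS) }
end
```

The same HTML string that is passed to `Readability::Document.new` could be passed to `guess_dates`. Expect `nil` for a large share of pages, which is the reliability problem mentioned earlier in this thread.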

pranav7 commented 10 years ago

Aah, alright! I'll surely contribute if I figure something out. Thanks anyway. :smiley: