Parse html to grab article content

iandioch commented 8 years ago

Python - BeautifulSoup

CianLR commented 8 years ago

I'm interpreting this as the general issue for the scraper, as that's basically what it's doing and there doesn't seem to be a specific issue for that.

iandioch commented 8 years ago

This is the task of loading a web page that was listed in a feed and picking out the text of the article.

CianLR commented 8 years ago

Okay, I'll create an issue for the Scraper in that case.

On 7 Feb 2016 7:48 pm, "Noah Ó Donnaile" notifications@github.com wrote:

This is the task of loading a web page that was listed in a feed, and picking out the text of the article

— Reply to this email directly or view it on GitHub https://github.com/CPSSD/feedlark/issues/27#issuecomment-181093686.

iandioch commented 8 years ago

Define "scraper" there?

CianLR commented 8 years ago

No, actually, maybe this is the right issue. Yeah, basically the Scraper is the thing you give the URL of a feed, and it gives back the title, date and article text in some sort of dict.
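A minimal sketch of the "some sort of dict" idea, for illustration only. It assumes a plain RSS 2.0 document and uses the standard library's `xml.etree.ElementTree` rather than whichever RSS library the project settles on. The function name `parse_feed` and the dict keys are hypothetical, and the feed is passed in as a string so the example does not touch the network.

```python
import xml.etree.ElementTree as ET


def parse_feed(rss_xml):
    """Return one dict per <item> in an RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(rss_xml)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "date": item.findtext("pubDate", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 puts inline content in <description>; it may be absent.
            "text": item.findtext("description", default=""),
        })
    return entries
```

An entry whose `"text"` comes back empty is the case the page-parsing program in this issue would have to handle.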

iandioch commented 8 years ago

That's not the purpose of the program described in this issue. The program described here will be given the URL of a webpage and will use BeautifulSoup (or equivalent) to parse the HTML of that page and grab the words of the article. It will only be run if the article content is not inlined in the RSS feed. What you seem to be describing is the module that will load the RSS itself and parse it, which is something that happens before the program in this issue is required.
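A rough sketch of that flow, not the project's actual code. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup (the "or equivalent" case). The names `TextExtractor`, `page_text` and `article_text`, the `"text"` and `"link"` keys, and the `fetch_page` callable are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return all visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def article_text(entry, fetch_page):
    """Use inlined feed content when present; otherwise parse the page.

    `entry` is a dict with hypothetical "text" and "link" keys, and
    `fetch_page` is any callable that maps a URL to an HTML string.
    """
    if entry.get("text"):
        return entry["text"]
    return page_text(fetch_page(entry["link"]))
```

Passing `fetch_page` in as a parameter keeps the download step separate from the parsing, so the parsing can be exercised without network access.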

iandioch commented 8 years ago

Although I think we should call both programs Scraper; ambiguity is hipster, isn't it?

CianLR commented 8 years ago

Well, since we're using a library for parsing RSS, having them as separate issues is just cutting one function in a file in half.

iandioch commented 8 years ago

They were separate because:

1. Originally they were to be in totally separate languages.
2. The RSS polling happens regularly; this only happens whenever there is a new entry in a feed and that entry does not have inlined article content.
3. They really don't do anything in common when you look at it.
4. Modularity, man.
5. The page parsing (stuff in this issue) might end up becoming quite complicated, guessing which parts of the HTML are the article and which aren't. It might end up being quite a large module in itself.
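To give a feel for the "guessing" in the last point, here is one possible heuristic, offered purely as an illustration: pick the container element whose own paragraphs hold the most text. Nothing in this thread specifies that approach; the class `ArticleGuesser`, the function `guess_article` and the list of container tags are assumptions, and the standard library's `html.parser` again stands in for BeautifulSoup.

```python
from html.parser import HTMLParser


class ArticleGuesser(HTMLParser):
    """Group <p> text by its innermost enclosing container element."""

    CONTAINERS = {"article", "main", "section", "div", "body"}

    def __init__(self):
        super().__init__()
        self._stack = []       # open containers, each a list of paragraphs
        self._in_p = False
        self._buf = []
        self.candidates = []   # closed containers, each a list of paragraphs

    def _flush(self):
        # Store the paragraph being read (if any) in the innermost container.
        if self._in_p:
            text = " ".join("".join(self._buf).split())
            if text and self._stack:
                self._stack[-1].append(text)
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # copes with an unclosed previous <p>
            self._in_p = True
        elif tag in self.CONTAINERS:
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag in self.CONTAINERS and self._stack:
            self._flush()
            self.candidates.append(self._stack.pop())

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def guess_article(html):
    """Return the paragraphs of the container with the most paragraph text."""
    parser = ArticleGuesser()
    parser.feed(html)
    parser.close()
    best = max(parser.candidates,
               key=lambda paras: sum(len(p) for p in paras),
               default=[])
    return "\n\n".join(best)
```

A heuristic this simple will misfire on pages with long comment sections or heavily nested markup, which is the kind of complication that could make this module grow.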

CianLR commented 8 years ago

You bring up some well-numbered points. We can talk about this best at tomorrow's stand-up or tonight's call :+1:

iandioch commented 8 years ago

Markdown auto numbers ❤
