Closed: iandioch closed this issue 8 years ago
I'm interpreting this as the general issue for the scraper, since that's basically what it's doing and there doesn't seem to be a specific issue for that.
This is the task of loading a web page that was listed in a feed, and picking out the text of the article
Okay, I'll create an issue for the Scraper in that case.
On 7 Feb 2016 7:48 pm, "Noah Ó Donnaile" notifications@github.com wrote:
This is the task of loading a web page that was listed in a feed, and picking out the text of the article
— Reply to this email directly or view it on GitHub https://github.com/CPSSD/feedlark/issues/27#issuecomment-181093686.
Define "scraper" there?
No, actually, maybe this is the right issue. Yeah, basically the Scraper is the thing you give the URL of a feed, and it gives back the title, date and article text in some sort of dict.
That's not the purpose of the program described in this issue. The program described here will be given the URL of a webpage and will use BeautifulSoup (or an equivalent) to parse the HTML of that page and grab the words of the article. It will only be run if the article content is not inlined in the RSS feed. What you seem to be describing is the module that will load the RSS itself and parse it, which is something that happens before the program in this issue is required.
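To make the page-parsing half concrete, here is a minimal sketch of "load a page and grab the words of the article". It is only an illustration, not the project's code: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra dependencies, it assumes the article text lives in `<p>` tags (a real page will need smarter guessing), and the names `ParagraphTextExtractor`, `extract_article_text` and `fetch_html` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphTextExtractor(HTMLParser):
    """Collects text found inside <p> tags, ignoring <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs, in document order
        self._chunks = []      # text pieces of the paragraph being read
        self._in_p = False
        self._skip_depth = 0

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            # A new <p> implicitly ends any unclosed previous one.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._chunks.append(data)


def extract_article_text(html):
    """Return the page's paragraph text, one paragraph per line."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing paragraph that was never closed
    return "\n".join(parser.paragraphs)


def fetch_html(url):
    """Download a page and decode it (assumes UTF-8 if no charset is sent)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would be something like `extract_article_text(fetch_html(entry_url))`, where `entry_url` is whatever link the feed entry gave us. Navigation, scripts and styles are dropped simply because they are not inside `<p>` tags, which is the crude part that would need to grow.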
Although I think we should call both programs Scraper. Ambiguity is hipster, isn't it?
Well, since we're using a library for parsing RSS, having them as separate issues is just cutting one function in a file in half.
They were separate because:
1. Originally they were to be in totally separate languages.
2. The RSS polling happens regularly; this only happens whenever there is a new entry in a feed and that entry does not have inlined article content.
3. They really don't have anything in common when you look at it.
4. Modularity, man.
5. The page parsing (the stuff in this issue) might end up becoming quite complicated, guessing which parts of the HTML are the article and which aren't. It might end up being quite a large module in itself.
You bring up some well numbered points. We can talk about this best at tomorrow's stand-up or tonight's call :+1:
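The second point in the list above (the page scraper only runs for a new entry without inlined content) could look roughly like this. It is a sketch under assumptions: feed entries are treated as plain dicts, the keys `"content"` and `"link"` are hypothetical, and `has_inlined_content` and `article_text_for` are names invented for the example.

```python
def has_inlined_content(entry):
    """True if the feed entry already carries non-empty article text."""
    content = entry.get("content")
    return bool(content and content.strip())


def article_text_for(entry, scrape_page):
    """Return the entry's article text, scraping the page only if needed.

    `scrape_page` is any callable that takes a URL and returns text. It is
    passed in so the two modules stay separate and the slow path is easy
    to stub out.
    """
    if has_inlined_content(entry):
        return entry["content"]
    return scrape_page(entry["link"])
```

Keeping the check on the feed side means the page parser never needs to know anything about RSS, which is the modularity argument in a nutshell.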
Markdown auto numbers ❤
Python - beautifulsoup