IanQS closed this 6 years ago.
@bryantpq Awesome job overall!
I've reviewed it, added the threading, and handled the abstraction. I'd say all that's remaining now is:
[x] for you to address the things I tagged you in (scan this page and open the "show outdated" links).
[x] for you to do a quick review (on top of my review)
[ ] for me to test it really quick
and we can merge this + move on to other things :)
@IanQS The scraper should be working. To run it (at least how I've been doing it): make sure `src/constants.py` and `src/news_api_key.py` have been set up, then from `krak_trader/src` run:

```
python3 -m kraken_scraper.news_data.news_api_scraper
```

If you know any other ways to run the scraper then lmk.
@bryantpq
Sweet! Was just making sure things were running. Things are going well now. Some final things to address:
1) Coin Telegraph errors (CCN and TechCrunch work fine):
```
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 66, in spawn
    return cls(src_name)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 36, in __init__
    self.get_articles()
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 81, in get_articles
    self.process(queries)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 114, in process
    processed_data = self._process(query, soup)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 91, in _process
    config = SITE_CONF[query['source']['id']]
KeyError: None
```
I think the error is because I wasn't filtering sources based on whether they are NewsAPI-compatible, but I'd like to hear your opinion; see the sketch after point 2.
2) We need to standardize how we name stored URLs. The scraper is picking up previously scanned articles because the name mapping is all funky (we replaced '/' with '==='). Can I leave this to you?
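Roughly the guard I mean for 1), sketched against the traceback above (everything outside the `SITE_CONF` lookup is an assumption):

```python
# Sketch of the guard: skip articles whose source id NewsAPI doesn't
# recognize, instead of letting SITE_CONF[...] raise KeyError: None.
def _process(self, query, soup):
    source_id = query['source']['id']   # None for sources NewsAPI can't identify
    if source_id is None or source_id not in SITE_CONF:
        return None                     # assumed: caller skips this article
    config = SITE_CONF[source_id]
    # ... existing per-site parsing continues unchanged ...
```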
@IanQS
Coin Telegraph really shouldn't be working at all, since it is not part of NewsAPI. It would only work if NewsAPI somehow returned a Coin Telegraph URL during the query. It's strange that NewsAPI seems capable of returning results for a source it does not support. We can fix this by querying NewsAPI's sources endpoint to check whether it supports a site; if a site is not supported, we can instantiate the non-API scraper (when we implement it).
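Something like this is what I have in mind; the endpoint and response shape are from NewsAPI's docs, but the helper name and key handling are just illustrative:

```python
import requests

def newsapi_supported_ids(api_key):
    """Return the set of source ids NewsAPI claims to support."""
    resp = requests.get("https://newsapi.org/v2/sources",
                        params={"apiKey": api_key})
    resp.raise_for_status()
    return {src["id"] for src in resp.json()["sources"]}

# Hypothetical dispatch once the non-API scraper exists:
# supported = newsapi_supported_ids(NEWS_API_KEY)
# scraper = NewsApiScraper(name) if name in supported else NonApiScraper(name)
```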
Sure, I can handle this. I will store URLs in their original form in seen_data.txt, and npz files will have '/' characters mapped to '%2F', since that's the standard percent-encoding of '/' (RFC 3986).
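A quick sketch of the mapping with a made-up URL, using the standard library:

```python
from urllib.parse import quote, unquote

url = "https://www.example.com/some/article"  # made-up URL for illustration
fname = quote(url, safe='') + ".npz"          # '/' -> '%2F' (':' becomes '%3A' too)
original = unquote(fname[:-len(".npz")])      # round-trips back to the URL
assert original == url
```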
> We can fix this by querying NewsAPI's sources endpoint to check whether it supports a site
Yeah, I thought to do that, but to some extent it makes more sense to just iterate through the config and filter based on whether each source is NewsAPI-compatible, instead of querying the site, iterating through all of that, and checking compatibility. Of course, to generate the config in the first place we'd need to query the site and manually check, but ¯\_(ツ)_/¯ we'll make it a future-"us" problem. Something like the sketch below.
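Sketch of the filtering I mean (the `newsapi` flag and the import path are assumptions about site_configs.py, not its actual schema):

```python
from kraken_scraper.news_data.site_configs import SITE_CONF  # assumed path

# Partition sources by an assumed per-site compatibility flag.
api_sources = [s for s, conf in SITE_CONF.items() if conf.get("newsapi", False)]
other_sources = [s for s in SITE_CONF if s not in api_sources]
```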
@IanQS
Test the scraper now. Set the `limited` variable in launch_scrapers.py to True/False so that the scraper either halts when it sees a URL already in seen_data.txt, or continues running regardless.
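Behaviour-wise the toggle works like this (illustrative sketch; every name besides `limited` is assumed):

```python
limited = True        # True: halt on the first already-seen URL
                      # False: keep running regardless

seen_urls = set()     # assumed: loaded from seen_data.txt
fetched_urls = []     # assumed: URLs returned by a NewsAPI query

for url in fetched_urls:
    if url in seen_urls:
        if limited:
            break                       # halt: caught up to old articles
        continue                        # unlimited: skip it, keep scraping
    print("would scrape", url)          # stand-in for the real scrape call
```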
Also note the change in how npz files are saved now, i.e. '/' characters are now mapped to '%2F'.
The scraper in its current iteration might actually be "done"; all that's left to do now is add more configs to site_configs.py for a large-scale test run. Everything else on top of it would be enhancements we can make at a later date. #27
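For whoever adds configs, a new entry might look roughly like this (field names are guesses, not the real schema; only the source-id keying is implied by the traceback above):

```python
# Hypothetical SITE_CONF entry, keyed by NewsAPI source id.
SITE_CONF = {
    "techcrunch": {
        "newsapi": True,                             # supported by NewsAPI
        "body_tag": "div",                           # tag holding article text
        "body_attrs": {"class": "article-content"},  # attrs for BeautifulSoup
    },
}
```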
After Bryan makes the changes, I want @wrhm to review it and raise any questions. This will be a good "smell test" for our code, to see if anything seems fishy to an "outsider". The cleaner and easier to understand our code is, the better.
> I want @wrhm to review it and raise any questions
@IanQS does "it" here refer to the full PR? Can do. 🙂
High-level inherited interface for news writing (rough sketch after the list below).
[x] Return type checking
[x] Read through ATBS to get a better idea of what abstractions should be present
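One shape the interface could take (purely illustrative; class and method names are assumptions, not the repo's actual API):

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class NewsScraper(ABC):
    """Assumed base class: per-site scrapers inherit and implement fetching."""

    @abstractmethod
    def get_articles(self) -> List[Dict[str, str]]:
        """Fetch and return parsed articles; return type checked by callers."""

    def write(self, articles: List[Dict[str, str]]) -> None:
        # Shared persistence lives in the base class so subclasses only parse.
        for article in articles:
            self._save(article)

    @abstractmethod
    def _save(self, article: Dict[str, str]) -> None:
        """Persist one article (e.g. as an npz file)."""
```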