IanQS closed this 6 years ago.
@bryantpq Awesome job overall!
I've reviewed it, added the threading, and handled the abstraction. I'd say all that's remaining now is:
[x] for you to address the things I tagged you in (scan this page and open the "show outdated" links).
[x] for you to do a quick review (on top of my review)
[ ] for me to test it really quick
and we can merge this + move on to other things :)
@IanQS The scraper should be working. To run it (at least how I've been doing it): make sure `src/constants.py` and `src/news_api_key.py` have been set up, then from `krak_trader/src` run:

```
python3 -m kraken_scraper.news_data.news_api_scraper
```

If you know any other ways to run the scraper then lmk.
@bryantpq
Sweet! Was just making sure things were running. Things are going well now. Some final things to address:
1) Coin Telegraph errors (CCN and TechCrunch work fine):
```
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 66, in spawn
    return cls(src_name)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 36, in __init__
    self.get_articles()
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 81, in get_articles
    self.process(queries)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 114, in process
    processed_data = self._process(query, soup)
  File "/home/itq/Desktop/krak_trader/src/kraken_scraper/news_data/news_api_scraper.py", line 91, in _process
    config = SITE_CONF[query['source']['id']]
KeyError: None
```
I think the error is because I wasn't filtering sources based on whether they are NewsAPI-compatible, but I'd like to hear your opinion; see the sketch after point 2.
2) We need to standardize how we name stored URLs. The scraper is picking up previously scanned articles because the name mapping is all funky (we replaced '/' with '==='). Can I leave this to you?
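Roughly the guard I mean for 1), sketched against the traceback above (everything outside the `SITE_CONF` lookup is an assumption):

```python
# Sketch of the guard: skip articles whose source id NewsAPI doesn't
# recognize, instead of letting SITE_CONF[...] raise KeyError: None.
def _process(self, query, soup):
    source_id = query['source']['id']   # None for sources NewsAPI can't identify
    if source_id is None or source_id not in SITE_CONF:
        return None                     # assumed: caller skips this article
    config = SITE_CONF[source_id]
    # ... existing per-site parsing continues unchanged ...
```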
@IanQS
Coin Telegraph really shouldn't be working at all, since it is not part of NewsAPI. It would only work if NewsAPI somehow returned a Coin Telegraph URL during the query. It's strange that NewsAPI seems capable of returning results for a source it does not support. We can fix this by querying NewsAPI's sources endpoint to check whether it supports a site; if a site is not supported, we can instantiate the non-API scraper (when we implement it).
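Something like this is what I have in mind; the endpoint and response shape are from NewsAPI's docs, but the helper name and key handling are just illustrative:

```python
import requests

def newsapi_supported_ids(api_key):
    """Return the set of source ids NewsAPI claims to support."""
    resp = requests.get("https://newsapi.org/v2/sources",
                        params={"apiKey": api_key})
    resp.raise_for_status()
    return {src["id"] for src in resp.json()["sources"]}

# Hypothetical dispatch once the non-API scraper exists:
# supported = newsapi_supported_ids(NEWS_API_KEY)
# scraper = NewsApiScraper(name) if name in supported else NonApiScraper(name)
```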
Sure, I can handle this. I will store URLs in their original form in seen_data.txt, and npz files will have '/' characters mapped to '%2F', since that's the standard percent-encoding of '/' (RFC 3986).
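A quick sketch of the mapping with a made-up URL, using the standard library:

```python
from urllib.parse import quote, unquote

url = "https://www.example.com/some/article"  # made-up URL for illustration
fname = quote(url, safe='') + ".npz"          # '/' -> '%2F' (':' becomes '%3A' too)
original = unquote(fname[:-len(".npz")])      # round-trips back to the URL
assert original == url
```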
> We can fix this by querying NewsAPI's sources endpoint to check whether it supports a site
Yeah, I thought to do that, but to some extent it makes more sense to just iterate through the config and filter based on whether each source is NewsAPI-compatible, instead of querying the site, iterating through all of that, and checking compatibility. Of course, to generate the config in the first place we'd need to query the site and manually check, but ¯\_(ツ)_/¯ we'll make it a future-"us" problem. Something like the sketch below.
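Sketch of the filtering I mean (the `newsapi` flag and the import path are assumptions about site_configs.py, not its actual schema):

```python
from kraken_scraper.news_data.site_configs import SITE_CONF  # assumed path

# Partition sources by an assumed per-site compatibility flag.
api_sources = [s for s, conf in SITE_CONF.items() if conf.get("newsapi", False)]
other_sources = [s for s in SITE_CONF if s not in api_sources]
```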
@IanQS
Test the scraper now. Set the `limited` variable in launch_scrapers.py to True/False so that the scraper either halts when it sees a URL already in seen_data.txt, or continues running regardless.
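Behaviour-wise the toggle works like this (illustrative sketch; every name besides `limited` is assumed):

```python
limited = True        # True: halt on the first already-seen URL
                      # False: keep running regardless

seen_urls = set()     # assumed: loaded from seen_data.txt
fetched_urls = []     # assumed: URLs returned by a NewsAPI query

for url in fetched_urls:
    if url in seen_urls:
        if limited:
            break                       # halt: caught up to old articles
        continue                        # unlimited: skip it, keep scraping
    print("would scrape", url)          # stand-in for the real scrape call
```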
Also note the change in how npz files are saved now, i.e. '/' characters are now mapped to '%2F'.
The scraper in its current iteration might actually be "done"; all that's left to do now is add more configs to site_configs.py for a large-scale test run. Everything else on top of it would be enhancements we can make at a later date. #27
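For whoever adds configs, a new entry might look roughly like this (field names are guesses, not the real schema; only the source-id keying is implied by the traceback above):

```python
# Hypothetical SITE_CONF entry, keyed by NewsAPI source id.
SITE_CONF = {
    "techcrunch": {
        "newsapi": True,                             # supported by NewsAPI
        "body_tag": "div",                           # tag holding article text
        "body_attrs": {"class": "article-content"},  # attrs for BeautifulSoup
    },
}
```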
After Bryan makes the changes, I want @wrhm to review it and raise any questions. This will be a good "smell test" for our code, to see if anything seems fishy to an "outsider". The cleaner and easier to understand our code is, the better.
> I want @wrhm to review it and raise any questions
@IanQS does "it" here refer to the full PR? Can do. 🙂
High-level inherited interface for news writing (rough sketch after the list below).
[x] Return type checking
[x] Read through ATBS to get a better idea of what abstractions should be present
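One shape the interface could take (purely illustrative; class and method names are assumptions, not the repo's actual API):

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class NewsScraper(ABC):
    """Assumed base class: per-site scrapers inherit and implement fetching."""

    @abstractmethod
    def get_articles(self) -> List[Dict[str, str]]:
        """Fetch and return parsed articles; return type checked by callers."""

    def write(self, articles: List[Dict[str, str]]) -> None:
        # Shared persistence lives in the base class so subclasses only parse.
        for article in articles:
            self._save(article)

    @abstractmethod
    def _save(self, article: Dict[str, str]) -> None:
        """Persist one article (e.g. as an npz file)."""
```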