Open kwuite opened 6 years ago
@gaojiuli, I see we cannot install this project from PyPI because that version is flawed. Can I expect this project to accept PRs, or should I continue with my own fork?
For people reading this message, here is my fork of the project
Changes
cache_disabled_urls
In case you are scraping a feed or blog daily and are only interested in the new urls, just set cache_enabled=True and run the Docker instance mentioned in the README.md file.

test
A flag so you can test a single parsed page, to improve your css selectors or to figure out issues in your code. Together with the cache feature this makes for a very quick and reliable trial-and-error cycle. Increase the number of tests to be run with the max_requests setting.
limit_requests
With this flag you can adjust the max_requests value to actually limit the maximum number of external requests.

Extraction
With the new Css class the following can be done:
In case you need to clean up data or extract things like a phone number or email address, the following manipulate options are available:
The order in which you supply manipulate options is the order of execution, so you can combine these manipulations.
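As an illustration of that ordering, here is a plain-Python sketch with hypothetical helper names (strip_tags, find_email are stand-ins, not the fork's actual manipulate API):

```python
import re

# Hypothetical helpers standing in for manipulate options;
# the names are illustrative, not the fork's actual API.
def strip_tags(value):
    """Remove html tags so only the text content remains."""
    return re.sub(r'<[^>]+>', '', value)

def find_email(value):
    """Extract the first email address, or None if absent."""
    match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', value)
    return match.group(0) if match else None

raw = '<p>Contact: info@example.com</p>'
# Options run in the order supplied: strip the tags first,
# then extract the email from the cleaned text.
email = find_email(strip_tags(raw))  # 'info@example.com'
```

Applied in the other order, find_email would run against the raw html first, which is exactly why the supplied order matters.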
And many more will be added in the future. I have written tests for all features, so take a look at this file if you are interested.
With the current version on my dev branch you can:
I hope @gaojiuli is interested in the way I have moved forward with this project, so we can merge our code once I am satisfied with a production version. I kept the philosophy of creating a scraper for everyone, and with that in mind I changed the way we extract data.
Accept pr.
@gaojiuli, great news, happy that I can share my code.
Before the PR, I have to do the following:
For the item.py I have a question regarding this code:
if hasattr(self, 'save_url'):
    self.url = getattr(self, 'save_url')
else:
    self.url = "file:///tmp.data"
Are you using this code, or is it dead code that can be removed?
(Welcome any kind of optimization)
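If the attribute is still needed, one possible simplification (a sketch; it assumes the same fallback URL is wanted) is to use getattr's default argument:

```python
class Item:
    def __init__(self):
        # getattr's third argument is the fallback value, replacing
        # the hasattr/else branch with a single line:
        self.url = getattr(self, 'save_url', 'file:///tmp.data')

# Hypothetical subclass, only to demonstrate both branches:
class SavedItem(Item):
    save_url = 'http://example.com/data'

print(Item().url)       # file:///tmp.data
print(SavedItem().url)  # http://example.com/data
```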
re.findall issue
I reviewed the tests in this project after experiencing issues where my regex also caught some html in the process.
So I looked at this test file: https://github.com/gaojiuli/gain/blob/master/tests/test_parse_multiple_items.py and captured the response of abstract_url.py
Version 0.1.4 of this project returns this as the response:
re.findall
returns what is requested by your regex (the contents of the capture groups), not what is actually matched!

Test incorrect
The base url http://quotes.toscrape.com/ and http://quotes.toscrape.com/page/1 are the same page, and if you look into the html you will only find a reference to "/page/2", not to "/page/1". For this reason the test seemed to work, but it was actually flawed from the start.
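The difference between the two calls can be shown with a minimal snippet (a simplified rule pattern, not the project's exact code):

```python
import re

rule = r'(page/\d+)'
url = 'http://quotes.toscrape.com/page/2/'

# re.findall scans anywhere in the string and returns the capture
# group's text, so stray html content also gets "found":
print(re.findall(rule, url))                        # ['page/2']
print(re.findall(rule, '<a href="/page/2">x</a>'))  # ['page/2']

# re.match only succeeds when the pattern matches at the START of
# the string, so a url rule must describe the whole url:
print(re.match(rule, url))                                           # None
print(bool(re.match(r'http://quotes.toscrape.com/page/\d+', url)))   # True
```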
re.match
I rewrote function abstract_url to:
and now this is the result of abstract_url:
The test tests/test_parse_multiple_items.py now fails, as it should.