D4Vinci / Scrapling

Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
BSD 3-Clause "New" or "Revised" License

Releasing version 0.2 (BIG CHANGES) #3

Closed D4Vinci closed 2 weeks ago

D4Vinci commented 2 weeks ago

What's changed

New features

  1. Introduced the Fetchers feature: 3 new main fetcher classes that let Scrapling fetch pages for you, with a LOT of options!
    • The Fetcher class for basic HTTP requests
    • The StealthyFetcher class, a completely stealthy fetcher that uses a modified, stealth-oriented version of Firefox.
    • The PlayWrightFetcher class for browser-based requests through vanilla PlayWright, PlayWright with a stealth mode made by me, real browsers over CDP, or NSTBrowser's docker browserless!
  2. Added the completely new find_all/find methods to find elements easily on the page with dark magic!
  3. Added the methods filter and search to the Adaptors class for easier bulk operations on Adaptor object groups.
  4. Added the css_first and xpath_first methods for easier usage.
  5. Added the new TextHandlers class, which supports bulk operations on TextHandler objects the same way the Adaptors class does for Adaptor objects.
  6. Added generate_full_css_selector and generate_full_xpath_selector methods.
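
To give a feel for what group-level filter and search helpers look like, here is a minimal, self-contained sketch. It is a toy model, not Scrapling's code: the ElementGroup class, the dict-based elements, and the exact semantics (filter returns a new group with every element that passes a predicate, search returns the first one or None) are assumptions made for illustration only.

```python
from typing import Any, Callable, Optional


class ElementGroup(list):
    """Hypothetical stand-in for a group of parsed elements (not Scrapling's Adaptors)."""

    def filter(self, func: Callable[[Any], bool]) -> "ElementGroup":
        # Assumed semantics: keep every element for which func returns True.
        return ElementGroup(item for item in self if func(item))

    def search(self, func: Callable[[Any], bool]) -> Optional[Any]:
        # Assumed semantics: return the first element for which func returns True.
        for item in self:
            if func(item):
                return item
        return None


# Plain dicts play the role of elements in this sketch.
links = ElementGroup([
    {"tag": "a", "text": "Home", "href": "/"},
    {"tag": "a", "text": "Docs", "href": "/docs"},
    {"tag": "a", "text": "Blog", "href": "https://example.com/blog"},
])

# Bulk operation: all internal links, returned as another group.
internal = links.filter(lambda el: el["href"].startswith("/"))

# Single lookup: the first external link, or None if there is none.
first_external = links.search(lambda el: el["href"].startswith("http"))
```

Returning the same group type from filter keeps calls chainable, which is the usual reason for wrapping a list of elements in its own class.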

Bugs Squashed

  1. The Adaptors version of re_first now returns the first match found across all the Adaptor objects it contains, instead of the faulty logic of returning the re_first result of every Adaptor object.
  2. Now, if you select text-type content from the selected elements (for example with the css ::text function) through any method like .css or .xpath, the Adaptor object returns a TextHandlers object instead of a plain list of strings like before. So you can now write page.css('something::text').re_first(r'regex_pattern').json() instead of page.css('something::text')[0].re_first(r'regex_pattern').json()
  3. The re/re_first arguments of Adaptor/Adaptors are now consistent with the TextHandler ones, so you get the clean_match and case_sensitive arguments there too.
  4. The auto_match argument is now enabled by default when initializing Adaptor, but you still have to enable it while selecting elements if you want to use it there. (Not a bug but a design decision)
  5. Many type-annotation corrections here and there for a better auto-completion experience while you are coding with Scrapling.
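
The first two fixes are easier to see side by side. The sketch below is a toy model built on the standard library only; the TextHandler and TextHandlers classes here are simplified stand-ins that borrow the names from these notes, and old_re_first is a hypothetical helper that mimics the previous behaviour as described above. None of it is Scrapling's actual implementation.

```python
import json
import re
from typing import Any, List, Optional


class TextHandler(str):
    """Toy stand-in: a string that can run a regex on itself and parse itself as JSON."""

    def re_first(self, pattern: str) -> Optional["TextHandler"]:
        match = re.search(pattern, self)
        return TextHandler(match.group(0)) if match else None

    def json(self) -> Any:
        return json.loads(self)


class TextHandlers(list):
    """Toy stand-in for a group of TextHandler objects."""

    def re_first(self, pattern: str) -> Optional[TextHandler]:
        # Behaviour described in the notes: first match across the whole group.
        for text in self:
            found = text.re_first(pattern)
            if found is not None:
                return found
        return None


def old_re_first(texts: TextHandlers, pattern: str) -> List[Optional[TextHandler]]:
    # Hypothetical model of the old, faulty logic: one result per element.
    return [text.re_first(pattern) for text in texts]


texts = TextHandlers([
    TextHandler("no data here"),
    TextHandler('var data = {"id": 1};'),
    TextHandler('var data = {"id": 2};'),
])
pattern = r"\{.*\}"

before = old_re_first(texts, pattern)        # a list, one entry per element
after = texts.re_first(pattern)              # a single match
parsed = texts.re_first(pattern).json()      # chaining works without indexing [0]
```

Because the group itself exposes re_first, the caller no longer needs to index into the result before chaining further calls, which is the shortening shown in item 2.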

Quality of life changes

  1. Renamed the css_selector and xpath_selector methods to generate_css_selector and generate_xpath_selector for clarity and so they don't get in the way of auto-completion while coding.
  2. Restructured most of the old code into a core subpackage, along with other design decisions, for cleaner and easier maintenance in the future.
  3. Restructured the tests folder into a cleaner structure and added tests for the new features. Also, tox environments are now cached on GitHub for faster automated tests with each commit.