crwlrsoft / crawler

Library for Rapid (Web) Crawler and Scraper Development
https://www.crwlr.software/packages/crawler
MIT License

keep() instead of addToResult() and sub crawlers #142

Closed: otsch closed 1 month ago

otsch commented 4 months ago

New methods Step::keep(), Step::keepAs(), Step::keepFromInput() and Step::keepInputAs() as simpler alternatives to Step::addToResult(), Step::addLaterToResult() and Step::keepInputData(), which are all deprecated now. The new keep methods add data to a keep array in IO objects. Because no Result object has to be created and potentially shared among many child outputs, the new keep functionality is less complex, and something like addLaterToResult() is no longer needed. Kept properties can also be used with useInputKey(), which is pretty handy.
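
A minimal usage sketch of the new keep methods. The URLs, CSS selectors and property names are made up for illustration, and calling keep() with a single key or with no argument is an assumption about its signature that is not spelled out above. Step and class names other than keep() come from the library as I know it, so treat this as a sketch and not as a definitive example.

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/authors')   // hypothetical listing page
    ->addStep(Http::get())
    ->addStep(
        Html::each('.author')                    // hypothetical selector
            ->extract([
                'name' => 'h2',
                'url' => Dom::cssSelector('a')->link(),
            ])
            ->keep('name')                       // assumption: keep only "name" from this output
    )
    ->addStep(Http::get()->useInputKey('url'))   // load each author detail page
    ->addStep(
        Html::root()
            ->extract(['bio' => '.bio'])         // hypothetical selector
            ->keep()                             // assumption: no argument keeps the whole output
    );

foreach ($crawler->run() as $result) {
    // Each result should combine the kept "name" with the "bio" from the detail page.
    var_dump($result->toArray());
}
```

Compared to addToResult(), nothing here has to know in advance which step is the last one to contribute data; each step keeps what it wants to carry forward.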

Another cool new feature is sub crawlers. Any step can now create a sub crawler to fill a property. Example: you have a page about an author with multiple links to detail pages about their books. You can select those links and let a sub crawler fill the author's books property with data from the book detail pages.
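
The author and books example could look roughly like the following sketch. The method name subCrawlerFor() and its closure signature are not named in this description; they are an assumption based on my recollection of the library and should be checked against its documentation. The URL, selectors and property names are hypothetical.

```php
<?php

require 'vendor/autoload.php';

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/authors/jane-doe')  // hypothetical author page
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([
                'name' => 'h1',
                // Links to the book detail pages (hypothetical selector).
                'books' => Dom::cssSelector('#books a')->link(),
            ])
            // Assumption: the sub crawler receives the "books" URLs as inputs
            // and its results replace the value of the "books" property.
            ->subCrawlerFor('books', function (Crawler $crawler) {
                return $crawler
                    ->addStep(Http::get())
                    ->addStep(
                        Html::root()->extract([
                            'title' => 'h1',
                            'year' => '.year',
                        ])
                    );
            })
    );

foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```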

Further, this also introduces a new Step::outputType() method that returns whether a certain step yields outputs that are associative arrays (or objects), scalar values, or potentially both (mixed). This helps reduce potentially critical problems during a crawler run by validating before the run and throwing an exception (or logging error messages).
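
To illustrate the idea of validating before the run, here is a self-contained toy sketch. It does not use the library: the OutputKind enum, the validateKeep() function and the exact rule (named keys on scalar output fail, mixed output only warns) are hypothetical stand-ins, not the library's actual implementation.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the kind of value a step's outputType() could return.
enum OutputKind
{
    case Scalar;
    case AssociativeArrayOrObject;
    case Mixed;
}

/**
 * Checks a step's keep configuration before any request is made.
 *
 * @param string[] $keepKeys named output keys the step is asked to keep
 * @return string|null a warning to log, or null when there is nothing to report
 * @throws InvalidArgumentException when the configuration can never work
 */
function validateKeep(OutputKind $kind, array $keepKeys): ?string
{
    if ($keepKeys === []) {
        return null; // nothing to validate
    }

    return match ($kind) {
        // Scalar outputs have no keys, so picking keys is certainly a mistake.
        OutputKind::Scalar => throw new InvalidArgumentException(
            'Step yields scalar outputs, named keys cannot be kept.'
        ),
        // With mixed outputs it may or may not work, so only warn.
        OutputKind::Mixed => 'Step may yield scalar outputs, keeping named keys can fail.',
        OutputKind::AssociativeArrayOrObject => null,
    };
}
```

The point of the design is that a misconfigured step is reported immediately, instead of surfacing only after a long crawl has already loaded many pages.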