Circumvent results
The useFilters option is currently implemented using the following logic (assuming useFilters === true):
New start pages are enqueued in the handleListPage function. Pagination pages are enqueued only if the total number of results for the current start URL is <= 1000; if it is > 1000, filtered pages are enqueued instead.
New filters are detected from unchecked checkboxes and mapped onto the current URL as new query parameters.
In each enqueuing phase triggered in handleListPage, the unchecked filters are iterated and each filter is interpreted as a query parameter name. If a filter has multiple value choices, all values are iterated and a new URL is enqueued for each value.
Before a new filtered page is enqueued, it is checked for duplicates. All URLs enqueued via filters are stored in the state object, and each newly built URL is compared against them. If a stored URL has exactly the same query parameter names, the new URL is not enqueued. Parameter values do not have to match in this comparison, because all values of a given parameter are processed within a single filter enqueuing phase.
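The logic above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names (paramNamesKey, buildFilteredUrls, enqueueFilters, shouldUseFilters), the shape of the filter objects and the state.enqueuedKeys set are all assumptions made for the example. For brevity it stores a key built from parameter names instead of the full URLs.

```javascript
// Hypothetical sketch of the filter enqueuing logic; all names are illustrative.
const MAX_PAGINATED_RESULTS = 1000;

// Pagination is used up to the limit, filters above it.
function shouldUseFilters(totalResults) {
    return totalResults > MAX_PAGINATED_RESULTS;
}

// Build a key from the sorted query parameter names only, so values are ignored.
function paramNamesKey(url) {
    const names = [...new URL(url).searchParams.keys()];
    return [...new Set(names)].sort().join('&');
}

// Build one URL per value of a single unchecked filter.
function buildFilteredUrls(currentUrl, filter) {
    return filter.values.map((value) => {
        const url = new URL(currentUrl);
        url.searchParams.set(filter.name, value);
        return url.toString();
    });
}

// Return the filtered URLs that should be enqueued and remember their keys in state.
function enqueueFilters(currentUrl, uncheckedFilters, state) {
    const toEnqueue = [];
    for (const filter of uncheckedFilters) {
        const urls = buildFilteredUrls(currentUrl, filter);
        if (urls.length === 0) continue;
        // All values of one filter share the same parameter names,
        // so a single check covers the whole phase for this filter.
        const key = paramNamesKey(urls[0]);
        if (state.enqueuedKeys.has(key)) continue;
        state.enqueuedKeys.add(key);
        toEnqueue.push(...urls);
    }
    return toEnqueue;
}
```

Calling enqueueFilters a second time with the same URL and filters returns an empty array, because the combination of parameter names is already recorded in the state.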
Room info
Extraction of room info from the detail page was added for the case where the checkIn and checkOut input attributes are unset. Booking.com doesn't show room features directly inside the rooms table unless checkIn and checkOut are set, so they cannot be scraped effectively. I tried to expand the room info using page.click('.room-info [href]') combined with page.waitForSelector('.hprt-facilities-facility') (and a few other options), but the overhead was too big and a lot of timeouts were triggered. I added the room info URL to the output so it can be inspected if needed.
Output schema
Output properties were updated for the case where the checkIn and checkOut input attributes are unset. The price, currency and persons properties were excluded because null was stored for each of them, which made the resulting dataset unnecessarily large.
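A minimal sketch of this exclusion, assuming a hypothetical buildOutput helper and a plain detail object; only the price, currency, persons, checkIn and checkOut names come from the description, everything else is illustrative:

```javascript
// Hypothetical sketch: drop price, currency and persons when no dates are set.
function buildOutput(detail, input) {
    const hasDates = Boolean(input.checkIn && input.checkOut);
    if (hasDates) return { ...detail };
    // Without dates these three fields would only ever hold null, so omit them.
    const { price, currency, persons, ...rest } = detail;
    return rest;
}
```

With dates set, the object is returned unchanged; without them, the three properties are left out entirely rather than stored as null.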
Resolves #44, resolves #34