dtrungtin / actor-booking-scraper

Actor for extracting data about hotels from Booking.com
https://apify.com/dtrungtin/booking-scraper
Apache License 2.0

Fix/scraped details #51

Closed: lhotanok closed 2 years ago

lhotanok commented 2 years ago

Resolves #44, resolves #34

Circumvent results

The useFilters option is currently implemented using the following logic (assuming useFilters === true):

Room info

Extraction of room info from the detail page was added for the case where the checkIn and checkOut input attributes are unset. Booking.com doesn't show room features directly inside the rooms table unless checkIn and checkOut are set, so they cannot be scraped effectively. I tried to expand the room info using page.click('.room-info [href]') combined with page.waitForSelector('.hprt-facilities-facility') (and a few other options), but the overhead was too big and a lot of timeouts were triggered. I added the room info URL to the output so it can be inspected if needed.
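For reference, the expand-and-wait attempt described above might have looked roughly like the sketch below. This is not the code from the PR: the helper name tryExpandRoomInfo and the timeoutMs parameter are hypothetical, and page is assumed to be a Puppeteer Page. Only the two selectors come from the description.

```javascript
// Hypothetical helper, not part of the actor: tries to expand a room's
// facility list on a detail page that was opened without checkIn/checkOut.
// `page` is assumed to be a Puppeteer Page; the selectors are taken from
// the PR description, the timeout value is an assumption.
async function tryExpandRoomInfo(page, timeoutMs = 2000) {
    try {
        // Open the room info element for the room.
        await page.click('.room-info [href]');
        // Wait until at least one facility item is present in the DOM.
        await page.waitForSelector('.hprt-facilities-facility', { timeout: timeoutMs });
        return true;
    } catch (err) {
        // Missing element or timeout: report failure so the caller can
        // fall back to storing only the room info URL.
        return false;
    }
}
```

Returning false instead of throwing keeps a slow or missing facility list from failing the whole detail page, but every failed attempt still costs up to timeoutMs, which matches the overhead described above.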

Example room info:

{
      "url": "https://www.booking.com/hotel/us/zaza-dallas.cs.html?aid=304142;label=gen173nr-1FCAso7AFCC3phemEtZGFsbGFzSDNYBGhniAEBmAEFuAEYyAEM2AEB6AEB-AEGiAIBqAIEuAKYq_eNBsACAdICJGEyOWQzZmMwLTdmOTAtNDcxMS1iMTFiLTQyN2I0YjIxNjZiYdgCBeACAQ;sid=e3e5ca388d08ffa3d72b88094262cc35;dist=0&group_adults=2&group_children=0&hapos=22&hpos=22&keep_landing=1&nflt=review_score%3D84%3Bprice%3DUSD-150-200-1&no_rooms=1&req_adults=2&req_children=0&sb_price_type=total&sr_order=popularity&srepoch=1639830929&srpvid=da7758884dba0107&type=total&ucfs=1&#room_103228202",
      "roomType": "Deluxe Parlor Double",
      "bedType": "2 manželské postele",
      "persons": 2
}

Output schema

Output properties were updated for unset checkIn and checkOut input attributes. The price, currency and persons properties were excluded because null was stored for each of them, which made the resulting dataset unnecessarily large.
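A minimal sketch of that exclusion, assuming the scraped detail is a plain object and the actor input exposes checkIn and checkOut. The names buildOutputItem, detail, input and DATE_DEPENDENT_FIELDS are hypothetical and only illustrate the idea; the actual implementation in the PR may differ.

```javascript
// Properties that are only meaningful when both dates are set
// (taken from the PR description).
const DATE_DEPENDENT_FIELDS = ['price', 'currency', 'persons'];

// Hypothetical helper: returns a copy of `detail` without the
// date-dependent properties when checkIn or checkOut is unset.
function buildOutputItem(detail, input) {
    const hasDates = Boolean(input.checkIn && input.checkOut);
    if (hasDates) return { ...detail };

    const item = {};
    for (const [key, value] of Object.entries(detail)) {
        if (!DATE_DEPENDENT_FIELDS.includes(key)) item[key] = value;
    }
    return item;
}
```

Dropping the keys instead of storing null keeps each dataset item smaller, at the cost of consumers having to handle missing properties.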

metalwarrior665 commented 2 years ago

This is a lot of code changes :) I'm not going to do a deep review, but it generally looks good. I will merge it so we can start testing in beta.