adriankumpf / findmeaflat

Get notified of new listings on popular German real estate portals.

Howoge crawler? #50

Closed by pcace 1 year ago

pcace commented 2 years ago

Hi there, I am just getting started with JavaScript and wanted to understand how the crawler works, so I tried to write a Howoge crawler. I am not really sure how to get it going, though. Maybe you could help me with that?

Here is an example link:

https://www.howoge.de/wohnungen-gewerbe/wohnungssuche.html?tx_howsite_json_list%5Bpage%5D=1&tx_howsite_json_list%5Blimit%5D=12&tx_howsite_json_list%5Blang%5D=&tx_howsite_json_list%5Bkiez%5D%5B%5D=Marzahn&tx_howsite_json_list%5Bkiez%5D%5B%5D=99&tx_howsite_json_list%5Bkiez%5D%5B%5D=Buch&tx_howsite_json_list%5Bkiez%5D%5B%5D=Alt-Hohensch%C3%B6nhausen&tx_howsite_json_list%5Bkiez%5D%5B%5D=Neu-Hohensch%C3%B6nhausen&tx_howsite_json_list%5Bkiez%5D%5B%5D=Fennpfuhl&tx_howsite_json_list%5Bkiez%5D%5B%5D=Alt-Lichtenberg&tx_howsite_json_list%5Bkiez%5D%5B%5D=Friedrichsfelde&tx_howsite_json_list%5Bkiez%5D%5B%5D=Karlshorst&tx_howsite_json_list%5Bkiez%5D%5B%5D=Treptow-K%C3%B6penick&tx_howsite_json_list%5Bkiez%5D%5B%5D=Pankow&tx_howsite_json_list%5Brent%5D=900&tx_howsite_json_list%5Barea%5D=70&tx_howsite_json_list%5Brooms%5D=2&tx_howsite_json_list%5Bwbs%5D=all-offers

I then created a howoge.js in the sources folder with this howoge object:

const howoge = {
  name: 'howoge',
  enabled,
  url: !enabled || config.providers.howoge.url,
  crawlContainer: '#immoobject-list',
  crawlFields: {
    id: '.aditem@data-adid | int',
    price: '.div:nth-child(1)  div  div.content  div.row  div:nth-child(1) div div:nth-child(1) div.attributes-content.color-secondary  | removeNewline | trim',
    size: '.div:nth-child(1) div div.content div.row div:nth-child(1) div div:nth-child(2) div.attributes-content | removeNewline | trim',
    title: '.div:nth-child(1) div div.content div.notice | removeNewline | trim',
    link: '.div:nth-child(1) div div.content div.address a@href | removeNewline | trim',
    description: '.div:nth-child(1) div div.content div.notice | removeNewline | trim',
    address: '.div:nth-child(2) div div.content div.address a | removeNewline | trim',
    rooms: '.div:nth-child(2) div div.content div.row div:nth-child(1) div div.wrap-xs.d-md-none div div.attributes-content | removeNewline | trim',
  },
  paginate: 'div:nth-child(6) div div div:nth-child(2) div:nth-child(3) div ul li.pagination--page-next a@href',
  normalize: normalize,
  filter: applyBlacklist,
}

I don't really know how to find the correct selectors for the properties. My approach was to use "Copy selector" in Chrome:

[Image: Screenshot 2021-12-27 at 22:05:46]

So, for example, I came up with this selector in Chrome for 'price':

#immoobject-list > div:nth-child(4) > div > div.content > div.row > div:nth-child(1) > div > div:nth-child(1) > div.attributes-content.color-secondary

and converted it to the format you use in the other crawlers:

'.div:nth-child(1) div div.content div.row div:nth-child(1) div div:nth-child(1) div.attributes-content.color-secondary | removeNewline | trim',

So here is what I don't understand:

Thank you so much in advance for your help, and sorry for asking such dumb questions. I am just starting to learn JavaScript...

Cheers

adriankumpf commented 2 years ago

Hi @pcace,

having a crawler for Howoge would be pretty cool!

Getting the CSS selectors requires a bit of manual work. Copying the selector is usually a good starting point. In general, use as many classes or IDs as you can and, where possible, don't depend on the position of an element within its parent (for example `:nth-child(...)`), since that breaks as soon as the page layout shifts.

In the example you gave, getting the price of a listing could look like this:

#immoobject-list .content .row .attributes .attributes-content.color-secondary
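One way to get from Chrome's copied selector to something close to the line above is to drop the positional parts mechanically. This is only a rough sketch: `simplifySelector` is a hypothetical helper, not part of findmeaflat or x-ray, and its output still needs checking by hand (it does not add the `.attributes` class, for instance).

```javascript
// Hypothetical helper: turns a Chrome "Copy selector" string into a
// selector that relies only on IDs and classes.
function simplifySelector(copied) {
  return copied
    .split('>')
    // Drop positional pseudo-classes such as :nth-child(4).
    .map((part) => part.trim().replace(/:nth-child\(\d+\)/g, ''))
    // Keep only the steps that carry an ID or a class.
    .filter((part) => part.includes('#') || part.includes('.'))
    // Strip the leading tag name, e.g. "div.content" -> ".content".
    .map((part) => part.replace(/^[a-zA-Z][a-zA-Z0-9]*(?=[#.])/, ''))
    .join(' ')
}

const copied =
  '#immoobject-list > div:nth-child(4) > div > div.content > div.row > ' +
  'div:nth-child(1) > div > div:nth-child(1) > div.attributes-content.color-secondary'

const simplified = simplifySelector(copied)
console.log(simplified)
// #immoobject-list .content .row .attributes-content.color-secondary
```

Because the child combinator `>` is replaced by the looser descendant combinator, the simplified selector may match more elements than the copied one, so it is worth checking the result in the browser console.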

I like using the browser console to experiment and figure out the best selector:

> document.querySelector("#immoobject-list .content .row .attributes .attributes-content.color-secondary").textContent
"
                                    691,60 €
                                "

The scraper is based on x-ray. It uses | to define filters: https://github.com/matthewmueller/x-ray#filters

removeNewline and trim are custom filters that are defined here: https://github.com/adriankumpf/findmeaflat/blob/master/lib/scraper.js#L5
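To illustrate why both filters are needed for the whitespace-padded console output above, here is a minimal stand-alone sketch. The two filter bodies are assumptions about what such filters typically do (the real ones in lib/scraper.js may differ), and `applyFilters` is a hypothetical helper that only mimics how a `a | b` chain reads left to right; it is not how x-ray is implemented.

```javascript
// Assumed filter bodies; the project's actual definitions may differ.
const filters = {
  removeNewline: (value) =>
    typeof value === 'string' ? value.replace(/\n/g, ' ') : value,
  trim: (value) => (typeof value === 'string' ? value.trim() : value),
}

// Hypothetical helper: applies a "name | name" chain from left to right.
function applyFilters(value, chain) {
  return chain
    .split('|')
    .map((name) => name.trim())
    .filter(Boolean)
    .reduce((acc, name) => filters[name](acc), value)
}

// The raw textContent as seen in the browser console.
const raw = '\n                                    691,60 €\n                                '
const price = applyFilters(raw, 'removeNewline | trim')
console.log(JSON.stringify(price)) // "691,60 €"
```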

Where to start debugging depends on where the problem is. Often a few console.log statements are enough to find out why, for example, the price is not read correctly.
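One possible way to place such a console.log is a pass-through filter that logs whatever it receives. This is a sketch under assumptions: `debug` is a hypothetical name, it does not exist in findmeaflat, and it would have to be registered alongside the other custom filters before it could be used in a selector string.

```javascript
// Hypothetical pass-through filter: logs the value and returns it unchanged,
// so it can sit anywhere in a filter chain without altering the result.
function debug(value, label = 'value') {
  // JSON.stringify makes stray newlines and padding visible.
  console.log(`[debug] ${label}:`, JSON.stringify(value))
  return value
}

// Stand-alone demonstration with a whitespace-padded string.
const seen = debug('\n  691,60 €\n', 'price')
```

Placed before and after `trim` in a chain, a filter like this would show whether the selector matched nothing at all or whether the text only needs cleaning up.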

I hope this helps!