flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠🤖
GNU Affero General Public License v3.0

`HTTP errors` on `abstract_crawler.py` using docker #168

Closed: y-71 closed this issue 2 years ago

y-71 commented 2 years ago

I've tried to set up a Telegram bot using Docker.

I ended up using this config file, since I just want to get it working for now and don't want to deal with filters yet:

---
# Enable verbose mode (print DEBUG log messages)
# verbose: true

# Should the bot endlessly loop through the URLs?
# Between each loop it waits for <sleeping_time> seconds.
# Note that Ebay will (temporarily) block your IP if you
# poll too often - don't lower this below 600 seconds if you
# are crawling Ebay.
loop:
    active: yes
    sleeping_time: 600

# Location of the Database to store already seen offerings
# Defaults to the current directory
#database_location: /path/to/database

# List the URLs containing your filter properties below.
# Currently supported services: www.immobilienscout24.de,
# www.immowelt.de, www.wg-gesucht.de, and www.ebay-kleinanzeigen.de.
# List the URLs in the following format:
# urls:
urls:
  - https://www.immobilienscout24.de/Suche/...
  - https://www.immowelt.de/...
  - https://www.ebay-kleinanzeigen.de/...
  - https://www.wg-gesucht.de/...

# Define filters to exclude flats that don't meet your criteria.
# Supported filters include 'max_rooms', 'min_rooms', 'max_size', 'min_size',
#   'max_price', 'min_price', and 'excluded_titles'.
#
# 'excluded_titles' takes a list of regex patterns that match against
# the title of the flat. Any matching titles will be excluded.
# More to Python regex here: https://docs.python.org/3/library/re.html
#
# Example:
# filters:
#   excluded_titles:
#     - "wg"
#     - "zwischenmiete"
#   min_price: 700
#   max_price: 1000
#   min_size: 50
#   max_size: 80
#   max_price_per_square: 1000

# There are often city districts in the address which
# Google Maps does not like. Use this blacklist to remove
# districts from the search.
# blacklist:
#   - Innenstadt

# If an expose includes an address, the bot is capable of
# displaying the distance and time to travel (duration) to
# some configured other addresses, for specific kinds of
# travel.
#  
# Available kinds of travel ('gm_id') can be found in the
# Google Maps API documentation, but basically there are:
#   - "bicycling"
#   - "transit" (public transport)
#   - "driving"
#   - "walking"
# 
# The example configuration below includes a place for
# "John", located at the main train station of Munich.
# Two kinds of travel (bicycle and transit) are requested,
# each with a different label. Furthermore a place for
# "Jane" is included, located at the given destination and
# with the same kinds of travel.
# durations:
#     - name: John
#       destination: Hauptbahnhof, München
#       modes: 
#           - gm_id: transit
#             title: "Öff."
#           - gm_id: bicycling
#             title: "Rad"
#     - name: Jane
#       destination: Karlsplatz, München
#       modes: 
#           - gm_id: transit
#             title: "Öff."
#           - gm_id: driving
#             title: "Auto"

# Multiline message (yes, the | is supposed to be there), 
# to format the message received from the Telegram bot. 
# 
# Available placeholders:
#   - {title}: The title of the expose
#   - {rooms}: Number of rooms
#   - {price}: Price for the flat
#   - {durations}: Durations calculated by GMaps, see above
#   - {url}: URL to the expose
message: |
    {title}
    Zimmer: {rooms}
    Größe: {size}
    Preis: {price}
    Ort: {address}

    {url}

# Calculating durations requires access to the Google Maps API. 
# Below you can configure the URL to access the API, with placeholders.
# The URL should most probably just be kept like that. 
# To use the Google Maps API, an API key is required. You can obtain one
# without costs from the Google App Console (just google for it).
# Additionally, to enable the API calls in the code, set the 'enable' key to True
google_maps_api:
    key: YOUR_API_KEY
    url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
    enable: False

# If you are planning to scrape immoscout24.de, the bot will need 
# to circumvent the site's captcha protection by using a captcha 
# solving service. Register at either imagetypers or 2captcha 
# (the former is preferred), deposit some funds, uncomment the 
# corresponding lines below and replace your API key/token.
# You will also have to install a Chrome Web Driver and enter 
# its executable path below; the driver_arguments can be left as is.
# captcha:
#       imagetypers:
#             token: alskdjaskldjfklj
#       2captcha:
#             api_key: alskdjaskldjfklj
#       driver_path: YOUR_CHROME_DRIVER_PATH
#       driver_arguments:
#         - "--headless"

# You can select whether to be notified by telegram or via a mattermost
# webhook. For all notifiers selected here a configuration must be provided
# below.
# notifiers:
#   - telegram
#   - mattermost
notifiers:
    - telegram
    # - mattermost

# Sending messages using Telegram requires a Telegram Bot configured. 
# Telegram.org offers good documentation about how to create a bot.
# Once you read it, it will make sense. Still: bot_token should hold the
# access token of your bot and receiver_ids should list the client ids
# of receivers. Note that those receivers are required to already have
# started a conversation with your bot. 
#
telegram:
  bot_token: [a set of numbers which I suppose are the bot's ID]:[a token]
  receiver_ids:
    - [ my ID ]

# Sending messages via mattermost requires a webhook url provided by a
# mattermost server. You can find a description of how to set up a webhook in
# the official mattermost documentation:
# https://docs.mattermost.com/developer/webhooks-incoming.html
# mattermost:
#   webhook_url: https://mattermost.example.com/signup_user_complete/?id=abcdef12356

# If you are running the web interface, you can configure Login with Telegram support
# Follow the instructions here to register your domain with the Telegram bot:
# https://core.telegram.org/widgets/login
#
# website:
#    bot_name: bot_name_xxx
#    domain: flathunter.example.com
#    session_key: SomeSecretValue
#    listen:
#      host: 127.0.0.1
#      port: 8080

# If you are deploying to google cloud,
# uncomment this and set it to your project id. More info in the readme.
# google_cloud_project_id: my-flathunters-project-id

# For websites like idealista.it, there are anti-crawler measures that can be
# circumvented using proxies.
# use_proxy_list: True

The output is:

18:32:36|crawl_wggesucht.py|ERROR Got response (404): 
18:32:36|abstract_crawler.py|ERROR   ]: Got response (404): 
[2022/04/29 18:42:38|abstract_crawler.py|ERROR   ]:  Got response (404): 
[2022/04/29 18:42:38|abstract_crawler.py|ERROR   ]: Got response (500): 

Every time, the output is the crawled page, and I don't receive any message on Telegram.

codders commented 2 years ago

The URLs in your config file appear to be invalid. In the urls section, you need to include links to the pages on WG-gesucht, immoscout and immowelt that you actually want to crawl.

Arthur
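
To illustrate the point above, here is a minimal sketch of how one might spot URLs that were left as placeholders from the sample config. The helper `find_placeholder_urls` is hypothetical and not part of flathunter; it only assumes that a URL containing a literal `...`, or one with no path and no query string, has not been replaced with a real search link. It does not contact any site.

```python
from urllib.parse import urlparse


def find_placeholder_urls(urls):
    """Return the URLs that still look like sample-config placeholders.

    Hypothetical helper for illustration only; not part of flathunter.
    """
    suspicious = []
    for url in urls:
        parsed = urlparse(url)
        path = parsed.path.strip("/")
        # A literal "..." or a bare domain means no search was specified.
        if "..." in url or (not path and not parsed.query):
            suspicious.append(url)
    return suspicious


# The `urls` entries from the config posted in this issue.
urls = [
    "https://www.immobilienscout24.de/Suche/...",
    "https://www.immowelt.de/...",
    "https://www.ebay-kleinanzeigen.de/...",
    "https://www.wg-gesucht.de/...",
]

for url in find_placeholder_urls(urls):
    print(f"Placeholder URL, replace with a real search link: {url}")
```

All four entries from the posted config are flagged, which is consistent with the 404 and 500 responses in the log. A real search link is usually obtained by running the search on the site in a browser and copying the resulting address into the `urls` list.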
