NikolaiT / se-scraper

Javascript scraping module based on puppeteer for many different search engines...
https://scrapeulous.com/
Apache License 2.0

Docker support #20

Closed slotix closed 4 years ago

slotix commented 5 years ago

Awesome module! Do you plan to build a "se-scraper" Docker image? Thank you.

NikolaiT commented 5 years ago

I don't have much experience with Docker, but I will look into it very soon and add such a Docker image.

slotix commented 5 years ago

I'm going to add a Docker image myself and share it with the community.
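
For reference, a Docker image for a puppeteer-based service mainly needs Chromium's shared-library dependencies installed. The following is only a sketch of what such an image could look like (the base image, package list, and `server.js` entry point are assumptions, not the image slotix later published):

```dockerfile
# Sketch only; NOT the published slotix/se-scraper image.
FROM node:10-slim

# Runtime libraries that Chromium needs on Debian-based images
RUN apt-get update && apt-get install -y \
    libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxi6 \
    libxtst6 libnss3 libcups2 libxss1 libxrandr2 libasound2 \
    libatk1.0-0 libgtk-3-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .

EXPOSE 3000
CMD ["node", "server.js"]
```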

snork-alt commented 5 years ago

I'm getting this error while trying to launch se-scraper in Docker:

UnhandledPromiseRejectionWarning: Error: Unable to launch browser for worker, error message: Failed to launch chrome!
[0514/035629.769555:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.

Are you facing the same issue ?
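
The error above comes from Chromium itself: it refuses to start its sandbox when running as root, which is the default user inside most containers. A small helper like this (a sketch for illustration, not part of se-scraper) shows the usual workaround of appending `--no-sandbox` when the process runs as root:

```javascript
// Sketch only: not part of se-scraper itself. Chromium aborts with the
// zygote_host_impl_linux error above when started as root with the
// sandbox enabled, so containerized setups usually append --no-sandbox.
function launchFlags(baseFlags, uid) {
  const flags = [...baseFlags];
  if (uid === 0 && !flags.includes('--no-sandbox')) {
    // Trades away Chrome's sandbox isolation; acceptable only when the
    // container itself is treated as the isolation boundary.
    flags.push('--no-sandbox', '--disable-setuid-sandbox');
  }
  return flags;
}

// uid 0 = root, the default inside most Docker containers:
console.log(launchFlags([], 0)); // [ '--no-sandbox', '--disable-setuid-sandbox' ]
```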

kederrac commented 5 years ago

Use this config, or the one from the commit:

    let config = {
        random_user_agent: true,
        write_meta_data: true,
        sleep_range: "",
        chrome_flags: [],
        search_engine: searchEngine,
        debug: false,
        verbose: false,
        keywords: keys,
        num_pages: num_pages,
        headless: true,
        puppeteer_cluster_config: {
            timeout: 600000,
            monitor: false,
            concurrency: 1,
            maxConcurrency: 1
        }
    };

slotix commented 5 years ago

Try adding the "--no-sandbox" flag to "chrome_flags":

curl -XPOST http://0.0.0.0:3000 -H 'Content-Type: application/json' \
-d '{
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "random_user_agent": true,
    "sleep_range": "",
    "search_engine": "baidu",
    "debug": true,
    "verbose": true,
    "keywords": [ "cat", "mouse" ],
    "keyword_file": "",
    "num_pages": 1,
    "headless": true,
    "chrome_flags": [ "--no-sandbox" ],
    "output_file": "examples/results/baidu.json",
    "block_assets": false,
    "custom_func": "",
    "proxy": "",
    "proxy_file": "",
    "test_evasion": false,
    "apply_evasion_techniques": true,
    "log_ip_address": false,
    "log_http_headers": false,
    "puppeteer_cluster_config": {
        "timeout": 600000,
        "monitor": false,
        "concurrency": 1,
        "maxConcurrency": 1
    }
}'

ghost commented 4 years ago

I pulled the image from Docker Hub, but the container's port is closed (both on the localhost IP and on the container's direct IP). An nmap scan shows the same result.

slotix commented 4 years ago

Pull the latest version from Docker Hub and run:

    docker run -it -e HOST=0.0.0.0 -e PORT=3000 -p 3000:3000 slotix/se-scraper

Try another port instead of 3000 if needed.
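
The same container can also be described declaratively. This is just a sketch of an equivalent compose file, assuming the image honors the `HOST` and `PORT` environment variables as in the `docker run` command:

```yaml
# docker-compose.yml sketch (assumption: slotix/se-scraper reads
# HOST and PORT from the environment, as in the docker run command)
version: "3"
services:
  se-scraper:
    image: slotix/se-scraper
    environment:
      HOST: 0.0.0.0
      PORT: "3000"
    ports:
      - "3000:3000"
```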

ghost commented 4 years ago

Thanks @slotix for the fast help. The new image runs and requests go through. If the HOST and PORT environment variables are necessary, you should note them in the pull request.

Thanks a lot.

slotix commented 4 years ago

@Axel-G updated pull request

tobiasmuehl commented 4 years ago

Can we close this?