PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0

How to make Prefect work with Headless Selenium? #3609

feliche93 closed this issue 3 years ago

feliche93 commented 4 years ago

Description

I want to use Prefect to automate scraping of my own social media stats, such as posts and profile views on LinkedIn. Because the site relies heavily on JavaScript and requires logging in, headless Selenium has been the easiest solution so far.

When I run the code directly from the file with flow.run(), everything works perfectly.

Registering the flow also works, but when I execute it in a local environment, the logs show the following error:

Unexpected error: TypeError("cannot pickle '_thread.lock' object")
Traceback (most recent call last):
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 881, in get_task_run_state
    result = self.result.write(value, **formatting_kwargs)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/results/local_result.py", line 116, in write
    value = self.serializer.serialize(new.value)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/serializers.py", line 70, in serialize
    return cloudpickle.dumps(value)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.lock' object

I believe the issue is that I am passing the driver object from one task to the next. However, the Selenium driver object cannot be pickled, as the logs indicate.

Is there any way to prevent the driver object from being serialized while still using the driver (an authenticated session) in other tasks? Or is there a workaround that would make this work?
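
For readers who want to reproduce the error without Selenium or Prefect, here is a minimal sketch using only the standard library. `FakeDriver` and `try_pickle` are hypothetical names introduced for illustration; the sketch assumes (based on the traceback above) that the real driver holds a thread lock somewhere in its object graph, which is what makes pickling fail.

```python
import pickle
import threading


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver: it holds a thread lock."""

    def __init__(self):
        self._lock = threading.Lock()


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the TypeError raised."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return exc
    return None


# An object that owns a lock cannot be pickled.
error = try_pickle(FakeDriver())
print(type(error).__name__, error)

# Plain data, such as a list of cookie dicts, pickles without trouble.
print(try_pickle([{"name": "session", "value": "abc"}]))
```

This suggests that anything returned from a task and written as a result needs to be plain, picklable data rather than a live browser session.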

Expected Behavior

When running the following flow and task from the UI, I would expect no errors, just as when I execute flow.run():

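from prefect import task
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager


@task
def create_driver(headless=False):
    # Configure Chrome options; the extra flags apply only in headless mode
    chrome_options = Options()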
    if headless:
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument('--disable-extensions')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument("--headless")
        chrome_options.add_argument(
            "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36")
    driver = webdriver.Chrome(
        ChromeDriverManager().install(),
        options=chrome_options
    )

    return driver

Here's my flow:

import os

from prefect import Flow, Parameter

with Flow("Linkedin Automation") as flow:

    headless = Parameter("headless", default=True)

    username = os.getenv("LINKEDIN_USERNAME")
    password = os.getenv("LINKEDIN_PASSWORD")

    driver = create_driver(headless)
    driver = login_linkedin(driver, username, password)

Environment

{
  "config_overrides": {},
  "env_vars": [
    "PREFECT__FLOWS__CHECKPOINTING"
  ],
  "system_information": {
    "platform": "macOS-10.15.6-x86_64-i386-64bit",
    "prefect_backend": "server",
    "prefect_version": "0.13.13",
    "python_version": "3.8.2"
  }
}

Very much appreciate your help!

Thanks, Felix

0xjimm commented 3 years ago

I ran into this issue the other day, and Dylan helped me out on the Slack channel.

I defined Local storage and set stored_as_script to True:

from prefect import Flow
from prefect.environments.storage import Local

with Flow("Linkedin Automation") as flow:
    ...

flow.storage = Local(path='path/to/your/flow.py', stored_as_script=True)

flow.run()

feliche93 commented 3 years ago

@lejimmy thank you so much for helping me out on that :) Works like a charm!

The only thing I noticed is that when I split the tasks into modules and import the functions, I get the same error. So for now I guess I have to keep all tasks that return driver objects in one file? Or did you by any chance face this issue too? :)

0xjimm commented 3 years ago

That’s what I’ve been doing.

Maybe saving your cookies can help you bypass some steps once you’ve already authenticated: https://stackoverflow.com/a/48665557
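
A rough sketch of the cookie idea, in case it helps: the two helpers below are hypothetical (not from Prefect or Selenium) and only rely on the driver exposing `get_cookies()` and `add_cookie()`, which Selenium's WebDriver does. Storing the cookies as JSON is an assumption here; the linked answer may use a different format.

```python
import json
from pathlib import Path


def save_cookies(driver, path):
    """Write the driver's current cookies (a list of dicts) to a JSON file."""
    Path(path).write_text(json.dumps(driver.get_cookies()))


def load_cookies(driver, path):
    """Add previously saved cookies to a driver; return how many were added.

    With a real Selenium driver, the browser should already be on the
    cookie's domain (e.g. via driver.get(...)) before add_cookie is called.
    """
    cookies = json.loads(Path(path).read_text())
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

With this approach, a task could return only the path to the cookie file (plain, picklable data), and each later task could create its own driver and call `load_cookies` instead of receiving the driver object itself.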

feliche93 commented 3 years ago

Answer here and in Slack thread: https://prefect-community.slack.com/archives/CL09KU1K7/p1603318809428700

cicdw commented 3 years ago

Archived the thread here for better discoverability: https://github.com/PrefectHQ/prefect/issues/3669