airbytehq / PyAirbyte

PyAirbyte brings the power of Airbyte to every Python developer.
https://docs.airbyte.com/pyairbyte

πŸ› Bug: Cannot install connectors on Databricks/Spark (also Render.com and Replit.com) #78

Open betizad opened 4 months ago

betizad commented 4 months ago

When I try to install airbyte and airbyte-source-linkedin-ads, I get the following error.

INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of airbyte to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install airbyte-source-linkedin-ads==0.7.0 and airbyte==0.7.2 because these package versions have conflicting dependencies.

The conflict is caused by:
    airbyte 0.7.2 depends on airbyte-cdk<0.59.0 and >=0.58.3
    airbyte-source-linkedin-ads 0.7.0 depends on airbyte-cdk==0.63.2

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

I install in Databricks using the command %pip install airbyte==0.7.2 airbyte-source-linkedin-ads==0.7.0

When I do the same on a local machine, linkedin-ads is installed in a new venv, which does not work in Databricks.
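To make the conflict concrete: no single airbyte-cdk version can satisfy both pins at once. Here is a minimal pure-Python sketch (not pip's actual resolver code) of the range check that fails:

```python
# airbyte 0.7.2 pins airbyte-cdk to >=0.58.3,<0.59.0, while
# airbyte-source-linkedin-ads 0.7.0 pins it to ==0.63.2.

def parse(version: str) -> tuple:
    """Turn '0.58.3' into (0, 58, 3) for ordered comparison."""
    return tuple(int(part) for part in version.split("."))

def in_range(version: str, low: str, high: str) -> bool:
    """True if low <= version < high (the >=low,<high pin)."""
    return parse(low) <= parse(version) < parse(high)

# The only version the connector accepts is outside airbyte's accepted
# range, so the solver reports ResolutionImpossible.
print(in_range("0.63.2", "0.58.3", "0.59.0"))  # False
```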

aaronsteers commented 4 months ago

@betizad - thanks for creating this issue!

Have you tried skipping the install of the connector? PyAirbyte is able to install your connectors in their own dedicated virtual environments and it does this by default in order to prevent version conflicts.

aaronsteers commented 4 months ago

Alternatively, you can use a tool like pipx to install your connector if it's available. Pipx installs each application into its own isolated virtual environment and puts its CLI on PATH, but I haven't used it before in a notebook environment so I can't say for sure whether it would work in your case.
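A quick stdlib-only probe for whether pipx (or an already-installed connector CLI) is reachable on PATH in the current kernel — `shutil.which` returns the executable's full path, or None if it is not found:

```python
import shutil

# Check which of the relevant executables are resolvable on PATH.
# The connector CLI name below is illustrative.
for cmd in ("pipx", "source-linkedin-ads"):
    print(cmd, "->", shutil.which(cmd))
```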

betizad commented 4 months ago

I tried letting airbyte install the library it needs, but it did not work. I get the following error:

source = airbyte.get_source('source-linkedin-ads', version="0.7.0")
AirbyteSubprocessFailedError: AirbyteSubprocessFailedError: Subprocess failed.
    Run Args: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cb8b9359-8554-45c6-bb46-ae96bbd591dd/bin/python', '-m', 'venv', '/home/spark-cb8b9359-8554-45c6-bb46-ae/.venv-source-linkedin-ads']
    Exit Code: 1

My current workaround is:

aaronsteers commented 4 months ago

@betizad - I think I see the issue here. From the logs, I see you are running in Databricks/Spark, and their runtime apparently does not support the venv library - or the venv CLI is not findable.

I'm glad to hear you have a temporary workaround, but we'd still like to find a solution that works for Databricks users broadly.

(Updated the title of this issue to reflect what I now think is the root cause.)

Can you provide the specifics of your runtime?

And can you try the workaround which we applied to Colab?

In Colab, our examples like this one start with !apt-get install -qq python3.10-venv. I'm not confident this same workaround would work on Databricks, but it seems worth trying.
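A quick stdlib-only diagnostic (not part of PyAirbyte) can confirm whether venv creation is the failing step — it attempts the same `python -m venv` call that appears in the subprocess error above:

```python
import subprocess
import sys
import tempfile

def venv_supported() -> bool:
    """Probe whether `python -m venv` can create an environment in
    this runtime, without leaving anything behind."""
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            [sys.executable, "-m", "venv", f"{tmp}/probe"],
            capture_output=True,
        )
        return result.returncode == 0

print(venv_supported())
```

If this prints False, the runtime is missing venv support (e.g. Debian-based images without the python3-venv package), and PyAirbyte's default per-connector isolation cannot work there.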

mattppal commented 2 months ago

I'm running into a similar problem on another platform (Replit).

Replit is built on Nix and I suspect there are some permissions / config issues with trying to install venvs into the project folder.

ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.

My workaround:

  SOURCE_GOOGLE_SHEETS = "source-google-sheets"

  source = ab.get_source(
      name=SOURCE_GOOGLE_SHEETS,
      local_executable=f".pythonlibs/bin/{SOURCE_GOOGLE_SHEETS}"
  )

  source.set_config({
      "credentials": {
          "auth_type": "Service",
          "service_account_info": os.environ["SERVICE_ACCOUNT_JSON"]
      },
      "spreadsheet_id": SPREADSHEET_ID
  })

Of course, that presents its own challenges because now there are dependency issues πŸ˜…

Would love to find a solution for environments with challenging venv configurations.

betizad commented 2 months ago

> @betizad - I think I see the issue here. From the logs, I see you are running in Databricks/Spark, and their runtime apparently does not support the venv library - or the venv CLI is not findable.
>
> I'm glad to hear you have a temporary workaround, but we'd still like to find a solution that works for Databricks users broadly.
>
> (Updated the title of this issue to reflect what I now think is the root cause.)
>
> Can you provide the specifics of your runtime?
>
> And can you try the workaround which we applied to Colab?
>
> In Colab, our examples like this one start with !apt-get install -qq python3.10-venv. I'm not confident this same workaround would work on Databricks, but it seems worth trying.

It took me a while to get back to this.

I'm using:

- DBR 13.3 LTS
- Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
- PyAirbyte 0.10.4
- airbyte-source-linkedin-ads 2.0.0

The workaround in colab does not work in DBX. If I run !apt-get install -qq python3.10-venv I get a permission error:

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
aaronsteers commented 2 months ago

@betizad and @mattppal - Thank you both for sharing more about your context and execution requirements.

I did a bit of digging (mostly ChatGPT πŸ™„) and I believe I've confirmed that in both the Spark and also the Replit runtimes, there is no ability to create an 'isolated virtual env' - which we would need to ensure proper dependency isolation.

If we don't want to roll the dice, connector by connector, on whether the connectors will conflict with each other and/or with PyAirbyte or other libraries you are using in these environments, I can think of two decent paths forward:

Option 1: Leverage Conda across connectors and PyAirbyte to align dependency versions

This requires net new work on the side of Airbyte, and it would (probably?) also require some work from the user in terms of interacting with Conda or building a Conda environment.

This has an added benefit of streamlining usage in other environments that have Conda-based delivery integration - for instance with Snowflake's Snowpark Python runtime.

Option 2: Use a tool like Shiv or PyOxidizer to pre-build the connector executable

In this approach, we would design a process to build connectors into CLI executables - and the executable itself would handle delivery of dependencies and the needed environment isolation.

I believe this would work well in the case of Replit, where the executable would be uploaded to the Replit environment and then invoked/called by PyAirbyte. But getting this working correctly in a Spark cluster could be more complicated - since you'd need to ensure the CLI executable is available to all nodes in the cluster. (Not impossible, but also probably not a trivial effort.)

@betizad and @mattppal - I'm curious about your thoughts on both of these approaches. Let me know if one or both seem like they could be a good fit, and/or if you have any other ideas not mentioned above.

Thanks! πŸ™

aaronsteers commented 2 months ago

Circling back to this thread - A few other runtimes have been requested since my last post.

Cethan in Slack has reported difficulty deploying to www.render.com, and separately we've had some progress getting this to work with Airflow.

The trick that worked in Airflow was to use a Dockerfile that handles the isolation of installing the connectors into their own virtualenvs:

# Pre-install the connector(s) in their own virtualenv
# (using `.` rather than `source`, since Docker's default /bin/sh has no `source`)
RUN python -m venv source_github && . source_github/bin/activate &&\
    pip install --no-cache-dir airbyte-source-github && deactivate

# ... repeat for other connectors ...

# Test that the executable works and we can find it
RUN source_github/bin/source-github spec

# Go ahead and install PyAirbyte as usual
RUN python -m venv pyairbyte_venv && . pyairbyte_venv/bin/activate &&\
    pip install --no-cache-dir airbyte==0.10.4 && deactivate

If pipx is preinstalled on the image, this is slightly easier:

# pipx handles the virtual-env and auto-adds the connector CLI to PATH:
RUN pipx install airbyte-source-github
RUN pipx install airbyte-source-faker

# Test that the executables work and we can find them on PATH
RUN source-github spec
RUN source-faker spec

# Go ahead and install PyAirbyte as usual
RUN python -m venv pyairbyte_venv && . pyairbyte_venv/bin/activate &&\
    pip install --no-cache-dir airbyte==0.10.4 && deactivate
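For environments where you control a build step but not a Dockerfile, the same per-connector isolation can be scripted in Python. This is a hedged sketch (the connector and venv names are illustrative), mirroring the `python -m venv` lines above; the pip-install step is left to the caller:

```python
import os
import subprocess
import sys

def make_connector_venv(venv_dir: str) -> str:
    """Create a dedicated virtualenv for one connector and return the
    path to its bin/ (or Scripts/ on Windows) directory."""
    subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
    return os.path.join(venv_dir, "Scripts" if os.name == "nt" else "bin")

# Usage sketch (names illustrative):
# bin_dir = make_connector_venv("source_github")
# subprocess.run([os.path.join(bin_dir, "pip"), "install",
#                 "--no-cache-dir", "airbyte-source-github"], check=True)
# ...then point PyAirbyte at the installed CLI via
# get_source(..., local_executable=os.path.join(bin_dir, "source-github")).
```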
aaronsteers commented 1 month ago

Hello, @betizad, @mattppal -

Circling back here again. πŸ‘‹

Very happy to announce that we have a new "yaml" installation option that works for ~135 different API source connectors - along with all custom connectors built with our no-code Connector Builder. We're also investing heavily in migrating Python connectors to the no-code/low-code framework, which means the number of supported connectors will continue to grow.

Here is a Loom I recorded to walk through the feature:

Exploring PyAirbyte Declarative YAML Sources πŸš€ (video)