betizad opened this issue 9 months ago
@betizad - thanks for creating this issue!
Have you tried skipping the install of the connector? PyAirbyte is able to install your connectors in their own dedicated virtual environments and it does this by default in order to prevent version conflicts.
Alternatively, you can use a tool like pipx to install your connector if it's available. Pipx installs Python CLI applications into their own isolated environments and exposes them on PATH, but I haven't used it before in a notebook environment so I can't say for sure if it would work in your case.
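As a sketch of how the pipx route could look (assuming `pipx install airbyte-source-linkedin-ads` has already put the connector CLI on PATH; the helper name below is illustrative, not part of PyAirbyte):

```python
import shutil

def find_connector_executable(name: str):
    """Look up a connector CLI (e.g. one installed via pipx) on the
    current PATH; returns the full path, or None if it is not found."""
    return shutil.which(name)

# Usage with PyAirbyte (assumes the `airbyte` package is installed):
#   import airbyte as ab
#   exe = find_connector_executable("source-linkedin-ads")
#   source = ab.get_source("source-linkedin-ads", local_executable=exe)
```

Pointing `local_executable` at a pre-installed CLI is what sidesteps PyAirbyte's own venv creation.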
I tried letting airbyte install the library needed, but it did not work. I get the following error:
source = airbyte.get_source('source-linkedin-ads', version="0.7.0")
AirbyteSubprocessFailedError: AirbyteSubprocessFailedError: Subprocess failed.
Run Args: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cb8b9359-8554-45c6-bb46-ae96bbd591dd/bin/python', '-m', 'venv', '/home/spark-cb8b9359-8554-45c6-bb46-ae/.venv-source-linkedin-ads']
Exit Code: 1
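One way to narrow this down is to test whether the runtime can create a venv at all, independent of PyAirbyte. A minimal diagnostic sketch (`--without-pip` avoids the separate failure mode of a missing ensurepip/python3-venv package):

```python
import subprocess
import sys
import tempfile

def venv_creation_works() -> bool:
    """Return True if this interpreter can create a bare virtual environment."""
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            [sys.executable, "-m", "venv", "--without-pip", f"{tmp}/venv"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Surface the same underlying error PyAirbyte would hit.
            print(result.stderr)
        return result.returncode == 0
```

If this returns False, the problem is the runtime's venv support rather than anything PyAirbyte-specific.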
My current workaround is:
import subprocess
import airbyte

LINKEDIN_EXEC = subprocess.Popen("which source-linkedin-ads", shell=True, stdout=subprocess.PIPE).stdout.read().decode().replace("\n", "")
source = airbyte.get_source('source-linkedin-ads', local_executable=LINKEDIN_EXEC)
@betizad - I think I see the issue here. From the logs, I see you are running in Databricks/Spark, and their runtime apparently does not support the venv library - or the venv CLI is not findable.
I'm glad to hear you have a temporary workaround, but we'd still like to find a solution that works for Databricks users broadly.
(Updated the title of this issue to reflect what I now think is the root cause.)
Can you provide the specifics to your runtime?
And can you try the workaround which we applied to Colab?
In Colab, our examples like this one start with !apt-get install -qq python3.10-venv. I'm not confident this same workaround would work on Databricks, but it seems worth trying.
I'm running into a similar problem on another platform (Replit).
Replit is built on Nix and I suspect there are some permissions / config issues with trying to install venvs into the project folder.
ERROR: Can not perform a '--user' install. User site-packages are not visible in this virtualenv.
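That error suggests pip is falling back to a `--user` install inside a virtualenv where user site-packages are disabled. A quick stdlib check to confirm (a diagnostic sketch, not part of PyAirbyte):

```python
import site
import sys

# `--user` installs require user site-packages to be enabled for this interpreter.
print("user site enabled:", site.ENABLE_USER_SITE)

# PEP 405 virtual environments set sys.prefix != sys.base_prefix.
print("running inside a venv:", sys.prefix != sys.base_prefix)
```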
My workaround:
import os
import airbyte as ab

SOURCE_GOOGLE_SHEETS = "source-google-sheets"
source = ab.get_source(
    name=SOURCE_GOOGLE_SHEETS,
    local_executable=f".pythonlibs/bin/{SOURCE_GOOGLE_SHEETS}",
)
source.set_config({
    "credentials": {
        "auth_type": "Service",
        "service_account_info": os.environ["SERVICE_ACCOUNT_JSON"],
    },
    "spreadsheet_id": SPREADSHEET_ID,
})
Of course, that presents its own challenges, because now there are dependency issues.
Would love to find a solution for environments with challenging venv configurations.
It took me a while to get back to this.
I'm using:
- DBR 13.3 LTS
- Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
- PyAirbyte 0.10.4
- airbyte-source-linkedin-ads 2.0.0
The Colab workaround does not work in Databricks. If I run !apt-get install -qq python3.10-venv, I get a permission error:
E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
@betizad and @mattppal - Thank you both for sharing more about your context and execution requirements.
I did a bit of digging (mostly with ChatGPT) and I believe I've confirmed that in both the Spark and Replit runtimes, there is no way to create an isolated virtual environment - which we would need to ensure proper dependency isolation.
If we don't want to roll the dice on a per-connector basis about whether the connectors will have conflicts with each other and/or with PyAirbyte or other libraries that you are using in these environments, I can think of two decent paths forward:
Option 1: Conda-based packaging. This requires net new work on the side of Airbyte, and it would (probably?) also require some work from the user in terms of interacting with Conda or building a Conda environment. This has an added benefit of streamlining usage in other environments that have Conda-based delivery integration - for instance with Snowflake's Snowpark Python runtime.
Option 2: Standalone CLI executables. In this approach, we would design a process to build connectors into CLI executables - and the executable itself would handle delivery of dependencies and the needed environment isolation.
I believe this would work well in the case of Replit, where the executable would be uploaded to the Replit environment and then invoked/called by PyAirbyte. But getting this working correctly in a Spark cluster could be more complicated - since you'd need to ensure the CLI executable is available to all nodes in the cluster. (Not impossible, but also probably not a trivial effort.)
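Whichever packaging wins, the contract PyAirbyte needs is essentially "an executable that speaks the Airbyte protocol." A hedged sketch of invoking such an executable directly (the helper is illustrative, and assumes the connector emits line-delimited JSON messages, as Airbyte connectors do):

```python
import json
import subprocess

def read_connector_spec(executable: str) -> dict:
    """Run `<executable> spec` and return the SPEC message's payload."""
    proc = subprocess.run(
        [executable, "spec"], capture_output=True, text=True, check=True
    )
    for line in proc.stdout.splitlines():
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # connectors may also print plain log lines
        if message.get("type") == "SPEC":
            return message["spec"]
    raise RuntimeError(f"{executable} emitted no SPEC message")
```

This is the same `spec` smoke test used to validate installs elsewhere in this thread; in a Spark cluster, the remaining problem is distributing that executable to every node.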
@betizad and @mattppal - I'm curious to hear your thoughts on both of these approaches. Let me know if one or both seem like they could be a good fit, and/or if you have any other ideas not mentioned above.
Thanks!
Circling back to this thread - A few other runtimes have been requested since my last post.
Cethan in Slack has reported difficulty deploying on www.render.com, and separately we've had some progress getting this to work with Airflow.
The trick that worked in Airflow was to use a Dockerfile that handles the isolation of installing the connectors into their own virtualenvs:
# Pre-install the connector(s) in their own virtualenv.
# Invoke the venv's pip directly: `source` is a bash builtin and is not
# available in the default /bin/sh that Docker RUN steps use.
RUN python -m venv source_github && \
    source_github/bin/pip install --no-cache-dir airbyte-source-github
# ... repeat for other connectors ...
# Test that the executable works and we can find it
RUN source_github/bin/source-github spec
# Go ahead and install PyAirbyte as usual
RUN python -m venv pyairbyte_venv && \
    pyairbyte_venv/bin/pip install --no-cache-dir airbyte==0.10.4
If pipx is preinstalled on the image, this is slightly easier:
# pipx handles the virtual-env and auto-adds the connector CLI to PATH:
RUN pipx install airbyte-source-github
RUN pipx install airbyte-source-faker
# Test that the executables work and we can find them on PATH
RUN source-github spec
RUN source-faker spec
# Go ahead and install PyAirbyte as usual
RUN python -m venv pyairbyte_venv && \
    pyairbyte_venv/bin/pip install --no-cache-dir airbyte==0.10.4
Hello, @betizad, @mattppal -
Circling back here again.
Very happy to announce that we have a new "yaml" installation option that works for ~135 different API source connectors - along with all custom connectors built with our no-code Connector Builder. We're also investing heavily in migrating python connectors to the no-code/low-code framework, which means the number of supported connectors will continue to grow.
Here is a Loom I recorded to walk through the feature:
Compiling a related list of docs and resources here:
When I try to install airbyte and airbyte-source-linkedin-ads, I get the following error.
I install in Databricks using the command:
%pip install airbyte==0.7.2 airbyte-source-linkedin-ads==0.7.0
When I do the same on a local machine, linkedin-ads is installed in a new venv, which does not work in Databricks.