dgarnitz / vectorflow

VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice.
https://www.getvectorflow.com/
Apache License 2.0
663 stars, 47 forks

Add integration to salesforce #16

Open dgarnitz opened 1 year ago

dgarnitz commented 1 year ago

VectorFlow should be able to ingest raw data from Salesforce.

Some open questions to explore prior to implementation:

asadnhasan commented 11 months ago

Thanks for raising this feature request. Ingesting data from Salesforce into VectorFlow is definitely something we should explore supporting. Here are some initial thoughts on your questions:

Regarding ingestion - It looks like the Salesforce REST API provides options for exporting data in JSON, XML, and CSV formats. I think the best approach would be to build a separate lightweight ingestion worker specifically for Salesforce data. This worker could handle authentication with Salesforce using OAuth, make API calls to export data, do any needed parsing/validation, and then pass the transformed data to VectorFlow's main ingestion pipeline.
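To make the extraction step concrete, here is a minimal sketch of paging through a SOQL query against the Salesforce REST API. It is an illustration under assumptions, not VectorFlow code: `API_VERSION`, `build_query_url`, and `iter_records` are hypothetical names, and `fetch_json` stands in for whatever authenticated HTTP GET the worker would use (for example, one that sends an OAuth bearer token). The only Salesforce details relied on are the `/services/data/<version>/query` resource and the `records` and `nextRecordsUrl` fields of its response.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

API_VERSION = "v59.0"  # assumption: any API version enabled for the org would do


def build_query_url(instance_url: str, soql: str) -> str:
    """Build the REST query URL for a SOQL statement."""
    return f"{instance_url}/services/data/{API_VERSION}/query?{urlencode({'q': soql})}"


def iter_records(
    instance_url: str,
    soql: str,
    fetch_json: Callable[[str], dict],  # hypothetical: authenticated GET returning parsed JSON
) -> Iterator[dict]:
    """Yield every record of a query, following nextRecordsUrl across pages."""
    url = build_query_url(instance_url, soql)
    while url:
        page = fetch_json(url)
        for record in page.get("records", []):
            # Drop the per-record "attributes" metadata block Salesforce adds.
            yield {k: v for k, v in record.items() if k != "attributes"}
        next_path = page.get("nextRecordsUrl")
        url = f"{instance_url}{next_path}" if next_path else None
```

Passing the HTTP call in as `fetch_json` keeps authentication out of the paging logic, so the same loop could sit in a dedicated worker or behind an API endpoint.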

For security - OAuth should allow secure authentication for the ingestion worker to access Salesforce. We can encrypt any credentials stored in configuration. Restricting the worker to only access the needed Salesforce data exports will also be important.

Suggested file formats - The Salesforce API supports JSON, XML and CSV. CSV may be the easiest to work with in VectorFlow if we can get full data exports. For more targeted exports, JSON or XML may be required. Some parsing would be needed in the ingestion worker before passing data to VectorFlow in a supported format like Parquet.
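As a rough illustration of the parsing step, the sketch below normalizes a CSV export and a JSON export into the same list of dictionaries, then flattens each record into text that could be chunked and embedded. All function names are hypothetical, and the `field: value` text layout is an assumption rather than VectorFlow's actual input schema. It uses only the standard library and does not cover XML or Parquet.

```python
import csv
import io
import json


def records_from_csv(text: str) -> list[dict]:
    """Parse a CSV export (header row first) into one dict per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_json(text: str) -> list[dict]:
    """Parse a JSON export: either a bare list or an object with a "records" list."""
    payload = json.loads(text)
    rows = payload["records"] if isinstance(payload, dict) else payload
    # Drop the "attributes" metadata block that Salesforce attaches to each record.
    return [{k: v for k, v in row.items() if k != "attributes"} for row in rows]


def record_to_text(record: dict) -> str:
    """Flatten one record into "field: value" lines, skipping empty fields."""
    return "\n".join(f"{k}: {v}" for k, v in record.items() if v not in (None, ""))
```

With both formats reduced to the same shape, the choice between CSV and JSON only affects the first parsing call, not anything downstream.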

Possible next steps to work on:

- Explore Salesforce OAuth authentication flows for the ingestion worker
- Test sample data exports from the Salesforce API in JSON, XML and CSV
- Prototype a basic ingestion worker that extracts a sample export, parses the data, and writes it to Parquet
- Evaluate how the exported data maps to VectorFlow's expected input schema (important)

dgarnitz commented 11 months ago

How feasible would it be to use this: https://llamahub.ai/l/tools-salesforce?

I don't think we should have a separate Salesforce worker. A /salesforce endpoint in the existing API should do the trick. Can you choose which format (e.g. JSON or CSV) the data is exported in?
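For discussion, here is one way the request handling behind such an endpoint might look, written as a framework-independent function so it could be wired into the existing API. Everything here is hypothetical: the `soql` and `format` fields, the status codes, and the response shape are assumptions for illustration, not an agreed design.

```python
SUPPORTED_FORMATS = {"json", "csv"}  # assumption: the two formats raised in this thread


def handle_salesforce_request(payload: dict) -> tuple[dict, int]:
    """Validate a hypothetical POST /salesforce body and return (response, status)."""
    soql = (payload.get("soql") or "").strip()
    export_format = (payload.get("format") or "json").lower()
    if not soql:
        return {"error": "soql is required"}, 400
    if export_format not in SUPPORTED_FORMATS:
        return {"error": f"unsupported format: {export_format}"}, 400
    # A real handler would enqueue an ingestion job here; this sketch only echoes the plan.
    return {"status": "accepted", "soql": soql, "format": export_format}, 202
```

For example, `handle_salesforce_request({"soql": "SELECT Id FROM Account", "format": "csv"})` would be accepted, while a request asking for an unsupported format would be rejected with a 400.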

mmabrouk commented 10 months ago

Small note: I'd look into how Airbyte solves this: https://github.com/airbytehq/airbyte/tree/f54bd550aae9b4bf19220b50af47da0adc3b4ff1/airbyte-integrations/connectors/source-salesforce

dgarnitz commented 10 months ago

We are planning to add an Airbyte connector. Maybe we can access the Salesforce data through that.

david-vectorflow commented 10 months ago

@asadnhasan are you still planning on working on this?

asadnhasan commented 10 months ago

@david-vectorflow Yes, David, I am still working on it.

dgarnitz commented 5 months ago

@asadnhasan hey, are you still interested in building this out?

syedzaidi-kiwi commented 5 months ago

Yes, absolutely, I would love to contribute. I will come back with something in 2-3 days.