estuary / connectors

Connectors for capturing data from external data sources

Request a connector to [capture from | materialize to] [your favorite system] #849

Open eduardoluismarin opened 1 year ago

eduardoluismarin commented 1 year ago

System Name

Google Drive

Type

Both

Details

I want to capture PDF, DOC, and TXT files.

dyaffe commented 1 year ago

We are planning on adding a Google Drive CSV capture very soon. Can you tell us more about your use case since this is the first time we've received a request for PDFs?

eduardoluismarin commented 1 year ago

Hello, happy Hump Day to you.

Thanks for your email and your prompt response.

Let me clarify. I want to ingest PDF files because I am creating a knowledge base and integrating your solution with ChatGPT. I know they are static documents, but I am centralizing the whole knowledge base in your solution. Regards

On Wed, Jul 26, 2023, 02:28 dyaffe @.***> wrote:

We are planning on adding a Google Drive CSV capture very soon. Can you tell us more about your use case since this is the first time we've received a request for PDFs?

  • How would you want us to ingest PDF documents and sync them to other systems?
  • What's the use case?

psFried commented 1 year ago

@eduardoluismarin Can you describe how you'd want the data from these to be structured?

For txt files, it seems pretty straightforward to have it produce something like {"content": "the full contents of the txt file..."}. But Google Docs and especially PDFs can contain very complex structures and content, and it's not necessarily clear how those ought to be represented. Do you have an example of what you might want in terms of the JSON representation, or even just how you'd expect a document to be represented when requesting embeddings from the OpenAI API?
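
To make the txt case concrete, here is a minimal Python sketch of the single-field shape mentioned above. It is only an illustration, not how any Estuary connector actually works: the function names `txt_to_document` and `to_json_line` are hypothetical, and UTF-8 encoding is an assumption.

```python
import json
from pathlib import Path


def txt_to_document(path: Path) -> dict:
    """Wrap a text file's full contents in a one-field document.

    Hypothetical helper: mirrors the {"content": "..."} shape from the
    comment above and assumes the file is UTF-8 encoded.
    """
    return {"content": path.read_text(encoding="utf-8")}


def to_json_line(document: dict) -> str:
    """Serialize a document as a single JSON line (newlines are escaped)."""
    return json.dumps(document, ensure_ascii=False)
```

PDFs and Google Docs would need more than one field (pages, headings, tables, and so on), which is exactly the open question in this thread.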