enable importing of CSV data files

aryn-ai / sycamore

🍁 Sycamore is an LLM-powered search and analytics platform for unstructured data.

https://sycamore.readthedocs.io

Apache License 2.0


enable importing of CSV data files #201

Open visakha opened 10 months ago

visakha commented 10 months ago

Ingesting structured data is also a big requirement. Companies already have a good understanding of their structured data; they want that data to work for them, and Aryn could be a channel to enable that.

Solution: PDF ingestion should work like a plugin. I have not read the code yet, but if it does, the same plugin architecture could be adopted for other file formats, in this case CSV. If not, I will be more than happy to work under guidance to implement it.

Alternatives: No alternatives.

Additional context: Say a client has customer data in an RDBMS-backed CRM system, and we want to add some conversational intelligence to that space. How do we do that? We would first export the tables to CSV files 1:1, then use Sycamore to ingest them and build relationships between the CSV files to accelerate known knowledge.

bsowell commented 10 months ago

Hi @visakha. Thanks for the feedback! We definitely agree that structured data is super important in this space and we welcome suggestions on the best way to incorporate it.

We do have a JSON reader (https://github.com/aryn-ai/sycamore/blob/33e3245cc05594ef808969015afa732cc3a8813c/sycamore/scans/file_scan.py#L161) that might give some flavor for how this could work. I could see a CSV reader working similarly -- you specify which field to use as the "main content" (text_representation in Sycamore terms), and then read the rest as properties. Does this seem like it would work for your use case?
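To make the idea concrete, here is a minimal sketch of that mapping using only Python's standard `csv` module. It is not Sycamore's reader API: the function name `rows_to_documents`, the output dict layout, and the sample columns are all hypothetical, and only the `text_representation` name comes from the comment above.

```python
import csv
import io
from typing import Iterable, Iterator


def rows_to_documents(
    source: Iterable[str], content_field: str, delimiter: str = ","
) -> Iterator[dict]:
    """Yield one document-like dict per CSV row (hypothetical layout).

    The chosen column becomes the main content; every other column
    is kept as a string-valued property.
    """
    reader = csv.DictReader(source, delimiter=delimiter)
    # Accessing fieldnames reads the header row; checked on first iteration.
    if reader.fieldnames is None or content_field not in reader.fieldnames:
        raise ValueError(f"column {content_field!r} not found in header")
    for row in reader:
        properties = {k: v for k, v in row.items() if k != content_field}
        yield {
            "text_representation": row[content_field],
            "properties": properties,
        }


# Made-up sample data standing in for an exported CRM table.
sample = io.StringIO(
    "id,name,notes\n"
    "1,Acme,Renewal due in March\n"
    "2,Globex,Asked about bulk pricing\n"
)
docs = list(rows_to_documents(sample, content_field="notes"))
```

Each row becomes one record, so `docs[0]["text_representation"]` holds the `notes` value and `docs[0]["properties"]` holds `id` and `name`. Relationships between several exported tables would still need to be built on top of this, for example by joining on the `id` property.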

bsowell commented 10 months ago

Since you mentioned that the data originates in an RDBMS, another question is whether it would be useful to have connectors directly to the database rather than doing an intermediate CSV export.

alexaryn commented 9 months ago

If we end up doing CSV, we should add TSV at the same time. It's trivial and a lot easier to work with.
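As a rough illustration of why the two formats can share one code path, the sketch below reads both with Python's standard `csv` module, changing only the delimiter. The helper name `read_delimited` and the sample data are made up for this example and are not part of Sycamore.

```python
import csv
import io


def read_delimited(text: str, delimiter: str) -> list:
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Same parser, different delimiter.
csv_rows = read_delimited("id,name\n1,Acme\n", delimiter=",")
tsv_rows = read_delimited("id\tname\n1\tAcme, Inc.\n", delimiter="\t")
```

In the TSV sample, the comma inside `Acme, Inc.` needs no quoting because the field separator is a tab, which is one reason TSV can be simpler to handle.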

visakha commented 9 months ago

The reason I say CSV/TSV (vs. a direct DB connection) is the clean boundary. The integration concerns will have a clear starting point.

On Mon, Jan 8, 2024 at 3:44 PM Alex Meyer wrote:

If we end up doing CSV, we should add TSV at the same time. It's trivial and a lot easier to work with.
