airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[source-file] Multiple sheets in XLSX #47445

Open ex0ns opened 2 weeks ago

ex0ns commented 2 weeks ago

Connector Name

source-file

Connector Version

0.5.13

What step the error happened?

None

Relevant information

I was trying to load an Excel (XLSX) file containing multiple sheets, and I noticed that in the output all my headers were mixed up and no information about the sheets themselves was kept.

I was expecting an outcome similar to what we get when loading data from a Google Sheet: a single source, containing one table (i.e., one stream) for each sheet of the document.

This seems related to this part of the code: https://github.com/airbytehq/airbyte/blob/b1b2f9c744408665d29f115826eab8d36e3b503e/airbyte-integrations/connectors/source-file/source_file/client.py#L507-L528

Is there a reason it was done that way? Would it be possible to keep information about each of the existing sheets of the document? I don't have any experience with the Airbyte source code, so I wanted to make sure I was looking at the right place, and maybe get a few pointers on where to start in order to contribute and improve the Excel reader. First, though, I wanted to understand why it was done this way.
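As a rough illustration of the per-sheet behavior described above (not the connector's actual code), here is a minimal sketch. It assumes pandas with an XLSX engine such as openpyxl is available for reading; `load_sheets` and `sheets_to_streams` are hypothetical helper names, and the returned dict layout is an assumption for illustration, not the Airbyte catalog format.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def load_sheets(path: str) -> Dict[str, List[Record]]:
    """Read every sheet of an XLSX file into row dicts, keyed by sheet name."""
    # Imported lazily so the rest of the sketch runs without pandas installed.
    # Reading .xlsx also requires an engine such as openpyxl.
    import pandas as pd

    # sheet_name=None makes pandas return all sheets as {sheet name: DataFrame}.
    frames = pd.read_excel(path, sheet_name=None)
    return {name: df.to_dict(orient="records") for name, df in frames.items()}


def sheets_to_streams(sheets: Dict[str, List[Record]]) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: build one stream per sheet, each with its own columns."""
    streams: Dict[str, Dict[str, Any]] = {}
    for sheet_name, rows in sheets.items():
        # Collect this sheet's column names only, in first-seen order,
        # so headers from different sheets are never merged together.
        columns: List[str] = []
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
        streams[sheet_name] = {"columns": columns, "records": rows}
    return streams


# Usage with in-memory data standing in for load_sheets("workbook.xlsx"):
example_sheets = {
    "Customers": [{"id": 1, "name": "Ada"}],
    "Orders": [{"order_id": 10, "customer_id": 1, "total": 9.5}],
}
example_streams = sheets_to_streams(example_sheets)
```

Keeping the sheet name as the stream key is what preserves the per-sheet information; each stream carries only the headers found in its own sheet.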

Thanks!

Relevant log output

No response

Contribute

marcosmarxm commented 3 days ago

@ex0ns let me know if you need any assistance with the contribution.

ex0ns commented 3 days ago

I have not started looking at this yet. I wanted to understand whether there were technical challenges and/or why it was made that way. There is even a note in the connector:

Image