cagov / data-infrastructure

CalData infrastructure
https://cagov.github.io/data-infrastructure
MIT License
7 stars 0 forks source link

Test various ways of loading excel sheets with Fivetran #149

Closed ian-r-rose closed 1 year ago

ian-r-rose commented 1 year ago

A number of our projects will likely involve loading Excel sheets of various types. Fivetran does support Excel, but it doesn't seem to support very complex excel sheets. Basically, the further it looks from a plain CSV, the less likely we will be to properly read it. If there are header rows, multiple sheets, formulas, or macros, we could run into trouble and have to fall back on custom scripts or manual processes.

I don't have the strongest understanding of Fivetran's excel limitations, and the docs leave something to be desired. I'd like to do a bit of testing to make sure that I understand which approaches work well, and which ones don't, including:

  1. Files stored in a Sharepoint folder
  2. Files stored in Google Drive
  3. Files with multiple sheets
  4. Files with header rows
  5. Files with a pivot table off to the side
  6. ...
ian-r-rose commented 1 year ago

A few notes on the above:

  1. I successfully loaded an Excel file from my personal DGS Sharepoint site. I found the ontology between OneDrive and Sharepoint to be extremely confusing, and bounced back and forth between those two versions of the connector, but Sharepoint is the one that worked for our particular tenant.
  2. I successfully loaded the same Excel file from a Google Drive folder, after having shared that folder with a Fivetran email address per the setup instructions.
  3. Fivetran automatically loads all sheets in an Excel file into tables of the form {file_name}_{sheet_name}. This does not appear to be configurable, it always happens. Worthwhile to give your sheets descriptive names, otherwise you get chaff like {my_file}_sheet_1. It's a bit annoying, because lots of sheets are probably not useful to load, and they crowd your destination schema.
  4. If there are any header rows (i.e., frontmatter that is not the CSV-ish header), those are treated as column names. This is almost certainly not what people want, and will result in a corrupted, incorrectly-read table. This is probably a blocker for reading many common Excel files!
  5. Pivot tables appear to be loaded correctly, though clearly the formulae powering them are not transferred over.
ian-r-rose commented 1 year ago

Some conclusions on this issue:

Closing this particular investigation as complete.