frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense
481 stars 107 forks source link

Excel resource with multiple worksheets #956

Closed mjarrett closed 5 days ago

mjarrett commented 6 days ago

I'm interested in how people are applying the spec to the case where an Excel workbook (.xlsx) has tabular data in multiple worksheets.

I see this issue (https://github.com/frictionlessdata/datapackage/issues/793) which suggests specifying the sheet name in the dialect attribute. The example provided makes sense but only if there is a single worksheet in the file. I've been considering a case like this:

{
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "mydata",
      "path": "mydata.xlsx",
      "dialect": {
        "sheet": "sheet1"
      },
   {
      "name": "mydata",
      "path": "mydata.xlsx",
      "dialect": {
        "sheet": "sheet2"
      }
    }
  ]
}

But in the above case there are two resources with the same name and path. I'm not even sure the worksheets should be considered separate resources, as my understanding is that in general a resource corresponds to a single file. I suppose this is not limited to Excel, as there are other file types where multiple tables can be saved in a single file.

I'm very interested if/how others have dealt with this situation. Examples would be very appreciated!

khusmann commented 5 days ago

as my understanding is that in general a resource corresponds to a single file

I would say instead in general a resource corresponds to a single table.

When a format stores a single table per file, then yes, there's a 1-1 relationship between resources and files. But in the case of spreadsheets (and databases), where multiple tables are stored in a single file, you'll end up having multiple resources (one for each table) defined using that same file. So the example definition you shared looks good to me!

One thing to be aware of -- version 2 of the table dialect standard uses sheetName instead of sheet. I believe at some point this will be implemented in frictionless-py, but I'm not sure what the timeline is/will be.

khusmann commented 5 days ago

Oh oops, I didn't notice the resources in your example also have the same name! You'll want to make those unique within the data package. A naming scheme something like this would do, for example:

{
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "mydata-sheet1",
      "path": "mydata.xlsx",
      "dialect": {
        "sheet": "sheet1"
      },
   {
      "name": "mydata-sheet2",
      "path": "mydata.xlsx",
      "dialect": {
        "sheet": "sheet2"
      }
    }
  ]
}
mjarrett commented 5 days ago

Thank @khusmann thinking of a resource as a table instead of a file makes perfect sense, I'll go with something like your example.