frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Promote "Compression of resources" recipes to the Data Resource spec #901

Open roll opened 2 months ago

roll commented 2 months ago

Overview

There is quite a simple recipe - https://datapackage.org/recipes/compression-of-resources/ - that adds a new resource.compression property to the Data Resource spec. It's supported by frictionless-py.
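For context, a descriptor following the recipe might look roughly like this (a sketch; the resource name and path are made up, and the "gz" value is my reading of the recipe page):

```json
{
  "name": "observations",
  "path": "observations.csv.gz",
  "format": "csv",
  "compression": "gz"
}
```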

Note

It might make sense to consider frictionless-py's resource.innerPath as well for providing a path inside an archive.

peterdesmet commented 2 months ago

frictionless-r ignores the resource.compression property. It does support reading compressed files based on the extension found in path: https://docs.ropensci.org/frictionless/reference/read_resource.html#file-compression
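With extension-based detection, the descriptor needs no extra property at all; a hypothetical sketch (names and URL made up):

```json
{
  "name": "observations",
  "path": "https://example.com/data/observations.csv.gz",
  "format": "csv"
}
```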

So I'm neutral regarding promoting this to the specs: frictionless-r will likely continue to ignore it.

peterdesmet commented 2 months ago

Regarding resource.innerPath, I'd rather not see an additional property for files, since we already have to deal with data and path. I would express this in path, as follows:

"path": "path/to/my/archive.zip/data.csv"

khusmann commented 2 months ago

frictionless-r ignores the resource.compression property. So I'm neutral regarding promoting this to the specs

Agreed -- inferring compression from the path seems just fine. Do we get some other value from having a "compression" property that I'm missing?

Regarding resource.innerPath, I'd rather not see an additional property for files, since we already have to deal with data and path. I would express this in path, as follows:

I think this works for local paths, but not as well for remote ones, because it's harder to detect where the zip file is (archive.zip may not be an actual zip file, but just part of the URL, and that's harder to check than for a local path)

That said, I'm not keen on innerPath either. Regarding compression I would expect two main scenarios / use cases:

1) Individual resources are compressed (as described in the pattern). This is useful for data packages hosted remotely -- you only need to download the data you need (and it is transferred compressed). This doesn't need innerPath because the compressed files do not contain multiple files.

2) The entire data package (including datapackage.json) is compressed. This is useful for an archival blob of the entire package that can be distributed as a single unit, without dependencies (similar to an OpenDocument spreadsheet with multiple sheets). This doesn't need innerPath either, because everything is already inside the zip file, so paths are already internal to the zip.

innerPath is only applicable when a data package is a) referencing multiple resources in a single zip and b) the datapackage.json isn't included in that level of compression. I don't think this is something we want to support / encourage -- or maybe I'm missing a benefit / use case?

peterdesmet commented 2 months ago

I don't think this is something we want to support / encourage.

I agree.

roll commented 2 months ago

Agreed -- inferring compression from the path seems just fine. Do we get some other value from having a "compression" property that I'm missing?

I will play a devil's advocate role here, but you know, inferring a Table Schema usually works just fine as well :smiley: So in my opinion it is just a question of interoperability and documentation quality. Currently, the spec doesn't mention compression at all, so the behavior is, generally speaking, undefined; I think we at least need to clarify it. On the other hand, as there is already resource.format, resource.compression feels like the same kind of indicator.

That said, I'm not keen on innerPath either. Regarding compression I would expect two main scenarios / use cases:

So regarding inner path, I think it's only applicable if a data publisher has to use some artifact, i.e. a ZIP file, that they cannot control, so they map resources from this archive similarly to how Excel sheets are mapped onto resources with Table Dialect

khusmann commented 2 months ago

I will play a devil's advocate role here, but you know, inferring a Table Schema usually works just fine as well 😃

Haha, to play counter devil's advocate: It's standard for file names to include compression type in their extension (file1.csv.zip, file2.csv.gz), but there's not a similar standard for field names to include frictionless field type information (column1.integer, column2.boolean, column3.number).

To me, the question is, do we want to allow compressed paths without extensions, or compressed paths with extensions that don't match the compression type? Otherwise, resource.compression is redundant and will be largely ignored by implementations (as frictionless-r does right now).

I think it's only applicable if a data publisher has to use some artifact i.e. ZIP file that they cannot control so they map resources from this archive

If they don't control the ZIP, then there's all kinds of malformed scenarios we can imagine... The question is where we draw the line.

For example, the ZIP could have other nested ZIPs in it, which in turn hold the table data... That would require nested innerPath properties, which I don't think we should support either.

similarly how excel sheets mapped onto resources with Table Dialect

Selecting an Excel sheet in a workbook or a table in an SQLite db is a lot more well-defined, I think, because unlike a generic multi-file archive they come with a lot of guarantees / constraints (e.g. they only hold a specific kind of table data and cannot be nested)

Side note -- Does the sheetName property in Table Dialect also allow you to select particular tables in an SQL db? (are SQLite DBs considered as "spreadsheet" formats?)

khusmann commented 2 months ago

I think, currently, the spec doesn't mention compression at all so the behavior does look just undefined generally speaking. I think we at least need to clarify it.

I agree on this though! I'd suggest something like 1) compression type MAY be specified via a path extension (and here's a supported list of formats) and 2) when paths to archives are used, they MUST contain only one file.
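The extension-based side of this proposal could be sketched in a few lines. This is a minimal illustration, not frictionless-py's actual behavior; the function name and the supported list ("gz", "zip") are assumptions for the example.

```python
def infer_compression(path):
    """Return the compression type implied by the path's final extension,
    or None if the path does not look compressed.
    NOTE: illustrative sketch only; the supported list is an assumption."""
    supported = {".gz": "gz", ".zip": "zip"}
    for ext, kind in supported.items():
        if path.lower().endswith(ext):
            return kind
    return None

print(infer_compression("data/file1.csv.gz"))  # gz
print(infer_compression("data/archive.zip"))   # zip
print(infer_compression("data/file2.csv"))     # None
```

The one-file-per-archive rule would then be checked at read time (e.g. by counting the members of the ZIP) rather than from the descriptor.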

roll commented 2 months ago

It's standard for file names to include compression type in their extension (file1.csv.zip, file2.csv.gz)

I think it's more like a convention rather than a standard

Side note -- Does the sheetName property in Table Dialect also allow you to select particular tables in an SQL db? (are SQLite DBs considered as "spreadsheet" formats?)

No, we had table property in the draft for SQL but I removed it for now to wait for an actual user request

I agree on this though! I'd suggest something like 1) compression type MAY be specified via path extension (and here's a supported list of formats) and 2) when paths to archives are used, they MUST only contain only one file.

I think, currently, we don't define anything regarding the form of resource.path (regarding format or compression). We might consider adding compression information to https://datapackage.org/specifications/data-resource/#path-or-data-required. Personally, I don't have preferences -- requiring one file per archive seems a reasonable approach. My main point here is that, as "an implementor", I need some clear definition like "if it is an archived file, resource.path MUST end with a .gz or .zip suffix indicating the compression algorithm" (and reading this sentence I still feel that a dedicated property might be kind of better than parsing a path :smiley:)

fjuniorr commented 2 months ago

No, we had table property in the draft for SQL but I removed it for now to wait for an actual user request

@roll you mean something like this would not be supported? I do use this internally and crafted this gist for a user query in Frictionless Slack.

roll commented 2 months ago

@fjuniorr We can totally add dialect.table if there is a demand for it cc @dafeder

khusmann commented 2 months ago

I think it's more like a convention rather than a standard

I agree, I was being sloppy with my language there :)

To be clear, I'm neutral on the resource.compression property. My only slight preference here is that we choose to use the extension in the path, OR resource.compression, but not both... that way they are not in competition. But I defer to stronger opinions on this.

My main point here is as "an implementor" I need some clear definition ... We might consider adding compression information to https://datapackage.org/specifications/data-resource/#path-or-data-required.

Agreed! I think that's a perfect place to put this info.

I do think enforcing one file per archive is a good idea (no innerPath property). This way it stays natural to specify multi-part resources with compression, and we don't need an exception for archives (e.g. "path": ["file1.csv.gz", "file2.csv.gz"]). We can always relax the one-file-per-archive rule and add an innerPath if there's demand for it later.
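Under that rule, a multi-part compressed resource would reuse the existing path array; a sketch (resource name made up):

```json
{
  "name": "observations",
  "path": ["file1.csv.gz", "file2.csv.gz"],
  "format": "csv"
}
```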

I also think it'd be nice to mention somewhere the practice of compressing entire data packages. (frictionless-py already supports this as well).

We can totally add dialect.table if there is a demand

I would also very much support this... I was surprised to see spreadsheet support & mention of SQL databases, but no clear way to select an SQL table :)