CJWorkbench / cjworkbench

The data journalism platform with built in training
http://workbenchdata.com

Support for Frictionless Data specs #203

Open loleg opened 4 years ago

loleg commented 4 years ago

Reproducibility and traceability are clearly priorities of this project. Exporting data is one thing, but how about exporting metadata and workflows? I'm not sure if this is the right place to post module ideas, but here goes.

Has anyone looked into integrating Frictionless Data support, e.g. as an import format, to export column definitions (as Table Schema), generating a complete Data Package using datapackage-py, integrating with Goodtables for validation, or even making the JSON feed compatible with dataflows?

It looks like it would be straightforward to start by developing a Python module for data import, but I can't tell if it would be possible to export in that manner as well.

loleg commented 4 years ago

Side note: I kept hearing rave reviews from journalists and tried out your OpenRefine-like tool today. The interface is slick and the project thoughtfully designed, open source, actively maintained. Ticks a lot of boxes. Kudos!

adamhooper commented 4 years ago

We do plan to build a proper workflow for data export and API publishing. Thank you for bringing this Frictionless Data spec to our attention.

I'll close this issue once we support exporting metadata.

Also, thank you for the kind words!

loleg commented 3 years ago

The Frictionless Data project has been striding forward over the past year, with a new web site and refreshed Python framework. Any news on the topic of data export? If you are still interested in adding Data Package export and ideally Table Schema support, I would be happy to pitch in some way.

adamhooper commented 3 years ago

We're definitely still interested in Frictionless -- that's why this issue is open :).

For exporting, Frictionless looks ideal. We have an underlying problem with our APIs that we need to solve first, though. That's #219 and the solution will be a great new feature that exports to different URLs in different formats. Once we have that new API feature, we'll seriously consider giving it Frictionless metadata.

For importing, we'd love to add a Frictionless importer. Are you interested in trying one out? Workbench is still restricted to a single table at a time, so I envision the module as a form with these two fields:

```
datapackage URL: [ https://datahub.io/core/co2-ppm/datapackage.json ]
table:           [ co2-annmean-mlo ]
```
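To make that concrete, here's a minimal sketch of what the module's lookup logic could do once it has `datapackage.json` in hand. The `select_resource` function and the toy descriptor are hypothetical, but the `resources`/`name`/`path` fields follow the Data Package spec:

```python
import json

def select_resource(descriptor_json, table_name):
    """Pick one resource (table) out of a Data Package descriptor.

    `descriptor_json` is the text of datapackage.json; `table_name`
    matches a resource's `name` field per the Data Package spec.
    """
    descriptor = json.loads(descriptor_json)
    for resource in descriptor.get("resources", []):
        if resource.get("name") == table_name:
            return resource
    raise KeyError(f"No resource named {table_name!r}")

# Toy descriptor standing in for a fetched datapackage.json
descriptor = json.dumps({
    "name": "co2-ppm",
    "resources": [
        {"name": "co2-mm-mlo", "path": "data/co2-mm-mlo.csv"},
        {"name": "co2-annmean-mlo", "path": "data/co2-annmean-mlo.csv"},
    ],
})

print(select_resource(descriptor, "co2-annmean-mlo")["path"])
# data/co2-annmean-mlo.csv
```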

[Update]

... for now, users would need to type the table name by hand. Later, Workbench will add the "multi-table" concept: we need it for Excel files and other clients, anyway.

Breaking down the operations:

"Fetch": download the entire data package (the JSON files, I suppose, since CSV nulls are tricky), and store it in a tarfile/zipfile.

"Render": seek to the file in question and parse it using cjwparse. cjwparse doesn't have many options, but it doesn't destroy data, so some post-processing could be done to convert strings to "date", "timestamp", etc.
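A rough sketch of the "Fetch" step under those assumptions: bundle the descriptor plus each resource file into one zipfile, so "Render" can seek to a single member later. The `fetch_package` helper, the `download` callable, and the URLs are hypothetical stand-ins, not Workbench's actual fetch API:

```python
import io
import json
import zipfile

def fetch_package(download, datapackage_url):
    """Bundle datapackage.json and every resource file into one
    in-memory zipfile. `download` is a stand-in for the real HTTP
    fetch: it maps a URL to bytes."""
    descriptor_bytes = download(datapackage_url)
    descriptor = json.loads(descriptor_bytes)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", descriptor_bytes)
        base = datapackage_url.rsplit("/", 1)[0]
        for resource in descriptor.get("resources", []):
            path = resource["path"]  # relative path per the spec
            zf.writestr(path, download(f"{base}/{path}"))
    return buf.getvalue()

# Fake network: two files behind hypothetical URLs.
FILES = {
    "https://example.com/pkg/datapackage.json": json.dumps(
        {"resources": [{"name": "t", "path": "data/t.json"}]}
    ).encode(),
    "https://example.com/pkg/data/t.json": b'[{"x": 1}, {"x": null}]',
}

blob = fetch_package(FILES.__getitem__,
                     "https://example.com/pkg/datapackage.json")
with zipfile.ZipFile(io.BytesIO(blob)) as zf:
    print(zf.namelist())  # ['datapackage.json', 'data/t.json']
```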

Modules to steal from:

(Why not Pandas? Two reasons: first, the Pandas API shifts but our plugin framework doesn't have a Pandas-versioning feature, so we're stuck on v0.25; second, Workbench doesn't support all of Pandas' types anyway, and in our experience it's easier to start from a Workbench-valid Arrow table and refine it than to start with a Workbench-invalid Pandas table and then try to figure out all the ways in which it's invalid.)

If you're interested -- this may be a multi-day experiment -- let me know and we can revamp the docs as you go. They're a bit crufty, and I'd love to give them some love.

adamhooper commented 3 years ago

Hi @loleg,

I'm investigating export formats today, and I have a question.

Our idea is: the user selects which tabs belong in a "published" dataset, and then Workbench will bundle them up and present a suite of URLs for downloading pieces or the whole thing. We'll want both CSV (simple but lossy/tricky) and JSON (inefficient but supports nulls). (Someday, perhaps the Frictionless community will rally around Parquet, which is efficient and supports nulls....)
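A quick stdlib illustration of why CSV is "lossy/tricky" here while JSON "supports nulls": a CSV round trip collapses null and the empty string into the same field, while JSON keeps the distinction.

```python
import csv
import io
import json

rows = [["city", "temp"], ["Montreal", None], ["", -3]]

# CSV: both None and "" serialize to an empty field -- lossy.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
back = list(csv.reader(io.StringIO(buf.getvalue())))
print(back[1][1] == back[2][0])  # True: null vs. empty string is gone

# JSON: null survives the round trip.
print(json.loads(json.dumps(rows))[1][1] is None)  # True
```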

When I look at https://datahub.io/core/covid-19 I see there's one zipfile containing all files as both CSV and JSON. Is this the standard in the Frictionless community? Or would users prefer to choose from two zipfiles: one for CSV, one for JSON?