danfowler closed 7 years ago
I'd argue that Data Packages are a really great way to publish clean versions of datasets. In some ways, "packaging" a dataset in a Data Package implies cleaning it to the point that it can be well described. Given that we already have a clean version of this data to publish, I will move to drop the non-cleaned REFIT dataset.
What do you think @jobarratt @cblop ?
@danfowler can this now be closed?
Describing the clean data is fine for the pilot.
However, we've discussed universities sharing cleaned vs raw datasets when visiting Loughborough and Strathclyde recently. The consensus seems to be that sharing raw data (or both raw and cleaned) is best, as each university would clean their data in different ways, and the only person you can rely on to clean it the way you want is yourself. Therefore, the raw data should always be made available in a research data repository.
Thanks @cblop for the insight. I agree that, for the pilot, we will provide the metadata for the already-cleaned datasets and minimally messy datasets. Perhaps in the ultimate write-up, I can outline a path (that may or may not exist yet) from a more-than-minimally messy dataset to a curated Data Package using our tooling.
We have two versions of the "REFIT: Electrical Load Measurements" dataset, one clean and one raw. My feeling is that our Frictionless Data specifications and tooling are most useful for describing clean, well-structured data in a standard way, so I would opt to model only the refit-cleaned dataset and drop the raw version.
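
For illustration, here is a rough sketch of what describing the clean version "in a standard way" could look like as a Data Package descriptor. It uses only the Python standard library rather than any Frictionless tooling. The file path `data/house_1.csv`, the resource name `house-1`, and the column names `time` and `aggregate` are assumptions made up for this example, not taken from the actual REFIT files.

```python
import json


def build_descriptor(csv_path: str) -> dict:
    """Return a minimal Data Package descriptor for one cleaned CSV file.

    The resource name and the field names and types below are hypothetical
    placeholders; the real column layout of the REFIT data may differ.
    """
    return {
        "name": "refit-cleaned",
        "title": "REFIT: Electrical Load Measurements (cleaned)",
        "resources": [
            {
                "name": "house-1",  # hypothetical resource name
                "path": csv_path,
                "format": "csv",
                "schema": {
                    "fields": [
                        # Hypothetical columns: a timestamp and a power reading.
                        {"name": "time", "type": "datetime"},
                        {"name": "aggregate", "type": "integer"},
                    ]
                },
            }
        ],
    }


# Hypothetical path to one of the cleaned CSV files.
descriptor = build_descriptor("data/house_1.csv")

# The serialized descriptor would normally be saved as datapackage.json
# alongside the data files.
print(json.dumps(descriptor, indent=2))
```

A descriptor like this only works well when every column has a single, predictable type, which is part of why the cleaned version is the easier one to model.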