Should we model the refit database if we already have refit-cleaned?

frictionlessdata / pilot-dm4t

Pilot project with DM4T

http://www.cs.bath.ac.uk/dm4t/index.shtml


Should we model the refit database if we already have refit-cleaned? #12

Status: Closed (danfowler closed this issue 7 years ago)

danfowler commented 7 years ago

We have two versions of the "REFIT: Electrical Load Measurements" dataset, one clean and one raw. My feeling is that our Frictionless Data specifications and tooling are most useful for describing clean, well-structured data in a standard way, so I would opt to model only the refit-cleaned dataset and drop the raw version.

danfowler commented 7 years ago

I'd argue that Data Packages are a really great way to publish clean versions of datasets. In some ways, "packaging" a dataset in a Data Package implies cleaning it to the point that it can be well described. Given that we already have a clean version of this data to publish, I will move to drop the non-cleaned REFIT dataset.

What do you think, @jobarratt @cblop?

pwalsh commented 7 years ago

@danfowler can this now be closed?

cblop commented 7 years ago

Describing the clean data is fine for the pilot.

However, we've discussed universities sharing cleaned vs raw datasets when visiting Loughborough and Strathclyde recently. The consensus seems to be that sharing raw data (or both raw and cleaned) is best, as each university would clean their data in different ways, and the only person you can rely on to clean it the way you want is yourself. Therefore, the raw data should always be made available in a research data repository.

danfowler commented 7 years ago

Thanks @cblop for the insight. I agree that, for the pilot, we will provide the metadata for the already-cleaned datasets and minimally messy datasets. Perhaps in the ultimate write-up, I can outline a path (that may or may not exist yet) from a more-than-minimally messy dataset to a curated Data Package using our tooling.