frictionlessdata / datapackage-r

An R package for working with Data Package.
https://frictionlessdata.github.io/datapackage-r/

How to create a data package from data.frame with known meta information #13

Status: Closed (HeidiSeibold closed this issue 5 years ago)

HeidiSeibold commented 6 years ago

I have the following data set:

library("OpenML")
omldat <- getOMLDataSet(data.id = 40505)
str(omldat$data)
#> 'data.frame':    86 obs. of  10 variables:
#>  $ counts     : num  0 0 0 0 0 2 0 0 0 0 ...
#>  $ age        : num  120 120 120 120 120 200 200 200 120 200 ...
#>  $ coverstorey: num  80 70 90 90 90 80 70 75 90 55 ...
#>  $ coverregen : num  60 90 70 20 20 80 90 60 20 100 ...
#>  $ meanregen  : num  7 3 6 7 4 5 5 5 4 9 ...
#>  $ coniferous : num  20 25 40 5 1 0 0 0 0 1 ...
#>  $ deadtree   : num  0 1 0 0 0 1 0 0 0 0 ...
#>  $ cbpiles    : num  4 2 7 11 11 3 1 2 9 2 ...
#>  $ ivytree    : num  0 0 0 0 0 0 0 0 9 0 ...
#>  $ fdist      : num  100 10 110 10 110 30 40 60 20 60 ...

I have additional meta information:

str(omldat$desc)
#> List of 24
#>  $ id                      : int 40505
#>  $ name                    : chr "treepipit"
#>  $ version                 : chr "1"
#>  $ description             : chr "Data on the population density of tree pipits, Anthus trivialis, in Franconian oak forests including variables "| __truncated__
#>  $ format                  : chr "ARFF"
#>  $ creator                 : chr ""
#>  $ contributor             : chr NA
#>  $ collection.date         : chr "March to June 2002"
#>  $ upload.date             : POSIXct[1:1], format: "2016-09-07"
#>  $ language                : chr "English"
#>  $ licence                 : chr "GPL-2"
#>  $ url                     : chr "https://www.openml.org/data/v1/download/4552968/treepipit.arff"
#>  $ default.target.attribute: chr "counts"
#>  $ row.id.attribute        : chr NA
#>  $ ignore.attribute        : chr NA
#>  $ version.label           : chr NA
#>  $ citation                : chr "Müller, J. and Hothorn, T. (2004). Maximally selected two-sample statistics as a new tool for the identificatio"| __truncated__
#>  $ visibility              : chr "public"
#>  $ original.data.url       : chr NA
#>  $ paper.url               : chr NA
#>  $ update.comment          : chr NA
#>  $ md5.checksum            : chr "4645a65d33c5c3a65121eb2afe5e8866"
#>  $ status                  : chr "active"
#>  $ tags                    : chr NA
#>  - attr(*, "class")= chr "OMLDataSetDescription"

The meta information includes things that can also be represented in data packages, such as description and licence as well as something like citation, which I assume should be the sources field in the data package.
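As a rough illustration of that mapping, the sketch below builds a plain R list shaped like a Data Package descriptor. It uses only base R and a small stand-in for `omldat$desc` (the truncated description and citation are replaced by placeholders). Putting the citation into `sources` follows the assumption stated above, and whether "GPL-2" is an acceptable licence name for a given validator is not checked here.

```r
# Stand-in for omldat$desc; values come from the str() output above,
# with placeholders where that output is truncated.
desc <- list(
  name        = "treepipit",
  version     = "1",
  description = "placeholder description",
  licence     = "GPL-2",
  citation    = "placeholder citation",
  url         = "https://www.openml.org/data/v1/download/4552968/treepipit.arff"
)

# Map the OpenML fields onto Data Package descriptor properties.
# `licenses` and `sources` are arrays of objects in the Data Package
# specification, hence the nested list(list(...)).
descriptor <- list(
  profile     = "tabular-data-package",
  name        = tolower(desc$name),
  title       = desc$name,
  version     = desc$version,
  description = desc$description,
  licenses    = list(list(name = desc$licence)),
  # Assumption: the citation is stored as the title of a source entry.
  sources     = list(list(title = desc$citation, path = desc$url))
)

str(descriptor, max.level = 1)
```

A list like this could then be passed as the `descriptor` argument in place of the shorter one used later in this thread.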

I have been trying for a little while now to create a Package object from this with all the meta data. Using the vignette I was not able to do this. Can you help me out there? I would very much appreciate this :cake:

Background info: we would like to allow exporting OpenML datasets to data packages (https://github.com/openml/OpenML/issues/482).

kleanthisk10 commented 6 years ago

Is this the error you get: "Error: No method asJSON S3 class: OMLDataSetDescription"?

HeidiSeibold commented 6 years ago

I don't get an error; I just don't know where to start.

I got this far:

library("datapackage.r")
dataPackage <- Package.load(descriptor = list(profile = "tabular-data-package",
                                              title = omldat$desc$name,
                                              name = omldat$desc$name))

Now, where do I pass in the data.frame, and how do I add the information on the variable types (or is that inferred automatically for data.frames)?
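Independently of what the package infers on its own, the variable types can be written out explicitly as a Table Schema `fields` list. The sketch below is base R only and uses a toy data.frame in place of `omldat$data`. The helper names `infer_field_type` and `build_schema` are made up for illustration and are not part of datapackage.r.

```r
# Hypothetical helper: map an R column to a Table Schema type name.
# Factors are checked first because is.integer() is TRUE for them.
infer_field_type <- function(column) {
  if (is.factor(column) || is.character(column)) return("string")
  if (is.logical(column)) return("boolean")
  if (inherits(column, "Date")) return("date")
  if (inherits(column, "POSIXct")) return("datetime")
  if (is.integer(column)) return("integer")
  if (is.numeric(column)) return("number")
  "string"  # fallback for anything not covered above
}

# Hypothetical helper: build a schema list with one field entry per column.
build_schema <- function(df) {
  fields <- lapply(names(df), function(col) {
    list(name = col, type = infer_field_type(df[[col]]))
  })
  list(fields = fields)
}

# Toy stand-in for the first rows of omldat$data.
toy <- data.frame(counts = c(0, 0, 2), age = c(120, 120, 200))
schema <- build_schema(toy)
str(schema)
```

The resulting list could be attached to a resource as its `schema` entry; whether the package validates it as intended would need to be checked against the package documentation.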

HeidiSeibold commented 6 years ago

I can't use the basePath argument since I have the data set already loaded as a data.frame. All your examples show how to work with CSV files; none show how to work with data already loaded into R.

kleanthisk10 commented 6 years ago

Inputs to datapackage.r should be lists or JSON. One option is to combine the metadata and the data into a single list and convert it to JSON.

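library(datapackage.r)

# Combine the metadata and the data.frame into a single resource list
resource <- append(omldat$desc, list(data = omldat$data))
# Wrap the resource in a descriptor and convert it to JSON
descriptor <- list(resources = list(resource))
descriptor <- helpers.from.list.to.json(descriptor)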

Then you could instantiate your Package class:

datapackage = Package.load(descriptor)

For example, you can read your data back as a list:

datapackage$resources[[1]]$read()

or as a data frame:

jsonlite::fromJSON(helpers.from.list.to.json(datapackage$resources[[1]]$read()))
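To make the list-to-data-frame round trip more concrete without depending on the package, here is a base R sketch of one possible shape for inline resource data: a list with one named entry per row. The helpers `df_to_rows` and `rows_to_df` are hypothetical names, and it is an assumption that the package's `read()` output can be handled the same way.

```r
# Toy stand-in for omldat$data.
toy <- data.frame(counts = c(0, 0, 2), age = c(120, 120, 200))

# Hypothetical helper: turn a data.frame into a list of row records.
df_to_rows <- function(df) {
  lapply(seq_len(nrow(df)), function(i) as.list(df[i, , drop = FALSE]))
}

# Hypothetical helper: rebuild a data.frame from a list of row records.
rows_to_df <- function(rows) {
  do.call(rbind, lapply(rows, function(r) {
    as.data.frame(r, stringsAsFactors = FALSE)
  }))
}

# A resource that carries its data inline instead of pointing at a file.
resource <- list(name = "treepipit", data = df_to_rows(toy))

back <- rows_to_df(resource$data)
str(back)
```

Row-wise records are convenient for JSON but slow to rebuild for large tables, so for bigger data sets writing a CSV file and referencing it by path may be the more practical route.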