ScotGovAnalysis / opendatascot

An R package to pull data from statistics.gov.scot into R
https://scotgovanalysis.github.io/opendatascot/
MIT License
47 stars 6 forks source link

Replace readr with vroom #100

Open GordonBryden opened 3 years ago

GordonBryden commented 3 years ago

Allegedly significantly fast. I'll test it out, and replace it 1 for 1 if succesful.

Robsteranium commented 3 years ago

I would guess that the bottleneck would be getting the data over the wire rather than parsing it per se but I suppose this could have a more significant impact for larger datasets.

It looks like the ods_get_csv function already has a fallback for when readr isn't installed so you might just add vroom as a progressive enhancement rather than replacing readr 1-for-1.

GordonBryden commented 2 years ago

I think that's a reasonable. I've been thinking about how to reduce the number of packages required, and I would like to try and move readr and vroom to optionals, but at present the blockage is that readr is vital for easy handling of the data given to it from the api, while read.csv cases error with some of the use cases. Some further investigation of the issue is required.

Robsteranium commented 2 years ago

Having fewer requirements is a laudable aim but it's increasingly feeling like the tidyverse is R's standard library! Personally, I wouldn't worry about having readr in Imports.

To do the type coercion correctly you really need the data types for each column. You can get this from e.g. sparql results if you request application/sparql-results+json. I had expected to need this when writing the linked-data-frames package but found it easier just to let readr do it! I expect I'll need to revisit this whenever I end up needing to parse dates or durations.

If you want to stick with the csv download then this would really benefit from a csvw annotation to provide the types (e.g. parsing with csvwr). We're keen to add csvw-downloads to PMD at some point but in the mean time it might actually be possible to prepare a generic template to generate annotations for statistics.gov.scot cubes since their shape follows a pattern. This wouldn't catch cases (that readr will handle) where people had put e.g. numbers in code-labels with the intention that they be interpreted numerically.

Can you remember which use cases error with read.csv? These could potentially be bugs in the download pipeline!

Just returning to the original motivation - faster loading - I wonder whether there's another way... vroom achieves the speed-up with lazy-loading which clearly has a benefit when you can avoid reading in data that you will later filter out. Using vroom would still require that you download the whole dataset so that the file could be memory-mapped. Something like dbplyr for sparql would instead let you only download the slice of data you needed. But this requires users to adopt the dplyr API which really puts us into tidyverse!