Open jlehtoma opened 7 years ago
If only the location has changed then this may not be such a big deal. Otherwise it might be.
The URLs listed in https://github.com/avoindata/mml/blob/master/rscripts/Kapsi/kapsi2rdata.R are still valid, so fortunately there is no need for a bigger update. It would be good to have all data in directories per year. Right now there seems to be some redundancy in the repo, e.g. years 2012 and 2016 separately and then e.g. Yleiskartta-1000, which is also found in 2012.
I see, so e.g. Yleiskartta-1000 is the current version and anything in 2012 is in the archive?
Yes. The 2012 folder was added after someone requested the old versions after I had removed them. But this was years ago. I doubt that anyone really needs the 2012 folder any more, so it could be removed for clarity as well. I am not aware of any apps other than the ropengov/louhos packages that use this data resource, so I think even the file/folder structure can be improved or changed if necessary.
Is the consensus now that the data will be fetched directly from github, or shall we find another host or even release a separate data package?
I would keep the old versions if there's space. In fact, it's a shame we haven't actively collected these. AFAIK, no one keeps a (public) record of changing datasets, which may actually be very useful in studies.
Right. We can try to collect these from now on. Not sure how often the data is updated. Collecting annually might be sufficient.
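A minimal sketch of how an annual snapshot into per-year directories could look, using only base R. The helper name `archive_snapshot`, its arguments, and the assumption that the files to keep end in `.RData` are illustrative and not taken from the repo.

```r
# Hypothetical helper: copy the current .RData files into <archive_root>/<year>/.
# Existing files in the archive are never overwritten, so old snapshots stay intact.
archive_snapshot <- function(src_dir, archive_root,
                             year = format(Sys.Date(), "%Y")) {
  dest <- file.path(archive_root, year)
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)

  # Only regular data files are picked up; subdirectories are ignored.
  files <- list.files(src_dir, pattern = "\\.RData$", full.names = TRUE)
  copied <- file.copy(files, dest, overwrite = FALSE)

  # Return the paths of the files that were actually copied.
  invisible(file.path(dest, basename(files))[copied])
}
```

Running this once a year (for example from a scheduled job) would leave one folder per year without touching earlier ones.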
You may already know my take on the hosting issue 😉 I don't know about the consensus. I don't think a separate data package is really necessary, although hosting one using drat would simplify certain things.
A data package would potentially reduce network traffic and speed up execution in some cases, but I am not sure how essential this would be. GitHub is certainly fine until proven otherwise.
Since the data doesn't update so often (max annually I guess), a data package would simplify things at least because:
However, there would be a bit of a conceptual shift for the packages using mml should it transform from a data store to a data package. Packages depending on it (such as gisfin) would be less "API packages" and more like conventional packages (this is not an issue as such). We would also be packaging somebody else's data. Don't know, need to think about it more.
The MML license should perfectly allow data packaging as far as I see. Even the current solution is not really an "API package" since it is based on our pre-processed and independently distributed RData files rather than the MML service. To achieve conceptual clarity, the package functions should download the data straight from Kapsi/MML and perform preprocessing on the fly. But that is not practical. If we rely on our own preprocessed data anyway then I am not sure if it makes a big difference whether the data is hosted in Github or in a data package. Any pragmatic solution is fine.
> The MML license should perfectly allow data packaging as far as I see.
Yes, it does. Ideally the data provider deals with packaging the data, but in this case we could be just as good (or better!).
> Even the current solution is not really an "API package" since it is based on our pre-processed and independently distributed RData files rather than the MML service.
Yep, but the data is still loaded on an as-needed basis. It's worth noting that Kapsi is not an MML service either, which makes this even less API-like.
> To achieve conceptual clarity, the package functions should download the data straight from Kapsi/MML and perform preprocessing on the fly. But that is not practical. If we rely on our own preprocessed data anyway then I am not sure if it makes a big difference whether the data is hosted in Github or in a data package. Any pragmatic solution is fine.
I agree that downloading the data every time is not practical. However, as a user I would like the package to do as little filtering as possible (i.e. subsetting data). Value-adding pre-processing (fixing strings, setting types etc.) is great, as long as it's clear what was done. In this sense a data package might be a very good solution, as it enables good documentation, versioning and provenance (i.e. distributing the code). Currently, it's a bit unclear where the data is coming from (the mml repo; the original references are well handled) and what has been done to it.
If we package the data using drat, it will still be hosted on GitHub.
+1 for data package, in other words.
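For illustration, a rough sketch of what the drat route could look like on both sides. `drat::insertPackage()` and `drat::addRepo()` are existing drat functions; the wrapper names, the account `"ropengov"` and the package name `"mmldata"` are placeholders assumed for this example and have not been agreed on in this thread.

```r
# Maintainer side (hypothetical wrapper): add a built source tarball,
# e.g. the output of `R CMD build`, to a local clone of the drat repository.
# The clone is then committed and pushed to GitHub as usual.
publish_data_pkg <- function(tarball, repodir) {
  drat::insertPackage(tarball, repodir = repodir)
}

# User side (hypothetical wrapper): register the drat repository of a
# GitHub account and install the data package from it like any other package.
install_data_pkg <- function(account = "ropengov", pkg = "mmldata") {
  drat::addRepo(account)
  install.packages(pkg)
}
```

Nothing is executed when these definitions are sourced; drat only needs to be installed once one of the wrappers is actually called.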
Yes an R data package starts to seem a good solution. I do not know whether CRAN data package or Github drat package is better. If there is no added value from hosting at CRAN then drat might be optimal as updates will be easier.
Or we could ask Kapsi to add our scripts in their pipeline and host the data files/packages (for MML it is not realistic I think). But that would add some overhead and any changes/updates would be heavier to make.
As a side note, Feather would work great for sharing data frames but not in this case (shapefiles) I guess.
> Yes an R data package starts to seem a good solution. I do not know whether CRAN data package or Github drat package is better. If there is no added value from hosting at CRAN then drat might be optimal as updates will be easier.
Obviously having the package on CRAN wouldn't hurt, as long as 1) the size limit (~5 MB) is not an issue, and 2) one is willing to get an angry email from BDR :smile: .
> Or we could ask Kapsi to add our scripts in their pipeline and host the data files/packages (for MML it is not realistic I think). But that would add some overhead and any changes/updates would be heavier to make.
Might be a bit overkill, yes.
> As a side note, Feather would work great for sharing data frames but not in this case (shapefiles) I guess.
Well, if we switch completely to sf objects (which I think we should), then they are in essence data.frames. If Feather files compress well, this might be a good option. AFAIK, Feather is mostly meant for interoperability between e.g. R and Python and not for long-term storage, so there's that.
R data packages hosted on CRAN can exceed the typical 5 MB size limit if we can justify the need to BDR. Perhaps it is good to start with GitHub + drat and move to CRAN later if it seems useful.
Not sure if RData or Rds are any better for long-term storage than Feather, except for the fact that Feather is still under development and may hence be less stable. OK, perhaps Rds files would be the best option here for now (saveRDS / readRDS).
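A small sketch of the saveRDS / readRDS round trip mentioned above, using a toy data frame in place of the real preprocessed data. The object name, the file name and the `provenance` attribute are assumptions made for this example; the attribute is just one possible way to record what was done to the data.

```r
# Toy stand-in for a preprocessed dataset (not real MML data).
municipalities <- data.frame(
  code = c("091", "049"),
  name = c("Helsinki", "Espoo"),
  stringsAsFactors = FALSE
)

# Optional: attach a short provenance note; attributes survive the round trip.
attr(municipalities, "provenance") <- list(
  source    = "placeholder description of the original source",
  processed = "types set, strings fixed"
)

# Write a single object to disk and read it back under any name.
path <- file.path(tempdir(), "municipalities.rds")
saveRDS(municipalities, path)
restored <- readRDS(path)
```

Unlike `save()` / `load()`, `readRDS()` returns the object instead of restoring it under its original name, which makes it harder to overwrite something in the user's workspace by accident.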
+1 to everything.
Things have changed at Kapsi and this repo should be updated accordingly.