avoindata / mml

Data from Maanmittauslaitos (the National Land Survey of Finland)

Update all scripts and data #1

Open jlehtoma opened 7 years ago

jlehtoma commented 7 years ago

Things have changed at Kapsi and this repo should be updated accordingly.

antagomir commented 7 years ago

If only the location has changed then this may not be such a big deal. Otherwise it might be.

jlehtoma commented 7 years ago

The URLs listed in https://github.com/avoindata/mml/blob/master/rscripts/Kapsi/kapsi2rdata.R are still kosher, so fortunately there is no need for a bigger update. It would be good to have all the data in per-year directories; right now there seems to be some redundancy in the repo, e.g. the years 2012 and 2016 have their own directories, and then there is e.g. Yleiskartta-1000, which is also found under 2012.
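For reference, a quick sketch of how the URLs could be re-checked programmatically (the example URL is illustrative and the httr-based check is not part of kapsi2rdata.R):

```r
# Sketch: verify that Kapsi download URLs still respond.
# The example URL below is illustrative, not copied from kapsi2rdata.R.
library(httr)

urls <- c("http://kartat.kapsi.fi/files/maastotietokanta/kaikki/etrs89/gml/")
ok <- vapply(urls, function(u) identical(status_code(HEAD(u)), 200L), logical(1))
ok  # TRUE for every URL that is still reachable
```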

jlehtoma commented 7 years ago

I see, so e.g. Yleiskartta-1000 is the current version and anything under 2012 is archived?

antagomir commented 7 years ago

Yes. The 2012 folder was added after someone requested the old versions once I had removed them, but that was years ago. I doubt anyone really needs the 2012 folder any more, so it could be removed for clarity as well. I am not aware of any apps other than the ropengov/louhos packages that use this data resource, so I think even the file/folder structure can be improved/changed if necessary.

antagomir commented 7 years ago

Is the consensus now that the data will be fetched directly from github, or shall we find another host or even release a separate data package?

jlehtoma commented 7 years ago

I would keep the old versions if there's space. In fact, it's a shame we haven't actively collected these. AFAIK, no organization keeps a (public) record of these changing datasets, which could actually be very useful in studies.

antagomir commented 7 years ago

Right. We can try to collect these from now on. I'm not sure how often the data is updated; collecting annually might be sufficient.

jlehtoma commented 7 years ago

You may already know my take on the hosting issue 😉 I don't know about the consensus. I don't think a separate data package is really necessary, although hosting one using drat would simplify certain things.

antagomir commented 7 years ago

A data package would potentially reduce network traffic and speed up execution in some cases, but I'm not sure how essential that would be. GitHub is certainly fine until proven otherwise.

jlehtoma commented 7 years ago

Since the data doesn't update that often (at most annually, I guess), a data package would simplify things at least because:

  1. Less need for downloading/caching, as the user would just install the (data) package once until an update is issued.
  2. Versioning and updating becomes easier.
  3. Documenting the data becomes easier.

However, there would be a bit of a conceptual shift for the packages using mml should it transform from a data store into a data package. Packages depending on it (such as gisfin) would be less like "API packages" and more like conventional packages (this is not an issue as such). We would also be packaging somebody else's data. I don't know, I need to think about it more.
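To illustrate point 3 above: documenting a bundled dataset in a data package could be as simple as a roxygen2 block. This is only a sketch; the object name and field description are hypothetical.

```r
# Sketch of roxygen2 documentation for one bundled dataset, e.g. in R/data.R
# of a hypothetical data package. The object name "yleiskartta_1000" is made up.

#' Yleiskartta 1:1 000 000
#'
#' General map layers preprocessed from the MML Yleiskartta-1000 product.
#'
#' @format A list of spatial objects, one per map layer.
#' @source National Land Survey of Finland (MML), via http://kartat.kapsi.fi/
"yleiskartta_1000"
```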

antagomir commented 7 years ago

The MML license should perfectly allow data packaging as far as I see. Even the current solution is not really an "API package" since it is based on our pre-processed and independently distributed RData files rather than the MML service. To achieve conceptual clarity, the package functions should download the data straight from Kapsi/MML and perform preprocessing on the fly. But that is not practical. If we rely on our own preprocessed data anyway then I am not sure if it makes a big difference whether the data is hosted in Github or in a data package. Any pragmatic solution is fine.

jlehtoma commented 7 years ago

> The MML license should perfectly allow data packaging as far as I see.

Yes, it does. Ideally the data provider would handle packaging the data, but in this case we could be just as good (or better!).

> Even the current solution is not really an "API package" since it is based on our pre-processed and independently distributed RData files rather than the MML service.

Yep, but the data is still loaded on an as-needed basis. It's worth noting that Kapsi is not an MML service either, which makes this even less API-like.

> To achieve conceptual clarity, the package functions should download the data straight from Kapsi/MML and perform preprocessing on the fly. But that is not practical. If we rely on our own preprocessed data anyway then I am not sure if it makes a big difference whether the data is hosted in Github or in a data package. Any pragmatic solution is fine.

I agree that downloading the data every time is not practical. However, as a user I would like the package to do as little filtering (i.e. subsetting of the data) as possible. Value-adding pre-processing (fixing strings, setting types etc.) is great, as long as it's clear what was done. In this sense a data package might be a very good solution, as it enables good documentation, versioning and provenance (i.e. distributing the code). Currently it's a bit unclear where the data is coming from (in the mml repo the original references are well handled) and what has been done to it.

jlehtoma commented 7 years ago

If we package the data using drat, it will still be hosted on Github.

jlehtoma commented 7 years ago

+1 for data package, in other words.
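For the drat route, the workflow could look roughly like the sketch below. The package name "mmldata" and the "avoindata" drat repository are assumptions for illustration, not anything decided here.

```r
# Sketch of the drat workflow; "mmldata" and the "avoindata" drat repo are
# hypothetical names used only for illustration.

# Maintainer side: build the data package and insert the tarball into a drat
# repository served from the gh-pages branch of a GitHub repo.
drat::insertPackage("mmldata_0.1.0.tar.gz", repodir = "~/git/drat")

# User side: register the drat repo once, then install/update like any CRAN package.
drat::addRepo("avoindata")      # adds https://avoindata.github.io/drat to the repos
install.packages("mmldata")
library(mmldata)
```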

antagomir commented 7 years ago

Yes, an R data package is starting to seem like a good solution. I do not know whether a CRAN data package or a GitHub drat package is better. If there is no added value from hosting on CRAN, then drat might be optimal, as updates will be easier.

Or we could ask Kapsi to add our scripts to their pipeline and host the data files/packages (for MML this is not realistic, I think). But that would add some overhead, and any changes/updates would be heavier to make.

As a side note, Feather would work great for sharing data frames, but I guess not in this case (shapefiles).

jlehtoma commented 7 years ago

> Yes, an R data package is starting to seem like a good solution. I do not know whether a CRAN data package or a GitHub drat package is better. If there is no added value from hosting on CRAN, then drat might be optimal, as updates will be easier.

Obviously having the package on CRAN wouldn't hurt, as long as 1) the size limit (~5 MB) is not an issue, and 2) one is willing to get an angry email from BDR :smile:.

> Or we could ask Kapsi to add our scripts to their pipeline and host the data files/packages (for MML this is not realistic, I think). But that would add some overhead, and any changes/updates would be heavier to make.

Might be a bit overkill, yes.

> As a side note, Feather would work great for sharing data frames, but I guess not in this case (shapefiles).

Well, if we switch completely to sf objects (which I think we should), then they are in essence data.frames. If Feather files compress well, this might be a good option. AFAIK, Feather is mostly meant for interoperability between e.g. R and Python and not for long-term storage, so there's that.

antagomir commented 7 years ago

R data packages hosted on CRAN can exceed the typical 5 MB size limit if we can motivate the need to BDR. Perhaps it's good to start with GitHub + drat and move to CRAN later if it seems useful.

I'm not sure if RData or Rds files are any better for long-term storage than Feather, except for the fact that Feather is still under development and may hence be less stable. OK, perhaps Rds files would be the best choice here for now (saveRDS / readRDS).
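As a concrete comparison of the options discussed above, a sketch like this shows the difference; the shapefile path and object names are placeholders, not files from this repo.

```r
# Sketch comparing Rds and Feather for spatial data; the shapefile path and
# object names are placeholders.
library(sf)

municipalities <- st_read("data/kunta4500k.shp")   # hypothetical MML layer

# Rds preserves the full sf object (geometry column, CRS, attributes).
saveRDS(municipalities, "kunta4500k.rds", compress = "xz")
municipalities2 <- readRDS("kunta4500k.rds")

# Feather handles plain data frames only, so the geometry would have to be
# flattened first, e.g. to WKT text; one reason Rds is simpler here.
flat <- sf::st_drop_geometry(municipalities)
flat$geometry_wkt <- sf::st_as_text(sf::st_geometry(municipalities))
feather::write_feather(flat, "kunta4500k.feather")
```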

jlehtoma commented 7 years ago

+1 to everything.