Following data in some repo/by some URI #93

Open KOLANICH opened 11 months ago

Brief overview

Imagine that some repos on GitHub, GitLab and other source code hosting services contain datasets kept there for testing purposes, though those datasets can be useful for other purposes too. The tables in those repos are then the primary source of truth.

A user may want a nice web GUI and other extended capabilities for the data. He can upload it to your service.

But that would require him to re-upload the data to your service each time he makes a commit. Even setting aside that the service currently has no feature to do this at all, it is an additional burden. And storing data files in git repos is quite convenient anyway.

Additional details

So the proposal is as follows.

Introduce 2 additional kinds of "table": git and URL.
- A git table is a direct reference to a git repo.
- All the paths should be enabled manually, though.
- The HTTPS scheme is a must.
- The git partial clone protocol (already supported by GitHub for quite some time) can be used to fetch single files from repos.
- A URL table is a direct reference to a URL, fetched using any library that deals with HTTP.
- In both cases a repo hook can be used to trigger fetches.
- If there is no repo hook, the service can itself check the table for updates using the appropriate means.
- The difference between git and URL is that the service can use the git protocol to fetch some info about commit history, like authorship information, activity statistics, schema changes and previous versions.
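
To make the two proposed kinds of table concrete, here is a minimal sketch of how they might be modelled. The class names `GitSource` and `UrlSource`, the `enabled_paths` field and the `allows` method are hypothetical and not part of csvbase. Applying the HTTPS requirement to both kinds is also an assumption; the proposal states it without saying which kind it covers.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


def _require_https(url: str) -> str:
    """Reject anything that is not an https:// URL."""
    if urlparse(url).scheme != "https":
        raise ValueError(f"only https:// URLs are accepted, got {url!r}")
    return url


@dataclass(frozen=True)
class GitSource:
    """Hypothetical 'git' table: a repo plus the paths enabled by hand."""

    repo_url: str
    enabled_paths: frozenset = field(default_factory=frozenset)

    def __post_init__(self):
        _require_https(self.repo_url)

    def allows(self, path: str) -> bool:
        # Only paths the user enabled explicitly may be served as tables.
        return path in self.enabled_paths


@dataclass(frozen=True)
class UrlSource:
    """Hypothetical 'URL' table: a single file fetched over HTTP(S)."""

    url: str

    def __post_init__(self):
        _require_https(self.url)
```

With this shape, `GitSource("https://github.com/example/repo.git", frozenset({"data/a.csv"})).allows("data/a.csv")` is true, while constructing either class with an `http://` URL raises `ValueError`.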

This is a good idea. There is some groundwork towards this - the "userdata" storage is pluggable (but so far - only postgres). I initially thought I might provide some other, columnar, option (eg clickhouse) but probably HTTP is a better first step.

There is some work to do on caching, updating, etc. It's obviously not great to do that as part of a request-response cycle, so perhaps a background worker would be needed.

GitHub will serve files over HTTP. So HTTP first?

If fetching is triggered by a hook, either way is almost equally easy to implement. If fetching is triggered by csvbase itself on a timer, then one has to check the remote party for changes in the file content, and that is a can of worms: a remote HTTP server can be implemented to behave any way it wants (there are several headers for that; https://pypi.org/project/requests-cache/ may be useful). So IMHO the fast solution is either to restrict URLs to a finite set of well-known code hosting services with known behaviour, or to implement git first, since its blobs are content-addressable.
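
The contrast between the two change-detection approaches can be sketched as follows. The function names are hypothetical. `conditional_headers` builds the standard HTTP validators (`If-None-Match`, `If-Modified-Since`), which only help when the remote server chooses to honour them. `git_blob_id` computes the ID git assigns to a file's content, assuming a repository that uses the default SHA-1 object format.

```python
import hashlib


def conditional_headers(etag=None, last_modified=None):
    """Headers for a conditional GET; a cooperating server answers 304 if unchanged."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def git_blob_id(content: bytes) -> str:
    """ID of a git blob: SHA-1 over 'blob <size>\\0' followed by the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

With git, comparing the blob ID recorded at the last fetch against the current one is enough to tell whether the file changed, without trusting any server-side caching behaviour.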

Work on this is coming together.

As it stands: you can create read-only tables that are based on GitHub repos: https://csvbase.com/calpaterson/top-pypi-packages-30-days. Write support is planned (author-email: your user's email address). Hosts other than GitHub are also planned, and the mechanism is likely fairly portable. It uses "blobless" cloning. I couldn't see a way to clone single files.
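
For readers unfamiliar with "blobless" cloning, here is a rough sketch of what it looks like with the git command line driven from Python. This is not csvbase's actual code; the function names are hypothetical. It relies on `git clone --filter=blob:none`, which downloads commits and trees but defers file contents, and on git fetching a missing blob on demand when `git show` asks for it.

```python
import subprocess


def blobless_clone_cmd(repo_url: str, dest: str) -> list:
    """Command for a partial clone that skips file contents and the checkout."""
    return ["git", "clone", "--filter=blob:none", "--no-checkout", repo_url, dest]


def show_file_cmd(dest: str, path: str, rev: str = "HEAD") -> list:
    """Command that prints one file at a revision, fetching its blob lazily."""
    return ["git", "-C", dest, "show", f"{rev}:{path}"]


def fetch_file(repo_url: str, dest: str, path: str) -> bytes:
    """Clone without blobs, then read a single file's bytes (needs network access)."""
    subprocess.run(blobless_clone_cmd(repo_url, dest), check=True)
    result = subprocess.run(show_file_cmd(dest, path), check=True, capture_output=True)
    return result.stdout
```

The clone still transfers the full commit and tree history, so this is not a true single-file fetch, but the history it keeps is what allows authorship and previous versions to be inspected later.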

Git-backed tables are updated periodically, every 30 minutes. Reducing this delay by using GitHub's webhooks is planned. I probably won't add support for other hosts' webhook mechanisms.
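
Since webhook support is only planned, the following is a sketch of one piece any receiver would likely need rather than a description of csvbase: checking that a delivery really came from GitHub. GitHub signs each payload with HMAC-SHA256 using the webhook secret and sends the result in the `X-Hub-Signature-256` header, prefixed with `sha256=`. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Signature in the format GitHub sends in X-Hub-Signature-256."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, body: bytes, header_value) -> bool:
    """Compare the expected signature with the received header in constant time."""
    if not header_value:
        return False
    return hmac.compare_digest(sign(secret, body).encode(), header_value.encode())
```

Only after this check passes would it make sense to schedule an early refresh of the table instead of waiting for the next periodic update.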

S3, HTTP(S) and SQL are also planned.

This is a complicated nest of features but it's at the top of the to-do list now and support for this will grow rapidly.

calpaterson / csvbase
