calpaterson / csvbase

a simple website for sharing table data - with an API
https://csvbase.com
GNU Affero General Public License v3.0

Following data in some repo/by some URI #93

Open KOLANICH opened 11 months ago

KOLANICH commented 11 months ago

Brief overview

Imagine that some repos on GitHub, GitLab and other source code hosts contain datasets intended for testing, though those datasets can be useful for other purposes too. The tables in those repos are then the primary source of truth.

A user may want a nice web GUI and other extended capabilities for the data. They can upload it to your service.

But that would require them to re-upload the data to your service each time they make a commit. Setting aside that the service currently has no feature for this at all, it is an additional burden. In any case, storing data files in git repos is quite convenient.

Additional details

So the proposal is the following.

calpaterson commented 11 months ago

This is a good idea. There is some groundwork towards this: the "userdata" storage is pluggable (though so far only Postgres is implemented). I initially thought I might provide some other, columnar, option (e.g. ClickHouse) but HTTP is probably a better first step.

There is some work to do on caching, updating, etc. It's obviously not great to do that as part of a request-response cycle, so perhaps a background worker would be needed.

Github will serve files over HTTP. So HTTP first?

KOLANICH commented 11 months ago

If fetching is triggered by a hook, either approach is about equally easy to implement. If fetching is triggered by csvbase itself on a timer, then it has to check the remote side for changes in file content, and that is a can of worms: a remote HTTP server can be implemented to behave any way it wants (there are several headers for that; https://pypi.org/project/requests-cache/ may be useful). So IMHO the fast solution is either to restrict URLs to a finite set of well-known code hosts with known behaviour, or to implement git, whose blobs are content-addressable, first.
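
To illustrate the two change-detection options mentioned above, here is a minimal Python sketch. It is not csvbase code: `CachedCopy`, `conditional_headers`, `apply_response` and `git_blob_id` are hypothetical names invented for this example. The first part builds the standard HTTP validators (`If-None-Match`, `If-Modified-Since`) and falls back to comparing bodies, since a server may ignore validators. The second part computes the ID git gives a blob, assuming a repo that uses the default SHA-1 object format.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Mapping, Optional, Tuple


@dataclass(frozen=True)
class CachedCopy:
    """What we remember about the last successful fetch (hypothetical type)."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: Optional[CachedCopy]) -> Dict[str, str]:
    """Build request headers asking the server for a 304 if nothing changed."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def apply_response(
    cached: Optional[CachedCopy],
    status: int,
    headers: Mapping[str, str],
    body: bytes,
) -> Tuple[CachedCopy, bool]:
    """Return (copy to keep, whether the content changed)."""
    if status == 304 and cached is not None:
        return cached, False
    if status != 200:
        raise ValueError(f"unexpected status {status}")
    lowered = {k.lower(): v for k, v in headers.items()}
    fresh = CachedCopy(
        body=body,
        etag=lowered.get("etag"),
        last_modified=lowered.get("last-modified"),
    )
    # A server may ignore validators and always answer 200,
    # so compare the bodies as well before reporting a change.
    changed = cached is None or cached.body != body
    return fresh, changed


def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob (SHA-1 object format assumed)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

With git, an unchanged file keeps the same blob ID across commits, so comparing IDs is enough to tell whether a table needs re-importing. With plain HTTP, the result depends on how well the remote server honours the validators.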

calpaterson commented 4 months ago

Work on this is coming together.

As it stands: you can create read-only tables that are based on GitHub repos (example: https://csvbase.com/calpaterson/top-pypi-packages-30-days). Write support is planned (author-email: your user's email address). Hosts other than GitHub are also planned, and the mechanism is likely fairly portable. It uses "blobless" cloning. I couldn't see a way to clone single files.
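
For readers unfamiliar with "blobless" cloning, the sketch below shows roughly what it looks like with the stock git CLI. It is an illustration only, not csvbase's implementation; the function names are hypothetical. `--filter=blob:none` asks the server to omit file contents at clone time, and the sketch assumes that a later `git show REF:PATH` makes git fetch the one missing blob on demand, which requires a host that supports partial clones.

```python
import subprocess
from typing import List


def blobless_clone_argv(repo_url: str, dest: str) -> List[str]:
    """Command for a clone that downloads commits and trees but no blobs."""
    return ["git", "clone", "--filter=blob:none", "--no-checkout", repo_url, dest]


def show_file_argv(dest: str, ref: str, path: str) -> List[str]:
    """Command that prints one file at one ref from the clone in `dest`."""
    return ["git", "-C", dest, "show", f"{ref}:{path}"]


def read_file_from_repo(repo_url: str, dest: str, path: str, ref: str = "HEAD") -> bytes:
    """Clone without blobs, then read a single file (hypothetical helper)."""
    subprocess.run(blobless_clone_argv(repo_url, dest), check=True)
    result = subprocess.run(
        show_file_argv(dest, ref, path), check=True, capture_output=True
    )
    return result.stdout
```

This still transfers the full commit and tree history, so it is not a true single-file clone, but it avoids downloading every version of every file.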

Git-backed tables are updated periodically, every 30 minutes. Reducing this delay by using GitHub's webhooks is planned. I probably won't add support for other hosts' webhook mechanisms.
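
As a rough illustration of the two update triggers described here, the sketch below pairs a 30-minute "is this table due?" check with verification of a GitHub webhook delivery. The function names and the constant are hypothetical, and nothing here is taken from csvbase. The signature check relies on GitHub's documented `X-Hub-Signature-256` header, which carries `sha256=` followed by the hex HMAC-SHA256 of the request body, keyed with the webhook secret.

```python
import hashlib
import hmac
from datetime import datetime, timedelta
from typing import Optional

# Polling interval taken from the comment above; the name is an assumption.
REFRESH_INTERVAL = timedelta(minutes=30)


def is_due(
    last_checked: Optional[datetime],
    now: datetime,
    interval: timedelta = REFRESH_INTERVAL,
) -> bool:
    """True if the table was never checked or the interval has elapsed."""
    return last_checked is None or now - last_checked >= interval


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a webhook body against the X-Hub-Signature-256 header value."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison, done on bytes so odd header values cannot raise.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

A background worker could call `is_due` for each git-backed table on every tick, while a verified webhook delivery could mark the matching table for refresh immediately instead of waiting for the next poll.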

S3, HTTP(S) and SQL are also planned.

This is a complicated nest of features but it's at the top of the to-do list now and support for this will grow rapidly.