KOLANICH opened 11 months ago
This is a good idea. There is some groundwork towards this: the "userdata" storage is pluggable (though so far only Postgres). I initially thought I might provide some other, columnar, option (e.g. ClickHouse), but HTTP is probably a better first step.
There is some work to do on caching, updating, etc. It's obviously not great to do that as part of a request-response cycle; a background worker would probably be needed.
GitHub will serve files over HTTP. So HTTP first?
If fetching is triggered by a hook, any approach is about equally easy to implement. If fetching is triggered by csvbase itself on a timer, then one has to check the remote party for changes in file content, and that is a can of worms: a remote HTTP server can be implemented to behave any way it wants (there are several headers for that; https://pypi.org/project/requests-cache/ may be useful). So IMHO the fast solution is either to restrict URLs to a finite set of well-known code hostings with certain known behaviour, or to implement git, whose blobs are content-addressable, first.
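To illustrate the can-of-worms aspect: polling over plain HTTP means revalidating with whatever cache headers the remote server happens to honour. A minimal standard-library sketch of conditional fetching (the cache shape and function names here are illustrative, not csvbase's actual code):

```python
# Conditional-GET polling sketch: relies on the remote server honouring
# ETag / Last-Modified, which (as noted above) is not guaranteed.
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build revalidation headers from previously cached validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, cache):
    """Return the new body, or None if the server replied 304 Not Modified."""
    req = urllib.request.Request(
        url,
        headers=conditional_headers(cache.get("etag"), cache.get("last_modified")),
    )
    try:
        with urllib.request.urlopen(req) as resp:
            # remember the validators for the next poll
            cache["etag"] = resp.headers.get("ETag")
            cache["last_modified"] = resp.headers.get("Last-Modified")
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # unchanged since the last fetch
        raise
```

A library like requests-cache automates exactly this bookkeeping, but a server that ignores both validators forces a full re-download on every poll, which is why restricting sources to hosts with known behaviour is attractive.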
Work on this is coming together.
As it stands: you can create read-only tables that are based on GitHub repos: https://csvbase.com/calpaterson/top-pypi-packages-30-days. Write support is planned (author-email: your user's email address). Hosts other than GitHub are also planned and the mechanism is likely fairly portable. It uses "blobless" cloning; I couldn't see a way to clone single files.
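For reference, a "blobless" clone can be made by shelling out to git with a partial-clone filter. This is a sketch of the idea under that assumption, not necessarily how csvbase invokes it:

```python
# "Blobless" partial clone: commits and trees are fetched up front,
# file contents (blobs) only on demand when a file is checked out.
import subprocess


def blobless_clone_argv(repo_url, dest_dir):
    """argv for a blobless clone of repo_url into dest_dir."""
    return [
        "git", "clone",
        "--filter=blob:none",  # defer blob download until needed
        repo_url, dest_dir,
    ]


def blobless_clone(repo_url, dest_dir):
    subprocess.run(blobless_clone_argv(repo_url, dest_dir), check=True)
```

Unlike a shallow (`--depth 1`) clone, this keeps the full commit history available, which matters later for surfacing authorship and version information.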
Git-backed tables are updated periodically, every 30 minutes. Reducing this delay by using GitHub's webhooks is planned. I probably won't add support for other hosts' webhook mechanisms.
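One piece a GitHub webhook receiver would need is signature checking: GitHub signs each delivery with the shared secret and sends the result in the `X-Hub-Signature-256` header. A sketch (the function name and wiring are illustrative):

```python
# Verify a GitHub webhook delivery before triggering a table refresh.
import hashlib
import hmac


def valid_github_signature(secret: bytes, payload: bytes, header: str) -> bool:
    """Compare X-Hub-Signature-256 against an HMAC-SHA256 of the payload."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, header)
```

A push event that passes this check could then enqueue an immediate re-fetch of the affected table instead of waiting for the 30-minute poll.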
S3, HTTP(S) and SQL are also planned.
This is a complicated nest of features but it's at the top of the to-do list now and support for this will grow rapidly.
Brief overview
Imagine that some repos on GitHub, GitLab and other source code hostings contain datasets, often for testing purposes, though those datasets can be useful for other purposes too. The tables in those repos are the primary sources of truth.
A user may want a nice web GUI and other extended capabilities for the data, so he can upload it to your service.
But that would require him to re-upload the data to your service each time he makes a commit. Even setting aside that the service currently has no feature for this at all, it is an additional burden. Still, storing data files in git repos is quite convenient.
Additional details
So the proposal is the following: two source types, `git` and `url`.

- `git` is a direct reference to a git repo.
- `url` is a direct reference to a URL to fetch data from, using any library dealing with HTTP.

The difference between `git` and `url` is that the service can use the `git` protocol to fetch some info from commit history, like authorship information, activity statistics, schema changes and previous versions.
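As a sketch of what the `git` source type buys over a plain `url`: once the repo is cloned, commit metadata is one `git log` away. The helper names below are illustrative, not part of csvbase:

```python
# Pull per-file authorship history out of a cloned repo.
import subprocess

LOG_FORMAT = "%an\t%ae\t%aI\t%s"  # author name, email, ISO date, subject


def parse_log(output):
    """Split tab-separated `git log` output into 4-tuples."""
    return [tuple(line.split("\t", 3)) for line in output.splitlines() if line]


def file_history(repo_dir, path):
    """Commits touching `path`, newest first."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", f"--format={LOG_FORMAT}", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_log(out)
```

The same plumbing could feed activity statistics or a schema-change timeline, none of which is recoverable from a bare HTTP download of the file.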