dolthub / dolthub-issues

Issues for dolthub.com
https://dolthub.com

Provide commit version information in eTag for downloads #518

Closed noamross closed 6 months ago

noamross commented 9 months ago

Is your feature request related to a problem? Please describe.

Our build system, like many others, uses the common HTTP response header ETag to check whether data has been updated, and HEAD requests to decide whether to download it. DoltHub download responses (e.g., https://www.dolthub.com/csv/ecohealthalliance/wahisdb/main/people_role_relation) don't include an ETag or other cache information in their headers.
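A minimal sketch of the kind of check described above, in Python with only the standard library. The names `fetch_etag` and `needs_download` are illustrative and not part of any DoltHub or build-system API; the `opener` parameter is an assumption added so the network call can be swapped out.

```python
import urllib.request


def fetch_etag(url, opener=urllib.request.urlopen):
    """Send a HEAD request and return the ETag header, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        return response.headers.get("ETag")


def needs_download(cached_etag, current_etag):
    """Download when either validator is missing or the validator changed."""
    if cached_etag is None or current_etag is None:
        return True
    return cached_etag != current_etag
```

Without an ETag in the response, `fetch_etag` returns `None` and `needs_download` always answers `True`, which is the situation this issue describes.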

Describe the solution you'd like

It would be excellent if CSV/ZIP/Excel and other direct downloads from repositories had HTTP response headers with version information in them. It makes natural sense for the ETag value to be the commit hash of what is pulled. (While a table might be the same between commits, I'm unsure whether a table-level hash is a concept in Dolt the way an object/blob hash is in Git.) This is most important for links that reference the HEAD or a branch of a database, but it seems it could apply to everything.

It may also be a good idea, for links that directly reference a commit hash (e.g. https://www.dolthub.com/csv/ecohealthalliance/wahisdb/usnlgpjqtcgl1fpd0g13mls07p155hga), to set Cache-Control directives such as immutable.
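To make the proposal concrete, here is a rough sketch of the headers a server could attach, written in Python. It is not DoltHub's implementation: the function name `download_headers`, the `ref_is_commit_hash` flag, and the specific `max-age` and `no-cache` values are all assumptions chosen for illustration.

```python
def download_headers(commit_hash, ref_is_commit_hash):
    """Build cache headers for a download that resolved to `commit_hash`."""
    # ETag values are quoted strings in HTTP.
    headers = {"ETag": f'"{commit_hash}"'}
    if ref_is_commit_hash:
        # Content at a commit-hash URL never changes, so caches may keep it.
        headers["Cache-Control"] = "public, max-age=31536000, immutable"
    else:
        # A branch can move, so ask caches to revalidate before reuse.
        headers["Cache-Control"] = "no-cache"
    return headers
```

With this split, a branch link and a commit-hash link that resolve to the same commit share an ETag, and only the commit-hash link is marked immutable.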

Describe alternatives you've considered

In our own work we use workarounds such as making an API request to get the commit hash of the HEAD or a branch, and triggering a download when it changes.
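The workaround amounts to roughly the following Python sketch. `get_head_hash` and `download` are hypothetical callables supplied by the caller; they stand in for the API request and the file download, and no particular DoltHub endpoint is assumed.

```python
def sync_if_updated(get_head_hash, download, last_seen_hash):
    """Download only when the branch head differs from the last hash seen.

    `get_head_hash` and `download` are hypothetical callables: the first
    returns the current commit hash, the second fetches data for a hash.
    """
    head_hash = get_head_hash()
    if head_hash != last_seen_hash:
        download(head_hash)
    # Return the hash so the caller can store it for the next run.
    return head_hash
```

The cost is an extra request and some site-specific code for every data source, which is what a standard header would avoid.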

Another possibility would be for links to HEAD, a branch, or a tag to redirect to the fixed commit-based URL, but an ETag seems simpler and is a standard that works across sites.

noamross commented 9 months ago

I note that raw GitHub downloads have ETags (e.g., https://raw.githubusercontent.com/codemeta/codemeta/66284de845a413414be98e63d8eeee3569619de2/codemeta.json has etag: W/"77fe8b497c0a90667e5cf550884dc23a1bef2be9b9ae68d62442dad0cf6099a0"), though the value doesn't appear to be the commit hash or the object/blob hash but something else.

The Last-Modified response header would also be useful and could be derived from Dolt data. It's at least as commonly used to check for updates and is often a fallback for ETags.
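A sketch of how a client might use Last-Modified as a fallback, in Python with the standard library. The function name `is_stale` and the use of plain dicts for headers are assumptions for illustration; real HTTP clients differ in the details.

```python
from email.utils import parsedate_to_datetime


def is_stale(cached, current):
    """Compare two header dicts: prefer ETag, fall back to Last-Modified."""
    if cached.get("ETag") and current.get("ETag"):
        return cached["ETag"] != current["ETag"]
    if cached.get("Last-Modified") and current.get("Last-Modified"):
        cached_time = parsedate_to_datetime(cached["Last-Modified"])
        current_time = parsedate_to_datetime(current["Last-Modified"])
        return current_time > cached_time
    # With no validator to compare, assume the data may have changed.
    return True
```

For example, `is_stale({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}, {"Last-Modified": "Thu, 22 Oct 2015 07:28:00 GMT"})` reports that a fresh download is needed.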

liuliu-dev commented 6 months ago

Hi @noamross, I added the ETag for downloads. For links that include a commit hash, the ETag is an encoded hash of the commit. For links that include a branch name, the ETag is an encoded hash of the head commit of the branch. In both cases, Cache-Control is set to immutable.

noamross commented 6 months ago

This is great, thanks @liuliu-dev!