datalad / datalad-registry

MIT License

Sanitize unconventional URLs #147

Open candleindark opened 1 year ago

candleindark commented 1 year ago

As mentioned by @yarikoptic in https://github.com/datalad/datalad-registry/issues/125#issuecomment-1491109779, we may encounter URLs such as datalad-annex::file://{export}?type=directory&directory={{path}}&encryption=none&dladotgit=uncompressed and libarchive://deeply/nested/path::ftp:///archive.7z. We need a way to sanitize them.

yarikoptic commented 1 year ago

Well, AFAIK it wouldn't be needed if we just used a uuid or md5 of the URL, and that is where we seem to be heading.
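A minimal sketch of that idea, assuming the identifier only has to be stable and filesystem-safe. The helper names `url_md5` and `url_uuid` are hypothetical and are not part of datalad-registry:

```python
import hashlib
import uuid


def url_md5(url: str) -> str:
    """Return the hex MD5 digest of the URL's UTF-8 bytes (32 hex characters)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def url_uuid(url: str) -> uuid.UUID:
    """Return a deterministic, name-based (version 5) UUID for the URL."""
    return uuid.uuid5(uuid.NAMESPACE_URL, url)


if __name__ == "__main__":
    for u in (
        "datalad-annex::file://{export}?type=directory&directory={{path}}"
        "&encryption=none&dladotgit=uncompressed",
        "libarchive://deeply/nested/path::ftp:///archive.7z",
    ):
        # Both identifiers consist only of hex digits (plus hyphens for the
        # UUID), no matter which characters the URL itself contains.
        print(url_md5(u), url_uuid(u))
```

Either value depends only on the exact URL string, so characters such as `::`, `{` or `?` never reach the derived name. Two spellings of the same location would still map to different identifiers unless the URL is harmonized first.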

candleindark commented 1 year ago

@yarikoptic Frankly, I have never seen such URLs before. I filed the issue so that I can handle them later. Will these URLs complicate a solution for #146?

yarikoptic commented 1 year ago

Oh, I forgot that those don't really follow the W3C standard but are rather something git allows for. From https://git-scm.com/docs/git-clone:

When Git doesn’t know how to handle a certain transport protocol, it attempts to use the remote-<transport> remote helper, if one exists. To explicitly request a remote helper, the following syntax may be used:

<transport>::<address>

So we should indeed make the code first split off the transport (it could probably be identified simply via a [^/]+:: regex, or something more specific -- see how git does it) and then process the address for harmonization.
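A rough sketch of that split, assuming a transport name looks like a URL scheme (letters, digits, `+`, `.`, `-`). That character set is an assumption made here as a stricter variant of `[^/]+::` and has not been checked against git's own parsing; `split_transport` is a hypothetical helper:

```python
import re
from typing import Optional, Tuple

# Assumption: the transport prefix consists of scheme-like characters only.
# This is stricter than the `[^/]+::` idea and is not verified against git.
_TRANSPORT_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9+.-]*)::(.*)$", re.DOTALL)


def split_transport(url: str) -> Tuple[Optional[str], str]:
    """Split '<transport>::<address>' into (transport, address).

    Returns (None, url) unchanged when there is no explicit transport prefix.
    """
    m = _TRANSPORT_RE.match(url)
    if m is None:
        return None, url
    return m.group(1), m.group(2)


if __name__ == "__main__":
    print(split_transport("datalad-annex::file:///tmp/export?type=directory"))
    print(split_transport("libarchive://deeply/nested/path::ftp:///archive.7z"))
```

Because the pattern is anchored at the start and excludes `:` and `/` from the transport name, a `::` that appears later in the path (as in the libarchive example) is not treated as a transport separator, and the address part can be passed on to ordinary URL harmonization.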

As for the 2nd one -- I am no longer sure where I picked it up, but it is a "more standard" URL, since urlparse handles it:

❯ python -c 'from urllib.parse import urlparse;print(urlparse("libarchive://deeply/nested/path::ftp:///archive.7z"))'
ParseResult(scheme='libarchive', netloc='deeply', path='/nested/path::ftp:///archive.7z', params='', query='', fragment='')

so we can just leave it at that.

Overall, :: is also used heavily in fsspec for "chaining" URLs: https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining but I don't know if we should care about that until we see such URLs actually being used. @mih do you have constructs for datalad clone that would be such "chained" URLs for datalad-annex:: ?