IQSS / dataverse

Open source research data repository software
http://dataverse.org
Other
876 stars 484 forks source link

Permalinks in S3 storage identifiers: treatment of slashes #10771

Open vera opened 1 month ago

vera commented 1 month ago

After creating a dataset in a Dataverse configured for Permalink PIDs and S3 storage, I noticed that the dataset's storage identifier looks like this:

s3://https://example.com/mypermalinks//XYZ...

(where https://example.com/mypermalinks/ is the authority I have configured and XYZ... is the unique part of the identifier)

In this example, the Permalink PID that is displayed in the Dataverse UI is https://example.com/mypermalinks/XYZ....

Note that the storage identifier contains a double slash preceding the unique part, but the PID does not. Is this a bug in the storage identifier generation?

In my S3 management console as well as using the S3 command line tool s3cmd, the double slashes (both the "buggy" slash and the one in https://) are interpreted as a kind of "folder" structure that is weird to navigate.

Here is a snippet from the Amazon S3 docs about object key names:

The console uses the key name prefixes (Development/, Finance/, and Private/) and delimiter ('/') to present a folder structure. ... Amazon S3 supports buckets and objects, and there is no hierarchy. However, by using prefixes and delimiters in an object key name, the Amazon S3 console and the AWS SDKs can infer hierarchy and introduce the concept of folders. The Amazon S3 console implements folder object creation by creating a zero-byte object with the folder prefix and delimiter value as the key. ...

Characters that might require special handling

The following characters in a key name might require additional code handling and likely need to be URL encoded or referenced as HEX. ...

  • ...
  • Forward slash ("/")
  • ...

https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html

Is this creation of a folder structure in S3 intended? If yes, I think the folder structure should be created in a way that makes more sense (e.g. don't create a weird empty-name folder because of the double slash at the start of a URL (https://...)).

In any case, I think it might make sense to URL-encode some or all of the slashes inside Permalink PIDs when generating the storage identifier, as suggested by the Amazon docs (e.g. set a storage identifier like s3://https%3A%2F%2Fexample.com%2Fmypermalinks%2F%2FXYZ... (< this would be an option that wouldn't create any nested folder structure inside the S3 bucket)).

What do you think?

qqmyers commented 1 month ago

FWIW: I haven't had time to look into this, but I don't think permalinks starting with https:// or having double slashes in general was considered during the design (versus having a base-url and having a URL form like DOIs with https://doi.org/ ). There are probably some parts of the code that assume the last // is a special separator in the storageidentifier, e.g. that everything before it is the bucket name, which I think would break with URLs as permalink identifiers.

vera commented 1 month ago

I see. I actually didn't consider the base-url setting. Thanks for pointing that out.

I tried again to configure my Permalink PID provider using that setting. The storage identifier looks fine then (no double slashes). However, the permalink itself is now https://example.com/mypermalinks/citation?persistentId=perma:XYZ instead of https://example.com/mypermalinks/XYZ....

Is there an intended way to get permalinks that look like the latter?

qqmyers commented 1 month ago

Nothing implemented I think. The only use case we had for the initial development was for locally minted ones. In other places (e.g. external controlled vocab), we've set up settings with simple ways to substitute values in to create URLs. So here perhaps something like *.resolver-format={pid} (or "https://example.com/mypermalinks/{pid}" - if there are cases where always prepending the base-url doesn't work) where {pid} would be the bare numeric PID. That, or something similar, would be consistent with the earlier design discussions - I'd be happy to review any PR along those lines or discuss further if you think something else is needed.

vera commented 1 month ago

I've just pushed a commit here with an idea of how the handling of the base-url setting could be changed:

https://github.com/vera/dataverse/commit/0ccde224435ba0567b1f09f2e7e169fc47993b65

This would allow more flexible permalink formats, not limited to permalinks including the hardcoded "/citation?persistentId=" string. If a base-url is set, it is used as-is. (For example, this would allow my https://example.com/mypermalinks/{pid} permalinks.) Of course, if the intention is for the permalink to include the "/citation?persistentId=" string, it could be set as part of the base-url. If no base-url is set, this would default to the configured Dataverse host name + "/citation?persistentId=" (same as now).

I've also updated the PidUtilTest so you can see in the code how it would work.

(By the way, I think this would also fix a majority of the other issues we had with permalinks #10768 and #10769)

qqmyers commented 1 month ago

Thanks - looks good to me and simpler than what I suggested. Your solution assumes the PID will always be at the end of the URL to resolve it, which could be true in practice. If not, we can look for options if/when someone has that use case.

johannes-darms commented 1 month ago

@vera When you create the PR could you update the documentation with a hint to not use / and most likely other special character for authority and shoulder as they are used for file paths to store datafiles.

vera commented 1 month ago

Your solution assumes the PID will always be at the end of the URL to resolve it, which could be true in practice. If not, we can look for options if/when someone has that use case.

Sounds good.

The only other issue I see with my suggestion is, it requires updating the Permalink configuration for all installations that have a base-url configured. I don't know if we need to look out for that.

qqmyers commented 1 month ago

Yeah, probably rare at this point. Since this involves MicroProfile settings, which could be jvm-options or something else, I'd suggest just adding a release note in the PR and pointing out that anyone with a perma base-url configured needs to add ?citation... to it (or whatever the correct change is) when upgrading (and avoiding giving specific instructions for the various MicroProfile options or trying to automate the change in some way).

qqmyers commented 1 month ago

@vera When you create the PR could you update the documentation with a hint to not use / and most likely other special character for authority and shoulder as they are used for file paths to store datafiles.

If you want to go farther, I think it would make sense to actually consider these errors and block them.

I'd also suggest some warning about using chars that have special meaning in URLs - I think things like ? and & would break resolver URLs, XML special chars might break xml metadata exports, etc. Perhaps something vague like "In general, PermaLink authority/shoulder values should be alphanumeric. For other cases, admins may need to consider the potential impact of special characters in resolver URLs, exports, etc.

vera commented 1 month ago

Just opened a PR: #10775

For now, we've just added a warning in the docs about special characters in authority/shoulder values, as you suggested.