ga4gh / data-repository-service-schemas

A repository for the schemas used for the Data Repository Service.
Apache License 2.0
60 stars 53 forks source link

accession field in drs uri overrides `port`; breaks uri validation #344

Open bwalsh opened 3 years ago

bwalsh commented 3 years ago

All: I may be missing something, but assuming we want the drs identifier to be a uri, do we need to re-think the over-ride of the port field?

As it stands, I don't think it is plug compatible with [http, ftp, s3, gs,....]. This is problematic for typed systems that validate url properties (ie fhir)

From: Alexander Vantol <avantol@uchicago.edu>
Sent: Friday, February 19, 2021 6:28:17 PM
To: Alessandro Culotti; Brian Walsh; Sean Burke; Fan Wang; Kyle Ellrott; Brian O'Connor
Subject: [EXTERNAL] Re: AnVIL: Invalid DRS urls stored in gen3.theanvil.io

Hey all,

Didn't catch up on the whole chain but want to follow up on DRS URI mentioned, as the format provided is valid per our understanding of the GA4GH DRS specification: https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.1.0/docs/#_compact_identifier_based_drs_uris

drs://[provider_code/]namespace:accession ... Both the provider_code and namespace disallow spaces or punctuation, only lowercase alphanumerical characters, underscores and dots are allowed (e.g. [A-Za-z0-9._]).

We worked pretty closely with GA4GH DRS group (and Broad and other collaborators) to ensure that our "prefix" feature of GUIDs was within the bounds of a valid DRS URI for all our commons in the NIH Interoperability context. Note that the accession is also allowed to have alphanumeric and dots and slashes (though there's some nuance around percent-encoding when hitting a GET endpoint in DRS). 

Thanks,
Alex

On Fri, Feb 19, 2021 at 12:14 PM Alessandro Culotti <aculotti@uchicago.edu> wrote:
FYI

---------- Forwarded message ---------
From: Brian Walsh <walsbr@ohsu.edu>
Date: Wed, Feb 17, 2021 at 7:18 PM
Subject: AnVIL: Invalid DRS urls stored in gen3.theanvil.io
To: Alessandro Culotti <aculotti@uchicago.edu>
Cc: Sean Burke <svburke@uchicago.edu>, Fan Wang <fan1@uchicago.edu>, Kyle Ellrott <ellrott@ohsu.edu>, boconnor@broadinstitute.org <boconnor@broadinstitute.org>

Alessandro:

There appears to be invalid urls stored in gen3.anvil.

Specifically,  given the url "drs://dg.ANV0:dg.ANV0/1c772afd-7404-47fc-8636-92d49e314f00", the `port` is non-numeric.

This causes many validators/parsers to fail.  See RFC 3986 below.

I believe simply eliminating the “:bg.ANV0” port suffix might be the easiest fix.
dglazer commented 3 years ago

I believe part of what's happening here is mixing up URI's and URL's. Strings like "drs://..." are URIs, and are not meant to be parsed or resolved as if they were URLs. So "There appears to be invalid urls stored in gen3.anvil" can't be right, since those strings aren't URLs.

ianfore commented 3 years ago

Yes the distinction between URI, URL and URN often get mixed up, and it is indeed a URI.

That doesn't deal with the issue though. I would restate it as the following being an invalid DRS URI. drs://dg.ANV0:dg.ANV0/1c772afd-7404-47fc-8636-92d49e314f00

drs://dg.ANV0:1c772afd-7404-47fc-8636-92d49e314f00 should be sufficient.

This issue is really a duplicate of #340.