retPath as a retrieval path to override relPath

petersilva commented 4 years ago

extracted from an email thread...

From @josusky: Another interesting idea that Peter mentioned is a logical difference between the retrieval URL and the relative path or file name. I can confirm that some systems require quite complex URLs to provide the "right" data. In the current concept the client uses the relative path for both: to construct the URL needed to retrieve the data, and as the local "path" or file name to store it. Of course, in many cases such data will be consumed without actually creating a local file, or without worrying about its readability, but if we adopt two separate fields, "retPath" and "relPath", with one of them being optional, then we will be able to elegantly handle a broader range of data sources.

My proposal is to keep "relPath" as a kind of canonical product instance identifier, while the optional "retPath", if specified, would be used instead of it to construct the URL that the data provider needs to unambiguously identify the data instance in its data store (whatever it is).
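A minimal sketch of the override rule proposed here, in Python. The message is modelled as a plain dict; the `baseUrl` key and both function names are assumptions made for illustration, not part of any agreed schema. Only `relPath` and `retPath` come from the discussion.

```python
def retrieval_url(message: dict) -> str:
    """Build the download URL, preferring retPath over relPath when present."""
    base = message["baseUrl"].rstrip("/")  # "baseUrl" is an assumed field name
    path = message.get("retPath") or message["relPath"]
    return base + "/" + path.lstrip("/")


def local_path(message: dict) -> str:
    """The local name always comes from relPath, never from retPath."""
    return message["relPath"].lstrip("/")


# Hypothetical notification from a query-based source.
msg = {
    "baseUrl": "https://example.org/",
    "relPath": "obs/20200101/CYOW.csv",
    "retPath": "cgi-bin/query?station=CYOW&fmt=csv",
}
print(retrieval_url(msg))
print(local_path(msg))
```

With this rule a message that has no retPath behaves exactly as today, which is what makes the transition cheap.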

petersilva commented 4 years ago

earlier in the thread I had written:

By convention, the topic is derived from the relPath, but that is not a requirement. We have a few use cases where the topic and the relPath don't match at all. This usually happens where you have some kind of database behind the service, and so the retrieval path will not match the topic, typically a query URL of some kind. It is used as a last resort method to obtain data from an uncooperative source. So perhaps a name like retrieval path or retPath is a more accurate name.

Issues with using relPath as a Retrieval Path:

The recipient needs to assign their own directory tree and file name on receipt, which may or may not match the upstream peer's view. The peers in a mesh will exchange a conventional product (i.e. a static one with a different path, a different URL, and a given checksum). It cannot be easily compared to the upstream one, and so it does not participate in the mesh. Typically such systems publish URLs for dynamic queries, so the checksum of the result is unknown to the publisher. So you have a static URL with no checksum, and every time you ask, you are liable to get a different result. If you do, you don't know whether it is because of some error, or because a new product is available, or...

For people to want to use such URLs on a consistent basis, the published messages would need:
-- to include a good, consistent file name (and tree) for the downloaded result,
-- to include a checksum that will be expected for the result of the retrieval.
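As a rough illustration of the second requirement, a subscriber could compare the downloaded bytes against the checksum carried in the message. This is a sketch only: the function name, the hex encoding, and the choice of SHA-512 as the default are assumptions, not something the thread specifies.

```python
import hashlib


def matches_expected(payload: bytes, expected_hex: str, algorithm: str = "sha512") -> bool:
    """Return True when the payload's digest equals the advertised checksum."""
    digest = hashlib.new(algorithm, payload).hexdigest()
    return digest == expected_hex.lower()


# Hypothetical download result and the checksum the publisher advertised.
body = b"station,temp\nCYOW,-12.5\n"
advertised = hashlib.sha512(body).hexdigest()

print(matches_expected(body, advertised))              # same bytes
print(matches_expected(body + b"extra", advertised))   # query returned something else
```

A mismatch alone cannot say whether the source made an error or simply produced a newer product, which is the ambiguity described above.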

If a downstream third party gets a product from different upstreams that each query the same source, the same product will end up in the same tree, with the same name (and sum). Otherwise, mesh participation is more complicated. So while using relPath to have a different value than topic is already permissible, it is in practice a bit of a challenge, other than as a means of obtaining data from external systems (where external just means data sources that do not fully adopt the pub/sub methodology).

On the other hand, while the above use of relPath works, it is perhaps not sufficiently obvious and perhaps the whole scheme should be re-cast into something like:

base_url / topic / file_name

So the field in the message is the file name, and the topic header contains the relative tree for placing it (eliminating redundancy in the current scheme)

(optional field) retrievalPath =

would be a rarely present field to override default behaviour. This could be clearer, and save bytes for the 99% case, but it requires re-working the Canadian stack to some degree. Such a change might make some sense as an optimization, but I haven't considered it worthwhile so far. Nobody has asked.

petersilva commented 4 years ago

So @josusky has suggested just overriding relPath with (the usually not present) retPath when it is there. This is a lot easier to transition to than either of the schemes I presented.

Also there is another issue with basing the relative path on the topic header in that paths may contain special characters, and protocols have different restrictions (e.g. using . as a delimiter) so the forward and reverse mapping is fraught. It is probably much simpler for everyone to keep relPath as is.
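To make the mapping problem concrete, here is a small Python sketch. It assumes a naive convention where "/" in the path becomes "." in the topic and back again; both helper names and that convention are illustrative assumptions, not how any particular stack does it.

```python
def path_to_topic(rel_path: str) -> str:
    """Naive forward mapping: path separators become topic delimiters."""
    return rel_path.strip("/").replace("/", ".")


def topic_to_path(topic: str) -> str:
    """Naive reverse mapping: every delimiter becomes a path separator."""
    return topic.replace(".", "/")


original = "obs/2020/CYOW.v1.csv"  # hypothetical relative path
topic = path_to_topic(original)
recovered = topic_to_path(topic)

print(topic)                  # obs.2020.CYOW.v1.csv
print(recovered)              # obs/2020/CYOW/v1/csv
print(recovered == original)  # False
```

The dots inside the file name are indistinguishable from delimiters once the path is flattened, so the reverse mapping cannot recover the original without some escaping rule that every protocol would have to share.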

petersilva commented 4 years ago

worry: sources that require a retPath are very likely to have no idea what to specify for relPath, and may elect to leave it out, which is likely to break Mesh networking to some degree, and make forwarding among different third parties liable to divergence (different pumps placing the same data in different directories because no relPath is given.)

josusky commented 4 years ago

Perhaps I am too optimistic, but I think that all sources will be able to provide a valid and meaningful relPath. For many sources, typical data providers, the relPath will be almost a fixed string: a template that will be configured at the very beginning of the data (and notification) production and rarely touched later. Only a small part (mostly date/time) needs to be updated with each new notification. Of course, we should encourage people to do a basic validation of received notifications and flag the ones without the relPath :-) That said, I am going to update the schema.
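The basic validation mentioned here could look something like the following Python sketch. It only checks the two fields discussed in this thread; the function name and the wording of the problem strings are made up for illustration, and a real check would presumably be driven by the schema.

```python
from typing import List


def validation_problems(message: dict) -> List[str]:
    """Return human-readable problems; an empty list means the message passes."""
    problems = []
    rel_path = message.get("relPath")
    if not isinstance(rel_path, str) or not rel_path.strip():
        problems.append("relPath is missing or empty")
    ret_path = message.get("retPath")
    # retPath is optional, but if it is present it should be usable.
    if ret_path is not None and (not isinstance(ret_path, str) or not ret_path.strip()):
        problems.append("retPath is present but empty")
    return problems


# A hypothetical source that supplied only a retrieval path gets flagged.
print(validation_problems({"retPath": "cgi-bin/query?station=CYOW"}))
```

Flagging rather than dropping such messages keeps the data flowing while making it visible which sources still need a canonical name.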

petersilva commented 4 years ago

looks good. in your definition, the description of relPath as minimalist metadata is apt. People in different domains seem to agree that every datum having a single, canonical name is a very good quality. the relPath is this name. in cases where supplying the canonical name to the retrieval system isn't sufficient, one can, in addition, supply a retPath.

MetPX / wmo_mesh

retPath as a retrieval path to override relPath #13
