Make control over response parsing more fine-grained

adamretter commented 5 years ago

This gives the user much more control over the response parsing, including the ability to choose the parser for the response when they know better than the response's content type.

Probably needs some editorial to make the language more human friendly, but assuming people are happy with the approach I think the editorial can come in a later PR.

Closes https://github.com/expath/expath-cg/issues/108

ChristianGruen commented 5 years ago

Wow, Adam, I see you spent a lot of time on this. I have some concerns if the solution may not be too complex… But I’ll check the details first and get back to you.

adamretter commented 5 years ago

So yes it does allow some complexity... if you want it. But you don't have to use these new bits! For most people's use-cases the defaults are sane enough to just do what they expect without them having to do anything.

ChristianGruen commented 5 years ago

I’ll start with some practical use cases I get in my mind for this feature, and I’ll compare the old and new proposal (feel free to blame me if I ignored other obvious use cases!). Two notes in advance:

csv:parse and html:parse may be built-in or user-defined functions.
In the "original version", I wrote parse-bodies instead of parse-response, because I think it’s closer to what I wanted it to be (see https://github.com/expath/expath-cg/issues/108#issuecomment-430146707…).

1. Retrieve status of response

Original Version:

http:get($URL, map { 'status-only': true() }) ? headers ? status

New Solution:

http:get($URL, map {
  'parse-response': map { 'entity': $http:parse-mode-status }
} ? headers ? status

2. Parse result as CSV

Original Version:

http:get($URL, map { 'parse-bodies': false() }) ! csv:parse(?body)

New Solution:

http:get($URL, map {
  'parse-response': map { 'entity': $http:parse-entity-raw }
}) ! csv:parse(?body)

3. Parse multipart response, return results with media-type text/html as XML:

Original Version:

http:get('http://expath.org/xyz', map { 'parse-bodies': false() }) ? body
 [?headers?Content-Type = 'text/html'] ! html:parse(.)

New Solution:

http:get($URL, map {
  'parse-response': map { 'mode': $http:parse-mode-multipart-raw } (: not sure? :)
}) ? body
 [?headers?Content-Type = 'text/html'] ! html:parse(.)

To be honest, I’m still not sure if I chose the correct options. And there may be combinations that create identical results. An example (but I may be wrong?):

map { 'parse-response': map { 'entity': $http:parse-entity-none } }
map { 'parse-response': map { 'mode': $http:parse-mode-headers } }

To summarize my feeling: I am not sure if we really need all the flexibility, and if we don’t lose our users here? Maybe we can find a simpler solution again? And maybe we provide some options for which there will hardly be any use cases in the wild (but which we’d all need to check through in an implementation)?

Some thoughts on the status quo:

Maybe it’s too low-level to only return plain raw binaries… I got reminded of this when I saw the additional options for parsing JSON, HTML, … that you added in this pull request.
This requires implementations and converters to be able to process binary data, whereas most existing XQuery functionas to process data are based on strings. After all, XQuery is mainly a string processing language
Additionally, it can be cumbersome and error prone to do pass on the string conversions to the user, as the encoding of the response header needs to be taken into consideration, etc.

So maybe one helpful requirement could be that we can get results as binaries or as strings.

I will choose the other extreme now and provide another minimal suggestion for the parsing option:

`parse` Option	Description
`auto`	implicit parsing (default)
`binary`	return all bodies as binaries
`string`	return all bodies as strings
`skip`	ignore response body

There are various things that are not covered by this proposal. Examples:

Only return the status (all headers will be returned as well when using skip)
Convert HTML to XML (this would need to be done by the user)
Return multipart bodies as a single blob.
Only return headers of multipart results.

Maybe it’s not so bad:

The headers will be returned by the server anyway, and creating the response map should be much cheaper and faster than sending the request.
HTML parsing is standardized nowhere in the XQuery specs, and a user will have more freedom if she can control the behavior of TagSoup or Validator.nu by herself.
I never needed something like this; but maybe there are use cases indeed?
In order to retrieve multipart bodies, the full response needs to be parsed by an implementation, so it would be similarly expensive to keeping the result.

Talking about multipart bodies, one frequent use case is that the returned bodies have different media-types and, thus, need to be parsed differently. In that case, if the implicit processing is not good enough, the relevant bodies would need to be parsed in a subsequent step anyway. This is something (I believe) that is not really covered by any of our current proposals right now. It’s not covered by my last proposal either: Some results might be strings, others might be binary, but we’ll need to treat all of them identically. The EXPath Binary Module allows us to convert binary items to strings (and probably most implementations have their own custom functions for doing that). But it’s something people might ask for if we provide them with a more complex solution. – On the other hand, the big majority of processed responses won’t be multipart, or can hopefully be processed implicitly, so maybe that’s something we shouldn’t lose too much time on.

Many thoughts… @adamretter, what do you think?

ChristianGruen commented 5 years ago

I’ll create a new issue soon in which I’ll summarize some requirements for a more flexible response handling as is available in 1.0.

adamretter commented 5 years ago

@ChristianGruen Let me first make some small corrections to your examples. It's better I do so in a separate comment rather than editing yours, so that we can see the differences.

1. Retrieve status of response

Original Version:

http:get($URL, map { 'status-only': true() }) ? headers ? status

(Christian's example) New Solution:

http:get($URL, map {
  'parse-response': map { 'entity': $http:parse-mode-status }
} ? headers ? status

New Solution:

http:get($URL, map {
  'parse-response': map { 'mode': $http:parse-mode-status }
} ? headers ? status

Note you need to use mode and not entity here!

2. Parse result as CSV

Original Version:

http:get($URL, map { 'parse-bodies': false() }) ! csv:parse(?body)

(Christian's example) New Solution:

http:get($URL, map {
  'parse-response': map { 'entity': $http:parse-entity-raw }
}) ! csv:parse(?body)

New Solution:

http:get($URL, map {
  'parse-response': map { 'mode': $http:parse-entity-text }
}) ! csv:parse(?body)

Note I don't know the signature of the csv:parse function, but I am assuming that it takes a string. Otherwise, if you want xs:base64Binary then Christian's solution is correct.

3. Parse multipart response, return results with media-type text/html as XML:

Original Version:

http:get('http://expath.org/xyz', map { 'parse-bodies': false() }) ? body
 [?headers?Content-Type = 'text/html'] ! html:parse(.)

(Christian's example) New Solution:

http:get($URL, map {
  'parse-response': map { 'mode': $http:parse-mode-multipart-raw } (: not sure? :)
}) ? body
 [?headers?Content-Type = 'text/html'] ! html:parse(.)

New Solution:

http:get($URL, map {
  'parse-response': map { 'entity': $http:parse-entity-raw }
}) ? body
 [?headers?Content-Type = 'text/html'] ! html:parse(.)

Note: You need entity': $http:parse-entity-raw and not 'mode': $http:parse-mode-multipart-raw. The reason is that you want full parsing of the response framing so you need to leave the mode as the default (i.e. $http:parse-mode-full), but you want to disable parsing of each body.

adamretter commented 5 years ago

So to clarify, my approach was two pronged: 1) allow the user flexibility but only if and when they want it 2) not force (the majority?) of users to have to parse http response bodies in their own code

For example, prior to this PR, the HTTP Client Module had the ability to implicitly parse the HTTP response entity(s) to either XML document-node() or JSON (map(), array(), etc).

However, as one example, prior to this PR if the HTTP response was returned with a bad media-type, but the user knew that the content was actually XML or JSON, they were left on their own to parse this into XML or JSON somehow. This is amplified when you consider that the HTTP Client 1.0 module also supported a HTML to XML parsing mechanism.

Consider the use-case 3 from above:

http:get($URL, map {
  'parse-response': map { 'entity': $http:parse-entity-raw }
}) ? body
 [?headers?Content-Type = 'text/html'] ! html:parse(.)

This sadly replies on the user to handle the hard parsing bit. We should empower the user to do that if they wish, but we should not force them to.

IMHO - If we have already had to implement such XML/JSON/HTML->XML parsing in our module implementations anyway, we should likely make it available to the user.

As such, with this PR, use-case 3 from above (assuming a multipart response, with two parts, the first text and the second HTML) can simply be rewritten to:

http:get($URL, map {
  'parse-response': map { 'entity': ($http:parse-entity-text, $http:parse-entity-html-to-xml) }
})

NOTE we no longer need the html:parse function as we already have where needed.

Actually if the Media types in the HTTP response are correct, we could just have used 'entity': $http:parse-entity-auto.

...but wait, `$http:parse-entity-auto happens to be the default, so we can further simplify our use-case 3 to just:

http:get($URL)

...and yes, that gives us the HTML parsing to XML where expected.

adamretter commented 5 years ago

To address some of @ChristianGruen's comments:

To summarize my feeling: I am not sure if we really need all the flexibility, and if we don’t lose our users here? Maybe we can find a simpler solution again?

So I am glad that you wrote some code examples, it is always good to see real syntax. My feeling is that the map nesting, e.g. map { 'parse-response': map { ... } } is not very pretty. I have a few thoughts about this:

The outer map is actually for all request options, so often will contain more than just parse-response, which will make the nesting look less severe. We should also remember that likely >90% of use-cases won't require any parse-response options as the defaults will work for most use-cases. So users will use parse-options rarely.
We could flatten the parse-response grouping, moving the two options mode and entity into the request options as parse-response-mode and parse-response-entity. This would eliminate the nesting... it does make my philosophy of structured data a little uncomfortable, but I could live with it.
I could define further constants for common use-cases, so that instead of:

map {
  'parse-response': map { 'entity': $http:parse-entity-raw }
}

you could just write:

map {
  'parse-response': $http:parse-response-entity-raw
}

Are the aesthetics of the syntax really that bad? Especially in light of (1)? Perhaps it is more the wording of the spec that needs to be improved and further examples added?

Additionally, it can be cumbersome and error prone to do pass on the string conversions to the user, as the encoding of the response header needs to be taken into consideration, etc.

Exactly! I wanted to avoid this where easily possible.

Talking about multipart bodies, one frequent use case is that the returned bodies have different media-types and, thus, need to be parsed differently. In that case, if the implicit processing is not good enough, the relevant bodies would need to be parsed in a subsequent step anyway. This is something (I believe) that is not really covered by any of our current proposals right now.

Got you covered! That is already possible with this PR :-)

If you take a look at the definition for the parse response option entity, you will see that you can supply different parse options for each multipart body for explicit control. Of course the user can also just use the default auto if they want implicit parsing. The wording for entity is:

Where a multipart response is expected, a sequence of values may be specified. If there are less values than multipart parts, then the last value will apply to the remaining multipart parts.

ChristianGruen commented 5 years ago

@adamretter: I see our ideas on response handling still differ. Maybe we could include some other people and see how they judge the different approaches? But I promise I will start another attempt to understand your proposal in full detail.

Did you already have a look at my latest proposal in #127? We could incorporate the idea to also allow sequences for multipart responses (although we should keep in mind that such a solution requires that the order of multipart sections won’t change).

And we should definitely open a separate issue to clarify how we want to handle HTML to XML conversion. In the current specification, there is no parsing rule for HTML data, so http:get($URL) returns a binary item for HTML responses (excluding XHTML). I think this default wouldn’t change, even if we In the current specification. I have deliberately omitted HTML conversion, as it’s highly implementation-defined (see above). But probably you would still prefer to have any HTML option embedded in the spec? Probably we should then allow users to also serialize XML to HTML when sending requests?

ChristianGruen commented 5 years ago

2. Parse result as CSV

New Solution:

http:get($URL, map {
 'parse-response': map { 'mode': $http:parse-entity-text }
}) ! csv:parse(?body)

Yes, I thought that this virtual csv:parse function could handle binary items as well (for BOM detection). But it may be more reasonable for most implementations to pass on a string.

And I mixed up the usage of mode and entity in various cases. Still, I think that entity instead of mode would be correct here?

And (see above)…

map { 'parse-response': map { 'entity': $http:parse-entity-none } }
map { 'parse-response': map { 'mode': $http:parse-mode-headers } }

…won’t these two queries yield the same response? I’m still wondering if there won’t be less confusion if we only had one option?

expath / expath-cg