Cumulus / Syndic

RSS and Atom feed parsing
MIT License

Capture XML attributes? #32

Open Chris00 opened 10 years ago

Chris00 commented 10 years ago

Do we want to capture some XML attributes for Atom feeds, for example xml:lang for content? Example:

    <content type="xhtml" xml:lang="en" xml:base="http://diveintomark.org/">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p><i>[Update: The Atom draft is finished.]</i></p>
      </div>
    </content>
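
To make the question concrete, here is a minimal sketch of what capturing these two attributes could look like. It assumes attributes arrive as ((namespace, local name), value) pairs in the style of Xmlm; the names `xml_attrs` and `capture` are hypothetical and not part of Syndic.

```ocaml
(* Namespace bound to the reserved "xml" prefix. *)
let ns_xml = "http://www.w3.org/XML/1998/namespace"

(* Assumed attribute shape: ((namespace, local name), value). *)
type attribute = (string * string) * string

(* Hypothetical record holding the captured attributes. *)
type xml_attrs = { lang : string option; base : string option }

(* Pick out xml:lang and xml:base, ignoring every other attribute. *)
let capture (attrs : attribute list) : xml_attrs =
  { lang = List.assoc_opt (ns_xml, "lang") attrs;
    base = List.assoc_opt (ns_xml, "base") attrs }

(* The attributes of the <content> element above. *)
let content_attrs =
  capture
    [ (("", "type"), "xhtml");
      ((ns_xml, "lang"), "en");
      ((ns_xml, "base"), "http://diveintomark.org/") ]
```

Using options keeps a missing attribute distinguishable from an empty one.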

Chris00 commented 10 years ago

xml:base is also useful to resolve relative links in the post content. See also https://github.com/ocaml/platform-blog/issues/12#issuecomment-52685888

dsheets commented 10 years ago

If you implement xml:base, be sure to read http://www.w3.org/TR/xmlbase/ and implement relative bases. Since 1.3.12 or so, Uri supports relative-relative resolution so you can partially resolve against xml:base and emit XML without xml:base included for processors that don't understand it and want to just resolve against retrieval URL or self link.

For extra bonus points, solve the xml:base problem for everyone by releasing xmlmbase or something so I don't have to keep re-implementing it...

Chris00 commented 9 years ago

@dsheets It would be nice if Uri.resolve were slightly better documented. I haven't read the code yet, but judging from some toplevel experiments I am not sure what the point of the scheme is. Also, I don't know how you see it, but I think it is better for the parsed documents (Atom, ...) to contain the full (i.e., resolved if possible) URIs, because the feeds can be merged, ...

dsheets commented 9 years ago

What do you mean by "scheme"? The scheme component of a URI? The point is to specify the protocol or resolution method of the rest of the identifier...

As for resolution to absolute identifiers, it depends on what you are manipulating. Given only an XML stream using the Atom vocabulary, the best one can do is resolve with the contained URI bases and URIs. If they are absolute, you will get absolute URIs. If they are relative, you will get relative (but more precise) URIs. Sometimes you don't want to remove xml:base. Often you want to remove it so that other processors don't have to deal with it. If relative URIs are used, you can't resolve them until you have a base URI from the transport protocol or resource retrieval. If you have that information, I absolutely agree that you should use it to resolve relative URIs (if your processing is based on traversing links... if you are transforming the document but will re-serve it later from potentially a different address, you should keep things relative...).
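
A rough sketch of the "remove xml:base" case, to show the shape of the traversal. Everything here is hypothetical: the `node` type, `strip_base`, and especially `toy_resolve`, which is a deliberately naive stand-in and not RFC 3986 resolution (a real version would pass in a proper resolver such as Uri.resolve). Only href and src attributes are rewritten, which is also an assumption.

```ocaml
(* A tiny XML tree; attribute names are plain strings for brevity. *)
type node =
  | Text of string
  | Element of string * (string * string) list * node list

(* Toy resolver, NOT RFC 3986: a reference containing "://" is treated as
   absolute; anything else is simply appended to the base. *)
let toy_resolve ~base reference =
  let is_absolute s =
    let n = String.length s in
    let rec scan i =
      i + 3 <= n && (String.sub s i 3 = "://" || scan (i + 1)) in
    scan 0
  in
  if base = "" || is_absolute reference then reference else base ^ reference

(* Push the base in scope into href/src attributes and drop xml:base.
   [base] is "" when no base is known yet. *)
let rec strip_base ~resolve ~base = function
  | Text _ as t -> t
  | Element (name, attrs, children) ->
    let base =
      match List.assoc_opt "xml:base" attrs with
      | Some b -> resolve ~base b (* a nested base resolves against the outer one *)
      | None -> base
    in
    let attrs =
      attrs
      |> List.filter (fun (k, _) -> k <> "xml:base")
      |> List.map (fun (k, v) ->
             if k = "href" || k = "src" then (k, resolve ~base v) else (k, v))
    in
    Element (name, attrs, List.map (strip_base ~resolve ~base) children)

(* Absolute base: the link comes out absolute. *)
let absolute =
  strip_base ~resolve:toy_resolve ~base:""
    (Element ("content", [ ("xml:base", "http://diveintomark.org/") ],
              [ Element ("a", [ ("href", "archives/") ], [ Text "Archives" ]) ]))

(* Relative base: the link stays relative, but more precise. *)
let relative =
  strip_base ~resolve:toy_resolve ~base:""
    (Element ("content", [ ("xml:base", "blog/") ],
              [ Element ("a", [ ("href", "post.html") ], [ Text "Post" ]) ]))
```

Passing the retrieval URI as the initial `~base` instead of "" would give the fully resolved variant.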

I hope this makes sense. I think for most use cases of this library, the base or absolute retrieval URI will be known and should be used. If you are just writing a function to remove xml:base, you shouldn't use the retrieval identifier, though. I think in Atom's case, relative URLs may be resolved against the self link as well (but I haven't checked the spec). You may not want to resolve those.

And, yes, Uri.resolve should have more extensive documentation.

Chris00 commented 9 years ago

On Sun, 7 Dec 2014 11:03:13 -0800, David Sheets wrote:

> What do you mean by "scheme"? The scheme component of a URI? The point is to specify the protocol or resolution method of the rest of the identifier...

    (* Resolve a URI against a default scheme and base URI *)
    val resolve : string -> t -> t -> t

By “scheme” I mean the default scheme mentioned in the comment above, i.e. the first argument of resolve.

I'll read the rest later.

dsheets commented 9 years ago

Ah, you need to provide a scheme here to direct the scheme-specific parts of resolution; "" and "http" are typical values. I'm not too happy about this part of the interface, but there are some scheme-dependent resolution rules for host normalization. Specifically, "http" (or "https") will lowercase the hostname per DNS, "file" will also remove "localhost", and "" will perform no host normalization. Unfortunately, there are other scheme dependencies, but they haven't been captured in the library yet.
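
In other words, the host handling behaves roughly like the sketch below. This is an illustration of the rules just described, not the code in the library; `normalize_host` is a hypothetical name, and matching "localhost" case-insensitively is an assumption.

```ocaml
(* Illustrative only: scheme-dependent host normalization as described above. *)
let normalize_host ~scheme host =
  match scheme with
  | "http" | "https" ->
    (* Host names are case-insensitive per DNS. *)
    String.lowercase_ascii host
  | "file" ->
    (* Lowercase as well, and treat "localhost" as no host at all. *)
    let h = String.lowercase_ascii host in
    if h = "localhost" then "" else h
  | _ ->
    (* "" (and anything not listed): leave the host untouched. *)
    host

let http_host = normalize_host ~scheme:"http" "DiveIntoMark.ORG"
let file_host = normalize_host ~scheme:"file" "localhost"
let raw_host = normalize_host ~scheme:"" "DiveIntoMark.ORG"
```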

I've been planning a major update to the interface for several months. Please do post issues, ideas, suggestions, questions, etc. to the issue tracker.