flying-sheep / rust-rst

a reStructuredText parser and renderer
https://crates.io/crates/rst
Apache License 2.0
98 stars 10 forks source link

Find a representation for relative URLs. #10

Closed andreubotella closed 5 years ago

andreubotella commented 5 years ago

While working on hyperlink references as part of issue #7, I realized that we need a way to represent relative URLs in the RST doctree. The servo/rust-url crate we're using can parse relative URL strings with a base, but can only store absolute URLs. Unfortunately, looking through the first pages of crates.io, nobody seems to have tried to reinvent the wheel in that matter.

For the time being, I'm leaving relative URL references unimplemented. In the meantime I'll fork and play around with the url crate to try and make a version that can represent relative URLs.

flying-sheep commented 5 years ago

It’s been addressed in servo/rust-url#487. URLs can’t be relative, URIs can. http::uri::Uri exists! Our alternatives:

  1. Each reference knows which path the document it comes from has.
  2. Resolve relative URIs into URLs while parsing. This is a destructive operation.

Question: How do URIs interact with substitutions and .. include ::? Will the source document be relevant for relative path resolution or the document it ends in after transclusion?

andreubotella commented 5 years ago

Thanks for bringing that to my attention, I wasn't aware that that was brought up in servo/rust-url.

About URLs being relative, as a web dev by day I prefer taking my definitions from the WHATWG URL standard, which effectively deprecates the term URI. Sure, that standard doesn't give a way for browsers to represent relative URLs as a data structure, but it does define them as being URLs. But I'm fine with following the URI RFCs.

About the include directive, it seems to simply take the tree parsed by the included file and copy it into the document as is, without bothering to resolve paths. But if the included files in turn have include directives, those are resolved and replaced before the tree is included in the parent document.

Now, I haven't dug into the docutils code for this, but it seems that after validating a URL they simply store it as text in the refuri field, rather than keeping a URI data structure and serializing that on the other end. We could consider doing the same. The URL grammar I've written in the parser as part of my upcoming PR is too broad to identify a valid absolute URL, and I'm using Url::parse in the conversion stage and returning a string rather than a reference if it fails. We could even check for valid relative URLs that way by parsing with a dummy base URL.

flying-sheep commented 5 years ago

You’re right, it’s probably easiest to simply create a wrapper type that just validates if we want to construct it with a valid URL: What we need is the ability to represent links. Any link can have a scheme, host, port, path, query and fragment. Neither crate of http::uri and url seems to support this, as what http::uri calls “relative URIs” still contain absolute paths (“relative” here means “same host and scheme”):

use http::Uri;

fn main() {
    println!("{:?}", "/foo/bar?baz".parse::<Uri>());
    // Ok(/foo/bar?baz)
    println!("{:?}", "./foo/bar?baz".parse::<Uri>());
    // Err(InvalidUri(InvalidFormat))
}
andreubotella commented 5 years ago

I put this issue on hold while I tried to figure how to parse standalone hyperlinks (which can't have punctuation at the end) on pest, but now that I'm back to it, I see that the crate::target::Target enum seems to be meant to represent URLs that could be absolute or relative. However, the relative URL variant uses PathBuf, which is meant for storing a filesystem path. That means that the internal representation of the path is whatever the compile target platform uses for representing filesystem names.

Now, Target is used for marking URLs / URIs in the extra attributes of different elements, but it's also used as the source common attribute. In docutils, source is a filesystem path (apparently never an absolute URL) indicating the source file in the root document and in system messages. For that it would make sense to use simply PathBuf.

For every other use, we could keep Target as a type (perhaps renamed) but turn it into a tuple struct that contains a String and validates the URL on construction as you suggest, something like:

#[derive(Debug,PartialEq,Serialize,Clone)]
#[serde(transparent)]
pub struct Target(String);
impl Target {
    pub fn from_absolute_url(input: &str) -> Result<Self, url::ParseError> {
        let parse = Url::parse(input)?;
        Ok(Target(parse.into_string()))
    }
    pub fn from_relative_url(input: &str) -> Result<Self, url::ParseError> {
        unimplemented!()
    }
}
flying-sheep commented 5 years ago

Yes, you’re right, I didn’t think about compile location specificity of PathBuf or the fact that it could include a query or fragment part. Why would you like to rename it though?

andreubotella commented 5 years ago

Well, the name Target suggests any kind of reference target – not just a URL but an internal target reference. Something like GenericUrl or UrlString would be more appropriate.

flying-sheep commented 5 years ago

Sure, let’s just call it Url then!

andreubotella commented 5 years ago

Well, we're already using url::Url, but it seems all the cases where it was used can be replaced by Target, so I guess we can treat the former Target as a wrapper around it.