avikpaul / google-url

Automatically exported from code.google.com/p/google-url
Other
0 stars 0 forks source link
                   ==============================
                   The Google URL Parsing Library
                   ==============================

This is the Google URL Parsing Library which parses and canonicalizes URLs. Please see the LICENSE.txt file for licensing information.

Features

Example

An example implementation of a URL object that uses this library is provided in src/gurl.*. This implementation uses the "application integration" layer discussed below to interface with the low-level parsing and canonicalization functions.

Building

The canonicalization files require ICU for some UTF-8 and UTF-16 conversion macros. If your project does not use ICU, it should be straightforward to factor out the macros and functions used in ICU, there are only a few well- isolated things that are used.

TODO(brettw) ADD INSTRUCTIONS FOR GETTING ICU HERE!

logging.h and logging.cc are Windows-only because the corresponding Unix logging system has many dependencies. This library uses few of the logging macros, and a dummy header can easily be written that defines the appropriate things for Unix.

Definitions

"Standard URL": A URL with an "authority", which is a hostname and optionally a port, username, and password. Most URLs are standard such as HTTP and FTP.

"File URL": A URL that references a file on disk. There are special rules for this type of URL. Note that it may have a hostname! "localhost" is allowed, for example "file://localhost/foo" is the same as "file:///foo".

"FileSystem URL": A URL referring to a file reached via the FileSystem API described at http://www.w3.org/TR/file-system-api/. These are nested URLs, with compound schemes of e.g. "filesystem:file:" or "filesystem:https:". Parsed FileSystem URLs will have a nested inner_parsed() object containing information about the inner URL.

"Path URL": This is everything else. There is no standard on how to treat these URLs, or even what they are called. This library decomposes them into a scheme and a path. The path is everything following the scheme. This type of URL includes "javascript", "data", and even "mailto" (although "mailto" might look like a standard scheme in some respects, it is not).

Design

The library is divided into four layers. They are listed here from the lowest to the highest; you can use any portion of the library as long as you embed the layers below it.

  1. Parsing

    At the lowest level is the parsing code. The files encompassing this are url_parse.* and the main include file is src/url_parse.h. This code will, given an input string, parse it into the most likely form of a URL.

Parsing cannot fail and does no validation. The exception is the port number, which it currently validates, but this is a bug. Given crazy input, the parser will do its best to find the various URL components according to its rules (see url_parse_unittest.cc for some examples).

To use this, an application will typically use ExtractScheme to determine the type of a given input URL, and then call one of the initialization functions: "ParseStandardURL", "ParsePathURL", or "ParseFileURL". This will result in a "Parsed" structure which identifies the substrings of each identified component.

  1. Canonicalization

    At the next highest level is canonicalization. The files encompasing this are url_canon.* and the main include file is src/url_canon.h. This code will validate an already-parsed URL, and will convert it to a canonical form. For example, this will convert host names to lowercase, convert IP addresses into dotted-decimal notation, handle encoding issues, etc.

This layer will always do its best to produce a reasonable output string, but it may return that the string is invalid. For example, if there are invalid characters in the host name, it will escape them or replace them with the Unicode "invalid character" character, but will fail. This way, the program can display error messages to the user with the output, log it, etc. and the string will have some meaning.

Canonicalized output is written to a CanonOutput object which is a simple wrapper around an expanding buffer. An implementation called RawCanonOutput is proivided that writes to a raw buffer with a fixed amount statically allocated (for performance). Applications using STL can use StdStringCanonOutput defined in url_canon_stdstring.h which writes into a std::string.

A normal application would call one of the four high-level functions "CanonicalizeStandardURL", "CanonicalizeFileURL", "CanonicalizeFileSystemURL", and CanonicalizePathURL" depending on the type of URL in question. Lower-level functions are also provided which will canonicalize individual parts of a URL (for example, "CanonicalizeHost").

Part of this layer is the integration with the host system for IDN and encoding conversion. An implementation that provides integration with the ICU (http://www-306.ibm.com/software/globalization/icu/index.jsp) is provided in src/url_canon_icu.cc. The embedder may wish to replace this file with implementations of the functions for their own IDN library if they do not use ICU.

  1. Application integration

    The canonicalization and parsing layers do not know anything about the URI schemes supported by your application. The parsing and canonicalization functions are very low-level, and you must call the correct function to do the work (for example, "CanonicalizeFileURL").

The application integration in url_util.* provides wrappers around the low-level parsing and canonicalization to call the correct versions for different identified schemes. Embedders will want to modify this file if necessary to suit the needs of their application.

  1. URL object

    The highest level is the "URL" object that a C++ application would use to to encapsulate a URL. Embedders will typically want to provide their own URL object that meets the requirements of their system. A reasonably complete example implemnetation is provided in src/gurl.*. You may wish to use this object, extend or modify it, or write your own.

Whitespace

Sometimes, you may want to remove linefeeds and tabs from the content of a URL. Some web pages, for example, expect that a URL spanning two lines should be treated as one with the newline removed. Depending on the source of the URLs you are canonicalizing, these newlines may or may not be trimmed off.

If you want this behavior, call RemoveURLWhitespace before parsing. This will remove CR, LF and TAB from the input. Note that it preserves spaces. On typical URLs, this function produces a 10-15% speed reduction, so it is optional and not done automatically. The example GURL object and the url_util wrapper does this for you.

Tests

There are a number of _unittest.cc and _perftest.cc files. These files are not currently compilable as they rely on a not-included unit testing framework Tests are declared like this: TEST(TestCaseName, TestName) { ASSERT_TRUE(a); EXPECT_EQ(a, b); } If you would like to compile them, it should be straightforward to define the TEST macro (which would declare a function by combining the two arguments) and the other macros whose behavior should be self-explanatory (EXPECT is like an ASSERT, but does not stop the test, if you are doing this, you probably don't care about this difference). Then you would define a .cc file that calls all of these functions.