
LibExpat - Julia wrapper for libexpat


Usage

XPath queries on fully parsed tree

This interface has only three relevant APIs.

Examples for element_path are:

If only one sub-element exists, the index is assumed to be 1 and may be omitted.

You can also navigate the returned ETree object directly, i.e., without using LibExpat.find. The relevant members of ETree are:

type ETree
    name        # XML Tag
    attr        # Dict of tag attributes as name-value pairs
    elements    # Vector of child nodes (ETree or String)
end
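As a rough illustration of walking such a tree by hand, the sketch below uses a hypothetical stand-in struct named Node that mirrors the three members listed above. Node and the helpers children_named and direct_text are assumptions made for this example and are not part of LibExpat; a real ETree would come from the parser rather than being built by hand.

```julia
# Hypothetical stand-in mirroring the three documented ETree members;
# it is not the library's own type.
struct Node
    name::String
    attr::Dict{String,String}
    elements::Vector{Any}   # child Node values or String text
end

# Return the child elements with the given tag name, skipping text children.
children_named(node::Node, tag::AbstractString) =
    [el for el in node.elements if el isa Node && el.name == tag]

# Concatenate the text that sits directly inside a node.
direct_text(node::Node) = join(el for el in node.elements if el isa String)

# Hand-built tree equivalent to:
#   <library><book id="1">Dune</book><book id="2">Emma</book></library>
tree = Node("library", Dict{String,String}(), Any[
    Node("book", Dict("id" => "1"), Any["Dune"]),
    Node("book", Dict("id" => "2"), Any["Emma"]),
])

books  = children_named(tree, "book")
titles = [direct_text(b) for b in books]
ids    = [b.attr["id"] for b in books]
```

Because elements can mix child nodes and plain strings, any direct traversal has to check the type of each entry before reading name or attr from it.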

The xpath search consists of two parts: the parser and the search. Calling xpath"some/xpath[expression]" or xpath(xp::String) constructs an XPath object that can be passed as the second argument to the xpath search. The search can be run via parseddata[xpath"string"] or xpath(parseddata, xpath"string"). Using the xpath string macro is not required, but it is recommended for performance and because it allows $variable interpolation. When xpath is called as a macro, it parses path elements starting with $ as Julia variables and performs limited string interpolation:

xpath"/a/$b/c[contains(.,'\$x$y$(z)!\'')]"
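The two search forms described above might be used as in the following minimal sketch. It assumes that xp_parse is the function that turns an XML string into an ETree (that name does not appear in this section) and that a search returns a collection of matching nodes. The sample document is made up for illustration.

```julia
using LibExpat

# Assumption: xp_parse parses an XML string into an ETree.
doc = xp_parse("<a><b><c>one</c></b><c>two</c></a>")

# Search by indexing the parsed data with an XPath object ...
matches_index = doc[xpath"//c"]

# ... or by passing the same kind of XPath object to the xpath function.
matches_call = xpath(doc, xpath"//c")

# Each match is expected to expose the ETree members, such as name.
names = [m.name for m in matches_index]
```

Both forms are meant to be equivalent; the indexing form is simply shorter when the parsed data is already at hand.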

The parser handles most of the XPath 1.0 specification. The following features are currently missing:

Streaming XML parsing

If you do not want to store the whole tree in memory, LibExpat also lets you define callbacks for streaming parsing. To parse a document, create a new XPCallbacks instance and define the callbacks you want to receive.

type XPCallbacks
    # These are all the callbacks available so far; by default each is initialised with a dummy function.
    # Each callback receives an XPStreamHandler as its first argument, followed by the parameters listed below:
    start_cdata     # (..) -- Start of a CDATA section
    end_cdata       # (..) -- End of a CDATA section
    comment         # (.., comment::String) -- A comment
    character_data  # (.., txt::String) -- A character data section
    default         # (.., txt::String) -- Handler for any characters in the document which wouldn't otherwise be handled.
    default_expand  # (.., txt::String) -- Default handler that doesn't inhibit the expansion of internal entity references.
    start_element   # (.., name::String, attrs::Dict{String,String}) -- Start of a tag/element
    end_element     # (.., name::String) -- End of a tag/element
    start_namespace # (.., prefix::String, uri::String) -- Start of a namespace declaration
    end_namespace   # (.., prefix::String) -- End of the scope of a namespace
end

Using an initialized XPCallbacks object, you can start parsing with xp_streaming_parse, which takes the XML document as a string, the XPCallbacks object, and an arbitrary data object that can be used to reference some context during parsing. This data object is accessible through the data attribute of the XPStreamHandler instance passed to each callback.
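A minimal sketch of this flow, under a few assumptions: XPCallbacks() can be constructed without arguments (as the default dummy callbacks suggest), the data object is passed as the third positional argument as described above, and the counts dictionary and sample XML are made up for illustration.

```julia
using LibExpat

# Hypothetical context object: counts how often each tag name is opened.
counts = Dict{String,Int}()

# Assumption: the no-argument constructor fills every callback with a dummy.
cbs = XPCallbacks()

# The first argument is the XPStreamHandler; its data attribute is the
# context object handed to xp_streaming_parse below.
cbs.start_element = function (handler, name, attrs)
    handler.data[name] = get(handler.data, name, 0) + 1
end

xml = "<list><item/><item/><item/></list>"

# Assumption: document string, callbacks, then the data object, in that order.
xp_streaming_parse(xml, cbs, counts)
```

Only start_element is overridden here; the remaining callbacks keep their defaults, so the document is consumed without a tree being built.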

If your data is too large to fit into memory, you can instead use xp_streaming_parsefile, which reads the XML document from a file and passes it to expat a few lines at a time (the number of lines read per chunk is controlled by the keyword argument bufferlines).

IJulia Demonstration Notebook

LibExpat IJulia Demo