aharonium / opensiddur.org

Every post and page, category and contributor on opensiddur.org (in JSON format)
https://opensiddur.org
GNU Lesser General Public License v3.0

Parsing WXR (WordPress eXtended RSS) XML file data #1

aharonium opened this issue 5 years ago

aharonium commented 5 years ago

Cross-posted from https://groups.google.com/d/msg/opensiddur-tech/BwTd-_7yZgk/7agCfWQ_BQAJ

I mentioned in the previous email that the site data for opensiddur.org is now available in downloadable WXR (WordPress eXtended RSS) XML files.

Making these files publicly accessible is mainly intended as a way for researchers to access the site data without having to scrape opensiddur.org. Aside from RSS, there's really no public API (that I know of) for accessing all 960+ posts on opensiddur.org.

But I have another objective for beginning to move our site data into XML. For all of our transcribed text on opensiddur.org, I want to separate our data from its presentation.

A digression: Such a goal should be no surprise to folk watching this project from its early days. Efraim and I envisioned the Open Siddur as a database by which we could serve liturgists sharing new prayers, scholars researching liturgy, and crafters compiling collections of prayers and related work into new prayerbooks. By separating data from its presentation, that data could be presented in an infinity of ways in an infinity of variations. Our project was founded with great hope in 2009 with this in mind. However, by late 2010, it was clear we wouldn't be realizing this vision soon. So, I began to do something simple and useful with my own modest skills -- just to help collect and curate liturgical content contributed by our community on the wordpress site that had up till then mainly served as a blogspace. In that way, opensiddur.org became the CMS it is today. Meanwhile, development continued on our collaborative transcription environment and siddur building web application at app.opensiddur.org.

Back to these WXR files. By themselves they are large, unwieldy XML files containing the raw HTML and postmeta data of every one of the posts and pages of opensiddur.org. It seems to me that the next step in making this data accessible is to parse these files into 960+ individual post files containing both the raw HTML data and relevant postmeta data such as title, author, co-author(s), content license, date published, categories, and tags -- and to do as much as we can to provide that as structured data. (Further steps can link these files to the manifests of page images linked to the Internet Archive, make them into nice JLPTEI conforming XML, and write some XSLT to display them once again in HTML.)
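To make that concrete, here's a minimal sketch of the kind of parsing I have in mind, using Python's xml.etree.ElementTree. The filename and the one-JSON-file-per-post output layout are just placeholders, and the namespace URIs are assumptions -- check them against the xmlns declarations at the top of the actual WXR file (the wp: export version in particular varies):

import json
import os
import xml.etree.ElementTree as ET

# Assumed namespaces -- match these to the xmlns declarations in the WXR file.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}

def parse_wxr(path):
    """Yield one dict per <item> in a WXR export: raw HTML body plus metadata."""
    tree = ET.parse(path)
    for item in tree.getroot().iter("item"):
        yield {
            "title": item.findtext("title"),
            "author": item.findtext("dc:creator", namespaces=NS),
            "date_published": item.findtext("wp:post_date", namespaces=NS),
            "slug": item.findtext("wp:post_name", namespaces=NS),
            "post_type": item.findtext("wp:post_type", namespaces=NS),
            "categories": [c.text for c in item.findall("category")
                           if c.get("domain") == "category"],
            "tags": [c.text for c in item.findall("category")
                     if c.get("domain") == "post_tag"],
            "html": item.findtext("content:encoded", namespaces=NS),
        }

if __name__ == "__main__":
    os.makedirs("posts", exist_ok=True)
    # "posts.wxr.xml" is a placeholder name for the downloaded posts export.
    for n, post in enumerate(parse_wxr("posts.wxr.xml")):
        name = post["slug"] or "post-{:04d}".format(n)
        with open(os.path.join("posts", name + ".json"), "w", encoding="utf-8") as out:
            json.dump(post, out, ensure_ascii=False, indent=2)

(Co-authors and the license postmeta would still need to be added on top of this; more on those below.)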

I've had some success in parsing the WXR posts file into individual text files containing the body of each post using a wxr2txt python script I found here: https://gist.github.com/ruslanosipov/b748a138389db2cda1e8

Unfortunately, that script doesn't bother to copy over the postmeta data along with the HTML in the post body. So I'm still trying to figure out what I need to add to this script to better parse the WXR file. (I also noticed that the script seems to choke on the pages WXR file.) So there's room for improvement for folk who want to help out and flex their Python skills. Python's HTMLParser module should come into service here, as can be seen in this fork of the script: https://gist.github.com/aegis1980/4d00c381b0eb67f83cf93365cd7b69ad

(For some reason, HTMLParser isn't working for me in my Python install, so if you can get the above fork to work, let me know.)
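One guess: the module was renamed between Python 2 and Python 3 (the HTMLParser module became html.parser), so an import written for one won't run on the other. For reference, here's a minimal sketch of how I'd expect the Python 3 class to be used to strip a post body down to plain text -- the class and function names here are my own, not anything from the forked script:

from html.parser import HTMLParser  # in Python 2 this lived in the HTMLParser module

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, discarding the tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()

# e.g. html_to_text("<p>A prayer of thanksgiving.</p>") returns "A prayer of thanksgiving."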

So have fun experimenting with the site data and this wxr2txt.py script -- and let me know what success you have in parsing the site data.

aharonium commented 5 years ago

I've forked ruslanosipov's python script: https://gist.github.com/aharonium/1d148b57e2b8488f68e2f2781ce92e00

and the output is here: https://github.com/aharonium/opensiddur.org/tree/master/posts/HTML

Mainly I've gotten stuck on grabbing the co-author(s), tag(s), and category(ies). Also, the license data.

co-authors, tags, and categories look like so:

<category domain="author" nicename="cap-aharon-varady"><![CDATA[aharon.varady]]></category>
<category domain="author" nicename="cap-milton-steinberg"><![CDATA[milton.steinberg]]></category>

The "cap-" prefix for the authors above is a clue that the data is preserved by a WordPress plugin, Co-Authors Plus, which adds a feature to WordPress (multiple author attribution) that it otherwise lacks.

<category domain="post_tag" nicename="20th-century-gregorian"><![CDATA[20th century C.E.]]></category>
<category domain="post_tag" nicename="58th-century-a-m"><![CDATA[58th century A.M.]]></category>
<category domain="post_tag" nicename="english-vernacular-prayer"><![CDATA[English vernacular prayer]]></category>
<category domain="post_tag" nicename="american-jewry-of-the-united-states"><![CDATA[American Jewry of the United States]]></category>
<category domain="category" nicename="thanksgiving-day"><![CDATA[Thanksgiving Day (4th Thursday of November)]]></category>

In this example there's only one category, but often there are more.
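For what it's worth, here's a rough sketch of how I'd expect to gather those category elements from a single item (an ElementTree element), grouped by domain, with the Co-Authors Plus "cap-" prefix stripped off the author nicenames. The function name and the dict-of-lists layout are just my own choices, not anything the WXR format requires:

def collect_terms(item):
    """Group an <item>'s <category> children by domain: author, post_tag, category."""
    terms = {"author": [], "post_tag": [], "category": []}
    for cat in item.findall("category"):
        domain = cat.get("domain")
        if domain not in terms:
            continue
        entry = {"name": cat.text, "nicename": cat.get("nicename", "")}
        # Co-Authors Plus prefixes author nicenames with "cap-"; strip it off.
        if domain == "author" and entry["nicename"].startswith("cap-"):
            entry["nicename"] = entry["nicename"][len("cap-"):]
        terms[domain].append(entry)
    return terms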

The license data appears something like this:

<wp:postmeta>
    <wp:meta_key><![CDATA[open_content_license]]></wp:meta_key>
    <wp:meta_value><![CDATA[<a href='https://creativecommons.org/publicdomain/zero/1.0/'>Creative Commons Zero (CC 0) Universal license</a> a Public Domain dedication]]></wp:meta_value>
</wp:postmeta>

The problem is that there are a lot of wp:meta_key/value pairs!

So it makes sense to me to create an array, store all the key/value pairs, and then grab the data from the right one (for each item).

The only problem is that I haven't figured out how to work with arrays in Python yet, or how to query them. Or whether there's an easier/faster way.
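If it helps, the Python structure that fits here is probably a dict rather than an array: walk every wp:postmeta child of the item, store each meta_key/meta_value pair, then look up the one key you want. A minimal sketch -- the wp: namespace URI is an assumption, so match it to the xmlns:wp declaration in the actual file:

WP = "{http://wordpress.org/export/1.2/}"  # assumed export version; check the file's xmlns:wp

def collect_postmeta(item):
    """Return all of an <item>'s wp:meta_key/wp:meta_value pairs as a dict."""
    meta = {}
    for pm in item.findall(WP + "postmeta"):
        key = pm.findtext(WP + "meta_key")
        value = pm.findtext(WP + "meta_value")
        meta[key] = value
    return meta

# then, for example:
# license_html = collect_postmeta(item).get("open_content_license")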