ctds-usyd / scopus

Extracts Elsevier Scopus Snapshot to relational database
GNU General Public License v3.0

Elsevier Scopus custom data extraction and RDBMS export

This tool extracts data from the Elsevier Scopus Snapshot XML into a more usable relational database structure.

The loader is incremental and atomically consistent: it should be possible to start, stop and restart extraction on the same dataset without leaving the stored documents in an inconsistent state. This incrementality, however, assumes that the documents and related data are fixed across restarts.
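
For illustration only: this kind of stop/restart safety is usually achieved by committing each batch in a single database transaction. The following sketch uses Django's transaction.atomic and is an assumption about the approach, not the loader's actual code:

from django.db import transaction

def load_batch(records):
    # Hypothetical batch commit: either every record in the batch is
    # saved, or none are, so a crash mid-batch leaves the database
    # consistent and the batch can simply be re-run after a restart.
    with transaction.atomic():
        for record in records:
            record.save()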

The documentation below describes how to run the tool and the schema of the database.

Setup and Running

The loading script should ideally be on the same machine as the source data, which can remain zipped (but not encrypted; see below) for the process to run. If the data is not on the same machine, it should be mounted/mapped locally.

The loading script need not be on the same machine as the target database.

Windows and MSSQL setup

This section describes running the loader on Windows against an MSSQL instance. As an alternative, you could run the loader on a non-Windows machine using django-pyodbc.

Python setup

The simplest way to install Python with lxml and related libraries is to get Anaconda for Python 2 (the version current at the time of writing). Install it and check the option to include Anaconda on the PATH.

Loader setup

Download this repository and extract it to a new directory, say E:\scopus-extract.

In the command prompt, go to E:\scopus-extract and run the following two commands:

E:\scopus-extract> conda install -y lxml
E:\scopus-extract> pip install django-mssql==1.8 --editable=.

Create tables in database
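
Assuming the project's Django settings are configured for your MSSQL instance, the standard Django migration command should create the tables:

E:\scopus-extract> python manage.py migrate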

Run the loading

Linux/Unix and MySQL setup

Requirements

You require the following in the loading environment:

MySQL setup

Loader setup

Download this repository and extract it to a new directory, say ~/scopus-extract.

In ~/scopus-extract, run

$ pip install mysqlclient --editable=.

Create tables in database
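
As in the Windows setup, the standard Django migration command should create the tables (assuming the settings point at your MySQL database):

$ ./manage.py migrate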

Run the loading

An example invocation:

$ nohup ./extract_to_db.sh /home/compressed-scopus/2011/ 2> 2011.txt

Schema

We only attempt to represent a limited subset of the fields per document.

Scopus/models.py uses the Django object-relational mapper (ORM) to represent the relational schema for the fields we extract. (Although tested with MySQL, it should be possible to use this ORM with other database backends like PostgreSQL, SQLite, etc.)

The tables ("models" in Django terminology) we store are:

We use Scopus IDs where we can, notably:

While we tend to omit explicit foreign keys to avoid database overhead, the following relationships hold between tables:

Fields are described in the models.py file and in the Django-admin interface (see below).

Please note that for a ForeignKey declaration, Django appends _id to create the actual column name stored in the relational DB. For example:

class Document(django.db.models.Model):
    ...
    source = django.db.models.ForeignKey(Source)
    ...

results in:

CREATE TABLE "document" (
...
"source_id" BIGINT NULL REFERENCES "source" ("source_id")
...
);

Limited source details

In order to keep the Source table simple, we do not include many details, such as volume and issue number, or ISBN for monographs.

We also store at most one print and electronic ISSN on each Source record. Since a single journal may have, for instance, changed names over time, and hence may have multiple ISSNs over time, there may be multiple Source records for what Scopus considers a single source ID.

Multiple authorship

Rather than store authors and affiliations as distinct tables, we denormalise the data for authorship. The order of authorship (first author, second author, etc.) is indicated by the Authorship.order field. There may be multiple Authorship records for each (document, order) pair; this usually happens when an author has multiple affiliations. Thus each record should indicate the relationship between a document, an author, and an affiliation. (Some documents in the dataset lack an affiliation or author ID, so these may be NULL.)
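
For example, a document with two authors where the first author holds two affiliations yields three Authorship rows. A sketch of querying them with the Django ORM (the document_id, author_id and affiliation_id field names are assumptions for illustration):

from Scopus.models import Authorship

# Hypothetical query: all authorship rows for one document, in author
# order; repeated `order` values mean multiple affiliations for the
# same author position.
for a in Authorship.objects.filter(document_id=84893747227).order_by('order'):
    print(a.order, a.author_id, a.affiliation_id)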

Input

The Scopus data is provided as a series of XML files, grouped together and compressed with Zip, then encrypted with GPG.

Since XML tends to compress well, our loader works directly with the Zip files (please decrypt them first), rather than paying the disk/network I/O cost of transferring large quantities of raw XML.
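
For instance, Python's standard zipfile module can read each XML member straight out of the archive without extracting it to disk; a minimal sketch (the archive name is hypothetical):

import zipfile

with zipfile.ZipFile('2011.zip') as zf:
    for name in zf.namelist():
        if name.endswith('.xml'):
            xml_bytes = zf.read(name)  # read the member in memory
            # ... hand xml_bytes to the XML extraction step ...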

Within the Zip, each document is represented by two files: one named after the document (e.g. 2-s2.0-84893747227.xml) and an accompanying citedby.xml.

Here the document's ID, called an EID, is the integer 84893747227. The prefix 2-s2.0- is constant and appears to reference the Scopus extraction version.

The first file, 2-s2.0-84893747227.xml, stores most of the data about a single article, as defined by this XSD and documented here. citedby.xml stores information about the documents that cite EID 84893747227.

XML to Python dicts

Scopus/xml_extract.py initially extracts the fields we need from XML. It uses XPath via lxml.etree to extract specific elements from the document, and performs some cleaning on the result, to produce an intermediate representation.
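
A minimal sketch of this style of extraction; the XPath query below is hypothetical, and the real queries live in Scopus/xml_extract.py:

from lxml import etree

with open('2-s2.0-84893747227.xml', 'rb') as f:
    doc = etree.parse(f).getroot()

# Hypothetical XPath for a document title.
titles = doc.xpath('//citation-title//text()')
title = ' '.join(t.strip() for t in titles) or None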

Quality assurance in the loader

The relational schema tends to rely on fields having single values, while XML (where the XSD permits) allows them to have zero or more values. We try to ensure that we have identified anomalies in the data. To do so we use a helper function, xpath_get_one, which will, by default, give a warning if no element matches the XPath query, or if more than one matches. After analysing some cases where warnings were triggered, we have resolved issues and disabled warnings where we felt appropriate.
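
The helper might look roughly like this (a sketch of the idea, not the actual implementation):

import logging

def xpath_get_one(node, query, warn=True):
    # Return the single match for `query`, warning on anomalies.
    matches = node.xpath(query)
    if warn and len(matches) != 1:
        logging.warning('expected 1 match for %r, got %d', query, len(matches))
    return matches[0] if matches else None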

Exceptions raised in the extraction process are also logged.

Log messages here are output as valid JSON so that they can be aggregated and analysed using any tools capable of processing JSON.

Standalone extraction

The extraction code is written to be independent of the database schema and storage, so it can be run standalone. This allows the extraction to be used to feed a different storage solution (e.g. a graph database), as sketched below.
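
Assuming xml_extract exposes a function mapping one document's XML to a dict (extract_document is a hypothetical name; consult Scopus/xml_extract.py for the real entry points), standalone use could look like:

from Scopus import xml_extract

with open('2-s2.0-84893747227.xml', 'rb') as f:
    record = xml_extract.extract_document(f.read())  # hypothetical name

print(record)  # a plain Python dict, ready for any storage backend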

Python dicts to relational database

Scopus/db_loader.py manages loading into a relational database, making use of xml_extract. It will output progress messages such as:

WARNING:root:Record 2-s2.0-0142168590.xml from 2003 was processed.

Records are stored in large batches, so progress reporting is infrequent.
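
Batched inserts with the Django ORM are typically done via bulk_create; the following is an assumption about what db_loader's batching might look like, not its actual code:

from Scopus.models import Document

BATCH_SIZE = 1000
buffer = []
for record in extracted_records:  # dicts produced by xml_extract
    buffer.append(Document(**record))
    if len(buffer) >= BATCH_SIZE:
        Document.objects.bulk_create(buffer)  # one INSERT per batch
        buffer = []
if buffer:
    Document.objects.bulk_create(buffer)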

Viewing the data with Django-admin

This feature is currently disabled in order to handle the shambles of MSSQL support!

The data and its relationships can be viewed and searched using a largely automatic interface provided by Django.

After the data is loaded, first create a user:

$ ./manage.py createsuperuser

then run a server instance (note that runserver is not very secure; see deployment docs for more robust configurations, and do not open runserver to a public IP address):

$ ./manage.py runserver
Performing system checks...
System check identified no issues (0 silenced).
November 09, 2016 - 01:25:22
Django version 1.10.2, using settings 'Scopus.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

Open http://127.0.0.1:8000/ to browse.

Advanced: Log analysis

Most errors and warnings in the loader logs are output with each log entry as a separate JSON object. They can be analysed using standard UNIX tools together with jq or another JSON-processing tool.

The following will find the most frequent issues in the log files:

cat *.log | grep -o '{"context.*' |
    jq -c 'del(.context) | del(.length) | del(.exception)' |
    sort | uniq -c | sort -n

The following will print any exception tracebacks:

cat *.log | grep -o '{"context.*' |
    jq -r 'select(.error == "Loading to database failed") |
           "::\(. | del(.exception))::\n \(.exception)"'

Authors

This package has been developed by Nikzad Babaii Rizvandi and Joel Nothman within the Sydney Informatics Hub. Copyright ©2016-2017, University of Sydney.