UUDigitalHumanitieslab / GrETEL-upload

Upload treebanks for use in GrETEL
http://gretel.hum.uu.nl
MIT License

GrETEL-upload

GrETEL-upload is an extension package for GrETEL that allows you to upload your own corpus or dataset. The application then automatically transforms your corpus into an Alpino XML treebank. After processing, the treebanks are searchable in GrETEL, and if you supply metadata, you can use it for filtering and analysis.

Local installation

Requirements

On top of a default LAMP installation (with PHP 7.x; PHP 8 is not currently supported), the following packages are required:

GrETEL-upload also requires the following external programs to be installed:

It is also possible to install using pip:

pip install -r requirements.txt

Make sure to modify config/common.php (see below) to point to the install location of corpus2alpino.

Configuration

You will have to provide configuration details in four files:

An example configuration for each can be found in application/config/{NAME}_default.php.

Update the Apache configuration to allow read-write access to gretel-upload (and gretel).
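A minimal sketch of turning the example files into working copies. It assumes the application reads each setting from {NAME}.php next to its {NAME}_default.php example; the function name copy_default_configs is hypothetical. Existing files are left untouched, so local changes are not overwritten.

```shell
# Copy every *_default.php in a config directory to the same name without
# the "_default" suffix, skipping targets that already exist.
# Assumption: the live file for {NAME}_default.php is {NAME}.php.
copy_default_configs() {
    config_dir="$1"
    for default in "$config_dir"/*_default.php; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$default" ] || continue
        target="${default%_default.php}.php"
        if [ ! -e "$target" ]; then
            cp "$default" "$target"
            echo "created $target"
        fi
    done
}

# Example (run from the source directory):
#   copy_default_configs application/config
```

After copying, edit each new file to fill in the values for your own installation.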

Database schema

Create the MySQL database gretel_upload. You can then run the command php index.php migrate in the source directory to create or migrate the database schema. See docs/schema.png for the current database schema (exported from phpMyAdmin).
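One possible way to script the database creation is sketched below. The helper create_db_sql is hypothetical and only prints the SQL statement, so you can inspect it before sending it to the server; the MySQL account used to run it is an assumption and depends on your setup.

```shell
# Print the SQL statement that creates the database if it does not exist yet.
# The database name is passed as the first argument.
create_db_sql() {
    printf 'CREATE DATABASE IF NOT EXISTS `%s`;\n' "$1"
}

# Example (assumes a MySQL account that is allowed to create databases):
#   create_db_sql gretel_upload | mysql -u root -p
#   php index.php migrate
```

The migrate command must be run from the source directory, after the database configuration file points at the new database.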

Permissions

Make sure the uploads directory is writable by the user running the Apache daemon (usually www-data). If you use the default files session driver, also create a writable sessions directory and refer to its absolute path in application/config/config.php.
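The following sketch prepares both directories. It assumes that uploads lives directly under the installation root and places sessions next to it; the sessions location, the example path and the function name prepare_writable_dirs are assumptions, and changing the owner to another user normally requires root.

```shell
# Create the uploads and sessions directories and hand them to the web
# server user. Arguments: installation root, owning user (name or uid).
prepare_writable_dirs() {
    root="$1"
    owner="$2"
    mkdir -p "$root/uploads" "$root/sessions" || return 1
    chown "$owner" "$root/uploads" "$root/sessions" || return 1
    # Owner gets full access; the session data is kept private.
    chmod u+rwx "$root/uploads" || return 1
    chmod 700 "$root/sessions" || return 1
}

# Example (hypothetical path, run as root):
#   prepare_writable_dirs /var/www/html/gretel-upload www-data
```

The absolute path of the sessions directory created this way is the value to enter in application/config/config.php.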

Start-up

Start both Alpino and BaseX as server instances by running the following two commands:

basexserver -S
./alpino.sh

Then, navigate to the installation directory in your web browser (e.g. localhost/gretel-upload/) to start using GrETEL-upload.

Production: Cron Task

For production servers, a cron job is required to process uploaded treebanks. Schedule the following command to run regularly, e.g. every 5 minutes:

/usr/bin/php {root}/index.php cron process
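As an illustration, the 5-minute schedule can be written as a crontab entry. The helper cron_entry below is hypothetical; it only prints the entry for a given installation root, which stands in for {root}. The example path is an assumption.

```shell
# Print a crontab entry that runs the processing task every 5 minutes.
# The installation root is passed as the first argument.
cron_entry() {
    printf '*/5 * * * * /usr/bin/php %s/index.php cron process\n' "$1"
}

# Example: append the entry to the current user's crontab.
#   (crontab -l 2>/dev/null; cron_entry /var/www/html/gretel-upload) | crontab -
```

Install the entry for a user that can read the source directory and write to the uploads directory.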

Uploading corpora

Formats

Currently, three formats are supported: LASSY-XML, CHAT and plain text (UTF-8 encoded). When you upload a set of texts (always in a zipped folder, possibly consisting of multiple directories), you can specify whether the text is already sentence- and/or word-tokenized. If not, the application will do this for you.
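Because plain text must be UTF-8 encoded, it can help to check files before zipping them. The sketch below relies on the standard iconv tool; the function names and the .txt extension are assumptions for illustration.

```shell
# Succeed only if the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# List all .txt files under a directory that are not valid UTF-8.
find_non_utf8() {
    find "$1" -type f -name '*.txt' | while IFS= read -r file; do
        is_utf8 "$file" || echo "$file"
    done
}

# Example (hypothetical folder name):
#   find_non_utf8 my-corpus
```

Any file listed by find_non_utf8 should be converted to UTF-8 before the folder is zipped and uploaded.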

Metadata

GrETEL-upload allows metadata annotation using the PaQu metadata format. This metadata will be converted to LASSY-XML during import.

The GrETEL-upload interface then allows you to select which facet to use for filtering each metadata field in GrETEL. For example, you can display a metadata column called 'year' as a slider, a dropdown list or a set of checkboxes. You can also choose to hide certain columns.

Libraries

PHP

GrETEL-upload is written in PHP and created with CodeIgniter 3.1.11. The application uses the following libraries:

JavaScript

GrETEL-upload uses the following JavaScript libraries:

CSS

GrETEL-upload is created with Pure CSS.

Images

GrETEL-upload uses the FamFamFam silk icon set.

API

GrETEL-upload has an API for retrieving data from the database:

Tests

The test suite is built with ci-phpunit-test, which uses PHPUnit. You can run the tests by navigating to the application/tests directory and calling phpunit.

Demo

A working version is available at http://gretel.hum.uu.nl.