INL / corpus-frontend

BlackLab Frontend, a feature-rich corpus search interface for BlackLab.

BlackLab Frontend

About

Intro

This is a corpus search application that works with BlackLab Server. At the Dutch Language Institute, we use it to publish our corpora such as CHN (CLARIN login required), Letters as Loot and AutoSearch (CLARIN login required).

How to use

Help is contained in the application in the form of a page guide that can be opened by clicking the button on the right of the page.

Installation

Requirements

Download a release

Releases can be downloaded here.

Building from source

For further development and debugging help, see the Development section.

Using Docker

Make sure you enable BuildKit (e.g. export DOCKER_BUILDKIT=1) before building the image.

To create a container with BlackLab Frontend and Server, run:

docker-compose up --build 

The config file ./docker/config/corpus-frontend.properties will be mounted inside the container. If you need to change some settings, you can set the CONFIG_PATH environment variable to read corpus-frontend.properties from a different directory.

If you have an indexed BlackLab corpus that you want to access, you can set CORPUS_DIR to this directory and CORPUS_NAME to the name this corpus should have, e.g.:

CORPUS_DIR="/tmp/mycorpus" CORPUS_NAME="my-awesome-corpus" docker-compose up --build

See next section for how to configure BlackLab Frontend.

Configuration

Main configuration file

Corpus-Frontend is configured using a properties file.

The application will normally look for this file in the same places as BlackLab. That is, the following locations, starting from the top:

The file name must be the same as the context path of the corpus-frontend application. That's the URL under which the corpus-frontend is reachable in the browser. Often, if you don't configure the context path, the context path will be the name of the .war file.

Examples:

Example file (most values shown here are the default values):


# The url under which the back-end can reach blacklab-server.
# Separate from the front-end to allow connections for proxy situations
#  where the paths or ports may differ internally and externally.
blsUrl=http://localhost:8080/blacklab-server/

# The url under which the client can reach blacklab-server.
blsUrlExternal=/blacklab-server/

# The url under which the client can reach the corpus-frontend.
# May be needed if the corpus-frontend is behind a proxy that changes the url.
# This setting actually defaults to the contextPath of the servlet, so this is just an example.
cfUrlExternal=/corpus-frontend/

# Optional directory where you can place files to further configure and customize
#  the interface on a per-corpus basis.
# Files should be placed in a directory with the name of your corpus, e.g. files
#  for a corpus 'MyCorpus' should be placed under 'corporaInterfaceDataDir/MyCorpus/'.
corporaInterfaceDataDir=/etc/blacklab/projectconfigs/

# Optional directory for default/fallback settings across all your corpora.
# The name of a directory directly under the corporaInterfaceDataDir.
# Files such as the help and about page will be loaded from here
#  if they are not configured/available for a corpus.
# If this directory does not exist or is not configured,
#  we'll use internal fallback files for all essential data.
corporaInterfaceDefault=default

# Path to frontend javascript files (can be configured to aid development, e.g.
#  loading from an external server so the java web server does not need
#  to constantly reload, and hot-reloading/refreshing of javascript can be used).
jspath=/corpus-frontend/js

# An optional banner message that shows above the navbar.
#  It can be hidden by the user by clicking an embedded button, and stores a cookie to keep it hidden for a week.
#  A new banner message will require the user to explicitly hide it again.
# Simply remove this property to disable the banner.
bannerMessage=<span class="fa fa-exclamation-triangle"></span> Configure this however you see fit, HTML is allowed here!

# Cache xslt and search.xml files. Set to false to disable caching, which is useful during development.
cache=true

# Show or hide the debug info checkbox in the settings menu on the search page.
# N.B. The debug checkbox will always be visible when using webpack-dev-server during development.
# It can also be toggled by calling `debug.show()` and `debug.hide()` in the browser console.
debugInfo=false

# Set the "withCredentials" option for all ajax requests made from the client to the (blacklab/frontend)-server. 
# Passes authentication cookies to blacklab-server.
# This may be required if your server is configured to use authentication.
# NOTE: this only works if the frontend and backend are hosted on the same domain, or when the server does not pass "*" for the Access-Control-Allow-Origin header. 
withCredentials=false

# Make the server side of corpus-frontend pass some authentication headers to BlackLab.
# The value read from the source configured below is proxied to the target in requests to BlackLab.
# In this case that is the Authorization header, which will be sufficient for most needs (basic auth, oauth2, oidc).
# When running behind something like oauth2-proxy, you could set these to x-forwarded-email, for example, to pass along the email header from corpus-frontend to BlackLab. (BlackLab will need its AuthSystem to be configured to use this header as well, of course.)
auth.source.name=Authorization
auth.source.type=header
auth.target.name=Authorization
auth.target.type=header
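
To make the four auth.* properties more concrete, here is a minimal JavaScript sketch of the forwarding idea: take a value from the configured source on the incoming request and place it on the configured target of the request to BlackLab. It is an illustration only, not the actual (Java) implementation in the backend. The function name forwardAuth and the plain-object header maps are assumptions, and only the header type shown in the example configuration is handled.

```javascript
// Illustrative sketch only: the real forwarding happens in the Java backend.
// `forwardAuth` and the plain-object header maps are hypothetical.
function forwardAuth(incomingHeaders, config) {
  const outgoing = {};
  // Only the "header" type from the example configuration is handled here.
  if (config['auth.source.type'] !== 'header' || config['auth.target.type'] !== 'header') {
    return outgoing;
  }
  // HTTP header names are case-insensitive, so compare in lower case.
  const wanted = config['auth.source.name'].toLowerCase();
  for (const [name, value] of Object.entries(incomingHeaders)) {
    if (name.toLowerCase() === wanted) {
      outgoing[config['auth.target.name']] = value;
    }
  }
  return outgoing;
}

// Example: pass the email header set by a proxy along to BlackLab.
const authConfig = {
  'auth.source.name': 'x-forwarded-email',
  'auth.source.type': 'header',
  'auth.target.name': 'x-forwarded-email',
  'auth.target.type': 'header',
};
const headersForBlackLab = forwardAuth(
  { 'X-Forwarded-Email': 'user@example.org', Accept: 'text/html' },
  authConfig
);
console.log(headersForBlackLab);
```

Only the configured header is copied; unrelated headers such as Accept are left out of the forwarded set.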

Adding corpora

Corpora may be added manually or uploaded by users (if configured).

After a corpus has been added, the corpus-frontend will automatically detect it; a restart should not be required.

Allowing users to add corpora

Configuring BlackLab

To allow this, BlackLab needs to be configured properly (user support needs to be enabled and user directories need to be configured). See the BlackLab documentation.

When BlackLab is properly configured, two new sections will appear on the main corpus overview page. They allow you to define your own configurations to customize how BlackLab will index your data, create private corpora (up to 10 by default, but this limit can be customized in BlackLab), and add data to them.

Per corpus configuration is not supported for user corpora created through the Corpus-Frontend.
This means adding directories for user corpora in corporaInterfaceDataDir won't work.

Formats

Out of the box, users can create corpora and upload data in any of the formats supported by BlackLab (tei, folia, chat, tsv, plaintext and more). In addition, users can define their own formats or extend the built-in formats.

Index url

There is also a hidden/experimental page (/corpus-frontend/upload/) for externally linking to the corpus-frontend to automatically index a file from the web. It can be used to link to the frontend from external web services that output indexable files. It requires user uploading to be enabled, and there should be a cookie/query parameter present to configure the user name (which one depends on how BlackLab's authentication is configured; the frontend doesn't care and just passes everything along). Parameters are passed as query parameters:

file=http://my-service.tld/my-file.zip
# optional
format=folia
# optional
corpus=my-corpus-name

If the user does not own a corpus with this name yet, it's automatically created. After indexing is complete, the user is redirected to the search page.
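
As a rough sketch of how an external service might build such a link, the snippet below assembles the query string with the standard URL API, which takes care of encoding the file URL. The helper name buildUploadLink and the base URL http://localhost:8080/corpus-frontend/ are assumptions for illustration; the parameter names are the ones listed above.

```javascript
// Build a link to the experimental upload page.
// `buildUploadLink` and the base URL used below are placeholders.
function buildUploadLink(frontendBase, { file, format, corpus }) {
  // frontendBase must end in a slash so that 'upload/' is appended to it.
  const url = new URL('upload/', frontendBase);
  url.searchParams.set('file', file); // location of the file to index
  if (format) url.searchParams.set('format', format); // optional
  if (corpus) url.searchParams.set('corpus', corpus); // optional
  return url.toString();
}

const uploadLink = buildUploadLink('http://localhost:8080/corpus-frontend/', {
  file: 'http://my-service.tld/my-file.zip',
  format: 'folia',
  corpus: 'my-corpus-name',
});
console.log(uploadLink);
```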

Customizing the Frontend

Customization options have gradually grown over the years, and have become a little cumbersome. We're aware and we'll eventually improve on this.

As an admin you can customize various aspects of the frontend, such as what data is shown in the results table, which filters are shown, etc.
Corpora can be individually customized/configured.
Corpora uploaded by users cannot be individually configured. User corpora will, however, use the default set of customizations if they exist.

First create the needed files

  1. Create the main customization directory. The default location is /etc/blacklab/projectconfigs/ on Linux, and C:\etc\blacklab\projectconfigs\ on Windows. It can be changed through the corporaInterfaceDataDir setting in corpus-frontend.properties.
  2. (Optionally) create a default configuration: create a directory default/ inside the config dir. The directory name can be changed through the corporaInterfaceDefault setting in corpus-frontend.properties.
  3. Create a separate directory for every corpus you want to configure. The names should be equal to the ID of the corpus in BlackLab.
  4. In the corpus' directory, create a static/ dir. Files in this directory will be available in the browser under corpus-frontend/my_corpus/static/.... The directory can be used for custom css and js, or whatever other files you need.

You should be left with the following directory structure:

/etc/blacklab/projectconfigs/ # the location set in the corporaInterfaceDataDir setting
  corpus-1/
    search.xml # see below.
    help.inc # see below.
    about.inc # see below.
    article.xsl # see below.
    meta.xsl # see below.
    static/
      # files needed by the website go here.
      custom.css
      custom.search.js # see below for what you can do in javascript.
      custom.article.js
  corpus-2/
    ...
  default/ # the name set in the corporaInterfaceDefault setting
    # fallbacks/default configuration goes here
    ... 

Good to know: customization and static files 'overlay' each other. The frontend will check the following locations in order, using the first location where a file is found:

  • The directory of the corpus itself (under corporaInterfaceDataDir)
  • The default directory
  • Inside the WAR
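
The lookup order above can be sketched in a few lines of JavaScript. This is not the frontend's actual implementation: the function resolveOverlay, the injected exists check, and the "(built-in)" marker for files inside the WAR are all hypothetical, and the paths assume the default settings from the example properties file.

```javascript
// Sketch of the overlay lookup. `exists` is injected so the order can be
// demonstrated without touching the file system.
function resolveOverlay(fileName, corpus, settings, exists) {
  const dataDir = settings.corporaInterfaceDataDir;
  const candidates = [
    `${dataDir}${corpus}/${fileName}`, // 1. the corpus' own directory
    `${dataDir}${settings.corporaInterfaceDefault}/${fileName}`, // 2. the default directory
  ];
  const found = candidates.find((path) => exists(path));
  // 3. fall back to the copy shipped inside the WAR (hypothetical marker)
  return found !== undefined ? found : `(built-in) ${fileName}`;
}

const overlaySettings = {
  corporaInterfaceDataDir: '/etc/blacklab/projectconfigs/',
  corporaInterfaceDefault: 'default',
};
const presentFiles = new Set([
  '/etc/blacklab/projectconfigs/example/search.xml',
  '/etc/blacklab/projectconfigs/default/search.xml',
  '/etc/blacklab/projectconfigs/default/help.inc',
]);
const fileExists = (path) => presentFiles.has(path);

console.log(resolveOverlay('search.xml', 'example', overlaySettings, fileExists));
console.log(resolveOverlay('help.inc', 'example', overlaySettings, fileExists));
console.log(resolveOverlay('about.inc', 'example', overlaySettings, fileExists));
```

With these example files, search.xml comes from the corpus directory, help.inc from the default directory, and about.inc from the built-in fallback.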

Example customization.

Let's perform a simple customization that will take you through the steps: adding a custom javascript file and changing the displayed title of your documents.

  1. Follow the steps above to create the config directory for your corpus. The following steps assume you left the config directory at its default location of /etc/blacklab/projectconfigs/ and that your corpus is called example. Use your custom paths if necessary.
  2. Copy the default search.xml into /etc/blacklab/projectconfigs/example/search.xml.
  3. Add a config option to include a custom script on the search page: <CustomJs page="search">${request:corpusPath}/static/js/custom.search.js</CustomJs>
  4. Create a matching javascript file /etc/blacklab/projectconfigs/example/static/js/custom.search.js
  5. Add the following snippet to your custom.search.js
    vuexModules.ui.getState().results.shared.getDocumentSummary = function(metadata, specialFields) {
      return 'This is everything we know about the document: ' + JSON.stringify(metadata);
    }
  6. Now restart your server and perform a search in your corpus, and see the new titles! http://localhost:8080/corpus-frontend/example/search/docs?patt="" NOTE: You don't need to restart the application after every change. Simply set cache=false in the main corpus-frontend.properties config file to disable caching of files by the server.
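
A slightly more readable summary than the raw JSON dump in step 5 might look like the sketch below. It rests on assumptions that you should verify against your own install: that metadata maps field names to arrays of strings, and that specialFields names the title and author fields as titleField and authorField. The function name summarize and the sample values are made up for illustration.

```javascript
// Assumptions (verify in your install): metadata maps field names to arrays
// of strings; specialFields.titleField / authorField name the relevant fields.
function summarize(metadata, specialFields) {
  // Return the first value of a metadata field, or undefined if it is missing.
  const first = (field) => {
    const values = field ? metadata[field] : undefined;
    return Array.isArray(values) && values.length > 0 ? values[0] : undefined;
  };
  const title = first(specialFields.titleField) || 'Untitled document';
  const author = first(specialFields.authorField);
  return author ? `${title} (${author})` : title;
}

// In custom.search.js it would be assigned the same way as in step 5:
// vuexModules.ui.getState().results.shared.getDocumentSummary = summarize;

console.log(
  summarize(
    { title: ['Example title'], author: ['Example author'] },
    { titleField: 'title', authorField: 'author' }
  )
); // prints "Example title (Example author)"
```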

Details about what customization/configuration file does what:

Blacklab's Index settings (index format)

The term format refers to the *.blf.yaml or *.blf.json file used to index data into the corpus.

Because the format specifies the shape of a corpus (which metadata and annotations a corpus contains, what type of data they hold, and how they are related), it made sense to let the format specify some things about how to display them in the interface.

NOTE: These properties need to be set before the first data is added to a corpus. Editing the format config file afterwards will not work (though if you know what you are doing, you can edit the indexmetadata.yaml or indexmetadata.json file by hand and perform the change that way).

Through the format you can:

Custom JS

Custom javascript files can be included on any page by adding them to search.xml

NOTE: by default, your script will run on every page! Not all functions shown below are available everywhere! It is highly recommended to use multiple scripts, and to include each of them on a single page only, by using <CustomJs page="..."/> (see search.xml). All javascript should run before $(document).ready unless otherwise stated.
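
If one script has to be shared between pages anyway, it can guard its page-specific code itself. The sketch below is one way to do that, assuming paths of the form /<context path>/<corpus>/<page>/... as in /corpus-frontend/example/search/docs; installs with a different context path depth would need a different index. The helpers pageFromPath and runOnPage are hypothetical and not part of the frontend.

```javascript
// Assumption: paths look like /<context path>/<corpus>/<page>/...
// `pageFromPath` and `runOnPage` are hypothetical helpers.
function pageFromPath(pathname) {
  const segments = pathname.split('/').filter(Boolean);
  return segments.length >= 3 ? segments[2] : null;
}

// Run the callback only when the given path belongs to the named page.
function runOnPage(page, pathname, callback) {
  if (pageFromPath(pathname) === page) callback();
}

// In the browser you would pass window.location.pathname instead.
const ranOn = [];
runOnPage('search', '/corpus-frontend/example/search/docs', () => ranOn.push('search'));
runOnPage('docs', '/corpus-frontend/example/search/docs', () => ranOn.push('docs'));
console.log(ranOn); // only the 'search' callback ran
```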

Through javascript you can do many things, but outlined below are some of the more interesting/useful features on the /search/ page:


The /docs/ page has other features that can be enabled. Enabling any of these will show a new Statistics tab next to the default Content and Metadata tabs.

Custom CSS

We have included a template SASS file here to allow you to customize your page's color theme easily. From there you can then add your own customizations on top.

Create a file with the following contents

// custom.scss

$base: hsl(351, 70%, 36%); // Defines the base color of the theme, this can be any css color
@import 'style-template.scss'; // the absolute or relative path to our template file

// Your own styles & overrides here ...

You now need to compile this file using the following steps:

You will now have a custom.css file you can include in your install through search.xml.

Development

Frontend Javascript

The app is primarily written in Vue.js. Outlined here is the /search/ page, as it contains the majority of the code.

Application structure

Entry points are the following files

Individual components are contained in the pages directory. These components are single-use and/or connected to the store in some way. The components directory contains a few "dumb" components that can be reused anywhere.

The Vuex store

We use Vuex to store the app state. Treat it as a central database (though it's not persisted between sessions). The Vuex store is made up of many modules that each handle a specific part of the state, such as the metadata filters, or the settings menu (page size, random seed).

The form directory contains most of the state to do with the top of the page, such as filters, query builder, explore view. The results directory handles the settings that directly update the results, such as which page is open, how results are grouped, etc.

A couple of modules have slightly different roles:

URL generation and parsing

The current page url is generated and updated in streams.ts. It contains a few things: a stream that listens to state changes in the Vuex store and is responsible for updating the page url, and a couple of streams that fetch some metadata about the currently selected/searched corpus (shown below the filters and at the top of the results panel).

Url parsing happens in the UrlStateParser. The url parsing is a little involved, because depending on whether a tagset is provided it can differ (the cql pattern is normally parsed and split up so we know what to place in the simple and extended views, but this needs to happen differently when a tagset is involved). Because of this, the store is first initialized (with empty values everywhere), then the url is parsed, after which the state is updated with the parsed values (see search.ts). When navigating back and forth through browser history, the url is not parsed, instead the state is attached to the history entry and read directly.
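
The two paths described above (parse the url on first load, read the stored state on history navigation) can be sketched as follows. This is a simplified illustration, not the code in streams.ts or UrlStateParser: the state holds only the patt parameter seen in the example search url, and the helper names are hypothetical.

```javascript
// Simplified, hypothetical state: only the `patt` query parameter is modelled.
const emptyState = () => ({ patt: null });

// Extract the parts of the state that are present in the url.
function parseUrl(url) {
  const params = new URL(url).searchParams;
  const parsed = {};
  if (params.has('patt')) parsed.patt = params.get('patt');
  return parsed;
}

// Initial page load: start from empty values, then apply what the url provides.
function initFromUrl(url) {
  return { ...emptyState(), ...parseUrl(url) };
}

// Back/forward navigation: use the state stored on the history entry, no parsing.
// In the browser this would be called with event.state from a popstate listener.
function restoreFromHistory(historyEntryState) {
  return { ...emptyState(), ...historyEntryState };
}

// Url generation. In the browser the result would be stored with
//   history.pushState(state, '', toUrl(base, state));
function toUrl(base, state) {
  const url = new URL(base);
  if (state.patt !== null) url.searchParams.set('patt', state.patt);
  return url.toString();
}

const searchBase = 'http://localhost:8080/corpus-frontend/example/search/docs';
const generatedUrl = toUrl(searchBase, { patt: '"example"' });
console.log(generatedUrl);
console.log(initFromUrl(generatedUrl));
console.log(restoreFromHistory({ patt: '"example"' }));
```

Attaching the state to the history entry means back and forward navigation does not depend on the url being parseable back into exactly the same state.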

Internationalization

The app is internationalized using vue-i18n. Please note that the app is only partially translatable right now; I18n is a work in progress. Contributions are welcome.

If you want to help add translation keys, look for e.g. {{ $t('search.simple.heading') }} in the code to see how it's done.

If you want to help translate the app to a new language, you can do so by adding a new language file in the src/frontend/src/locales directory. This is where the default translation files live. Copy one of the files (e.g. en.json) and name it for the new locale (e.g. fr.json for French). Then you can start translating the strings.

You can also override some default translations per corpus by creating a directory named locales in the static directory of the corpus' interface data dir (see the corporaInterfaceDataDir setting) and create a file with the same name as above (e.g. fr.json for French) with the desired overrides. The file should be read automatically by the app.
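
Conceptually, a per-corpus override file only needs to contain the keys it changes. The sketch below shows one plausible way such overrides could be layered over the defaults with a recursive merge. How the app actually combines the files is not described here, so treat mergeMessages as a hypothetical helper; the message texts and the search.extended.heading key are made-up placeholders.

```javascript
// Recursively overlay `overrides` on `defaults` without mutating either.
function mergeMessages(defaults, overrides) {
  const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);
  const result = { ...defaults };
  for (const [key, value] of Object.entries(overrides)) {
    const base = result[key];
    // Merge nested groups of keys; replace plain strings outright.
    result[key] = isPlainObject(base) && isPlainObject(value)
      ? mergeMessages(base, value)
      : value;
  }
  return result;
}

// Placeholder contents standing in for a default locale file ...
const defaultMessages = {
  search: {
    simple: { heading: 'Simple search' },
    extended: { heading: 'Extended search' },
  },
};
// ... and for a per-corpus override file with a single changed key.
const corpusOverrides = { search: { simple: { heading: 'Quick search' } } };

const messages = mergeMessages(defaultMessages, corpusOverrides);
console.log(messages.search.simple.heading); // "Quick search"
console.log(messages.search.extended.heading); // "Extended search"
```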

Development tips

Install the Vue devtools! (chrome, firefox).

You can compile and watch the javascript files using webpack, a javascript build tool. Execute npm run start in the src/frontend/ directory. This will start webpack-dev-server, which will serve the compiled files (the entry points) at localhost:8081/dist/. It also reloads your page automatically with the new changes whenever one of those files is loaded on the page and then changes.

Combining this with jspath in corpus-frontend.properties we can start the corpus-frontend as we normally would, but sideload our javascript from webpack-dev-server and get realtime compilation :)

# In corpus-frontend.properties. No trailing slash!
jspath=http://localhost:8081/dist

cd corpus-frontend/src/frontend/
npm run start

Note that the default port is 8080, but we changed it to 8081 because Tomcat already binds to 8080. To change the port, edit the scripts.start property in package.json.

Backend development

The backend is written in Java, and does comparatively little. Its most important tasks are serving the right javascript file and setting up a page skeleton (with Apache Velocity).

When a request comes in, the MainServlet fetches the relevant corpus data from BlackLab, reads the matching search.xml file, and determines which page to serve (the *Response classes). Together these render the header and footer and define some client-side variables (mainly urls to the corpus-frontend server and the BlackLab server). From there on, the rest happens client-side.

It also handles most of the document page, retrieving the xml and metadata and converting it to html.





If you have any further questions or experience any issues, please contact Jan Niestadt and/or Koen Mertens.

Like BlackLab, this corpus frontend is licensed under the Apache License 2.0.