averkij / lingtrain-aligner-editor

Extracts parallel corpora from two raw texts in different languages.

Lingtrain Aligner. ML-powered application for extracting parallel corpora.

Introduction

Lingtrain Aligner is a tool for extracting parallel corpora from texts in different languages.

Parallel corpus example

Models

The automated alignment process relies on sentence embedding models. Embeddings are multidimensional vectors that are used to calculate the distance between sentences. You can also plug in your own model using the interface described in the models directory. The list of supported languages depends on the selected backend model.
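To make the idea of a distance between sentences concrete, here is a minimal sketch that compares vectors by cosine similarity. It assumes cosine similarity as the measure (this README only says "distance"), and the three-dimensional vectors are made-up toy values, not the output of any model shipped with the app. The function and variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values for illustration only).
cat_en = [0.9, 0.1, 0.2]      # an English sentence
cat_ru = [0.8, 0.2, 0.1]      # its translation
weather_ru = [0.1, 0.9, 0.3]  # an unrelated sentence

sim_match = cosine_similarity(cat_en, cat_ru)
sim_mismatch = cosine_similarity(cat_en, weather_ru)
print(f"matching pair:    {sim_match:.3f}")
print(f"mismatching pair: {sim_mismatch:.3f}")
```

A translation pair should score noticeably higher than an unrelated pair; real sentence embeddings have hundreds of dimensions, but the comparison works the same way.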

Credits

The project was supported by the Center for Academic Development of Students within the framework of the Competition of initiative collective research projects of students of the National Research University "Higher School of Economics".

Demo

For a quick overview of the alignment process and the main functionality, you can watch the demo that was held at the AINL Conference.

How-to

The alignment process is pretty straightforward. Once the app is up and running, follow the steps below. To start the app locally, see the Running on local machine section.

1. Upload raw texts

Upload

2. Check the split documents

Split

3. Align documents

Visualization

4. Check the result and edit if needed

Edit

5. Set the quality threshold

Threshold

6. Download the corpora

Download
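The quality threshold in step 5 can be pictured as a simple filter over scored sentence pairs. The sketch below is an illustration only: the `AlignedPair` type, the score range of 0 to 1, and the "keep if score >= threshold" rule are assumptions, not the app's actual data model or logic.

```python
from typing import List, NamedTuple


class AlignedPair(NamedTuple):
    source: str
    target: str
    score: float  # assumed similarity in [0, 1]; higher means a closer match


def filter_by_threshold(pairs: List[AlignedPair], threshold: float) -> List[AlignedPair]:
    """Keep only the pairs whose score reaches the threshold."""
    return [p for p in pairs if p.score >= threshold]


# Made-up pairs and scores for illustration.
pairs = [
    AlignedPair("The cat sleeps.", "Кошка спит.", 0.93),
    AlignedPair("It is raining.", "Идёт дождь.", 0.88),
    AlignedPair("Chapter 1", "Кошка спит.", 0.21),
]

kept = filter_by_threshold(pairs, threshold=0.5)
for p in kept:
    print(f"{p.score:.2f}\t{p.source}\t{p.target}")
```

Raising the threshold trades corpus size for quality: fewer pairs survive, but the ones that do are more likely to be true translations.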

Running on local machine

You can run the application on your computer using docker.

  1. Make sure that docker is installed by typing the docker version command in your console.

  2. Images configured to run locally are available on Docker Hub.

  3. Run the following commands in your console:
    • docker pull lingtrain/aligner:st
    • docker run -p 80:80 lingtrain/aligner:st

  4. The app will be available in your browser at the localhost address.

Deployment

You can deploy and run the app on your server using docker.

Prepare the image

On your local machine.

  1. Clone the repo.
  2. Edit the following line in ./fe/src/common/config.js file.
  3. Build the app image. Run in the root folder of the repo:
    • docker build . -t aligner:v1
    • where aligner:v1 is a tag (a kind of image name).
  4. Now you have your image stored locally. You need to push it to Docker Hub.
    • Create an account on Docker Hub. It's a free and publicly available docker registry.
    • Log in to your account
      • docker login
    • Tag the image that you've built
      • docker tag aligner:v1 my_docker_hub_account/aligner:v1
    • Push the image to registry
      • docker push my_docker_hub_account/aligner:v1
    • After a while your image will be uploaded and can be used for deployment.

Deploy it

On your server.

  1. Make sure that docker is installed by typing the docker version command in your console.
  2. Make directories for storing the app results.
    • mkdir /opt/data /opt/img
  3. Pull the prepared image
    • docker pull my_docker_hub_account/aligner:v1
    • Wait for the download to finish. After that you will have the image stored locally.
  4. Start the app
    • docker run -v /opt/data:/app/data -v /opt/img:/app/static/img -p [PORT]:80 my_docker_hub_account/aligner:v1
    • where /opt/data and /opt/img are folders on your server
    • and /app/data and /app/static/img are folders inside the container. Don't change them.
    • [PORT] is the port that you have configured while building the image.
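If you script the deployment, the start command above can be assembled from its parts, which also makes the host-to-container mappings explicit. This is a minimal sketch: `build_run_command` is a hypothetical helper, and port 8080 is only an example value standing in for [PORT].

```python
import shlex


def build_run_command(image, port, data_dir="/opt/data", img_dir="/opt/img"):
    """Return the `docker run` argument list used in the deployment step."""
    return [
        "docker", "run",
        "-v", f"{data_dir}:/app/data",      # host folder : container folder
        "-v", f"{img_dir}:/app/static/img",  # container paths must stay as-is
        "-p", f"{port}:80",                  # host port : container port
        image,
    ]


# Example values (assumptions): replace the account name and port with yours.
cmd = build_run_command("my_docker_hub_account/aligner:v1", port=8080)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where docker is installed.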

Running in development mode

Backend

A Flask/uWSGI backend that provides the REST API. It's pretty simple and contains all the alignment logic.

python main.py

Frontend

A single-page application built with Vue, Vuex, and Vuetify. It provides a UI for managing the alignment process through the backend and a tool for translators to edit the documents being processed.

Setup

npm install

Compile and run with hot-reloads for development

npm run serve

License

Shield: CC BY-NC-SA 4.0

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
