TICCLAT / ticclat

Python package used in the TICCLAT project
https://www.esciencecenter.nl/project/ticclat
Apache License 2.0

################################################################################
ticclat
################################################################################

TICCLAT is a tool for text-induced corpus correction and lexical assessment.

.. image:: https://travis-ci.org/TICCLAT/ticclat.svg?branch=master
    :target: https://travis-ci.org/TICCLAT/ticclat

.. image:: https://readthedocs.org/projects/ticclat/badge/?version=latest
    :target: https://ticclat.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://codecov.io/gh/TICCLAT/ticclat/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/TICCLAT/ticclat

.. image:: https://api.codacy.com/project/badge/Grade/df3276b470274a088819d2af9c0faaee
    :target: https://www.codacy.com/manual/egpbos/ticclat?utm_source=github.com&utm_medium=referral&utm_content=TICCLAT/ticclat&utm_campaign=Badge_Grade

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3520462.svg
    :target: https://doi.org/10.5281/zenodo.3520462

Installation
------------

To install ticclat, do:

.. code-block:: console

  git clone https://github.com/ticclat/ticclat.git
  cd ticclat
  pip install .

Run tests (including coverage) with:

.. code-block:: console

  python setup.py test

Setup MySQL
-----------

Server security
~~~~~~~~~~~~~~~

Run sudo mysql_secure_installation with the following choices:

To allow login as any user with the root password set above, you have to switch the authentication plugin for root to ``mysql_native_password``. To see which plugin root currently uses, run:

.. code-block:: mysql

  SELECT plugin FROM mysql.user WHERE User = 'root';

If it is ``auth_socket`` (the default on Ubuntu), you can only log in as root when running ``mysql`` as the Unix root user, e.g. by running it with ``sudo``. To change it to ``mysql_native_password``, start ``mysql -u root`` and run

.. code-block:: mysql

  UPDATE mysql.user SET plugin = 'mysql_native_password' WHERE User = 'root';
  FLUSH PRIVILEGES;  -- reload the grant tables so the direct UPDATE takes effect

To make this authentication plugin the default, add the following to ``/etc/my.cnf`` (or another ``my.cnf`` location; run ``mysqladmin --help`` to see the locations that mysqld looks for):

.. code-block:: console

  [mysqld]
  default-authentication-plugin = mysql_native_password

Other settings
~~~~~~~~~~~~~~

To run the ingestion script (e.g. the elex lexicon ingestion), the maximum packet size (``max_allowed_packet``) has to be high enough. We set it to roughly 41943040 bytes (4194304 was not enough) by adding the following lines to ``/etc/my.cnf``:

.. code-block:: console

  [mysqld]
  max_allowed_packet = 42M
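
MySQL size options accept the suffixes K, M and G as powers of 1024. As a quick check of the arithmetic, the following sketch converts such a value to bytes and confirms that ``42M`` is at least the 41943040 bytes mentioned above. The helper name ``mysql_size_to_bytes`` is made up for this illustration and is not part of ticclat.

```python
def mysql_size_to_bytes(value):
    """Convert a MySQL-style size such as '42M' to a number of bytes."""
    # K, M and G are powers of 1024 in MySQL option values.
    multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = value.strip().upper()
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)  # plain number of bytes, no suffix


if __name__ == "__main__":
    configured = mysql_size_to_bytes("42M")
    print(configured)
    print(configured >= 41943040)
```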

To allow for loading CSV files (this is the fastest way of inserting big bulks of records), add:

.. code-block:: console

  [mysqld]
  local_infile=ON

This allows you to run queries like this:

.. code-block:: sql

  LOAD DATA LOCAL INFILE '/file.csv' INTO TABLE test
  FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\';

This loads the file ``/file.csv`` from the client and sends it to the server, which inserts it into table ``test``. See the MySQL Load Data documentation.
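
A CSV file that matches the ``FIELDS`` clause above can be produced with Python's ``csv`` module. This is only a sketch: the function name ``write_mysql_csv`` and the example rows are assumptions for illustration, not part of ticclat.

```python
import csv
import io


def write_mysql_csv(rows, handle):
    # Mirrors: FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY a backslash.
    writer = csv.writer(
        handle,
        delimiter=",",
        quotechar='"',
        escapechar="\\",
        doublequote=False,     # write an embedded quote as \" instead of ""
        quoting=csv.QUOTE_ALL,  # enclose every field in quotes
        lineterminator="\n",    # MySQL's default LINES TERMINATED BY
    )
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_mysql_csv([(1, 'say "hi"'), (2, "a,b")], buffer)
    print(buffer.getvalue(), end="")
```

Writing to a real file instead of ``io.StringIO`` works the same way; open it with ``newline=""`` so that the ``csv`` module controls the line endings.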

To allow for saving CSV files, add:

.. code-block:: console

  [mysqld]
  secure_file_priv=/data/tmp/mysql

Also add the following to ``/etc/apparmor.d/usr.sbin.mysqld`` and reload AppArmor afterwards with ``sudo systemctl reload apparmor``:

.. code-block:: console

  # Allow /data/tmp/mysql access
  /data/tmp/mysql/ rw,
  /data/tmp/mysql/** rw,

Make sure the directory /data/tmp/mysql exists and is writable by the mysql user.
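
The settings from this section can be checked with a short script. The sketch below parses ``my.cnf``-style text with Python's ``configparser`` and reports which options are still missing. The function name ``missing_mysqld_options`` and the sample text are illustrative assumptions; lines starting with ``!`` (such as ``!includedir``) are skipped because ``configparser`` cannot read them.

```python
import configparser

# Options named in this section, all expected under [mysqld].
EXPECTED = ("max_allowed_packet", "local_infile", "secure_file_priv")


def missing_mysqld_options(cnf_text, expected=EXPECTED):
    """Return the expected [mysqld] options that are absent from cnf_text."""
    # Drop '!include' / '!includedir' directives, which are not INI syntax.
    lines = [ln for ln in cnf_text.splitlines() if not ln.lstrip().startswith("!")]
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string("\n".join(lines))
    if not parser.has_section("mysqld"):
        return list(expected)
    # MySQL accepts '-' and '_' interchangeably in option names.
    present = {name.replace("-", "_") for name in parser.options("mysqld")}
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    sample = "\n".join([
        "[mysqld]",
        "max_allowed_packet = 42M",
        "local-infile = ON",
        "!includedir /etc/mysql/conf.d/",
    ])
    print(missing_mysqld_options(sample))
```

The script only inspects the text it is given; it does not follow include directives or query the running server.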

Ubuntu
~~~~~~

On Ubuntu 18.04, the default mysqld settings in ``/etc/mysql/mysql.conf.d/mysqld.cnf`` set the socket to a non-standard location, which breaks the default values used by MySQLdb. Change it to ``/tmp/mysql.sock`` if you get ``OperationalError: 2006 ...`` when running ticclat tasks such as ingesting corpora or lexica.

Changes to the Database Schema

Important note: the Alembic scripts were removed. Use the most recent database dumps to get the newest version of the database.

To apply changes to the database schema, we use alembic.

Alembic is configured to read the information needed to connect to the database from the environment variable ``DATABASE_URL``.
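
Before running a migration it can help to verify that ``DATABASE_URL`` is set and well-formed. The following sketch uses only the standard library. ``describe_database_url`` is a hypothetical helper, and the fallback URL contains made-up credentials.

```python
import os
from urllib.parse import parse_qs, urlsplit


def describe_database_url(url):
    """Split a database URL into its parts (the password is left out)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "options": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


if __name__ == "__main__":
    # Fall back to a made-up example URL when the variable is not set.
    url = os.environ.get(
        "DATABASE_URL",
        "mysql://user:secret@localhost:3306/ticclat?charset=utf8mb4&local_infile=1",
    )
    print(describe_database_url(url))
```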

To migrate the database to the latest database schema run:

.. code-block:: console

  alembic upgrade head

Important note: if you are creating the database from scratch, do not use the alembic database migrations. Instead, use SQLAlchemy to create a complete new instance of the database.

Data ingestion
--------------

The ticclat package contains scripts for ingesting data into the database. To run the scripts, create a ``.env`` file as described under `Setup virtual environment`_. In the directory where the ``.env`` file is located, start ``python`` and run:

.. code-block:: python

  from ticclat import ingest
  ingest.run()

You can configure ``run()`` by providing arguments:

The following sources can be ingested (and added to the include and exclude lists):

Flask web app
-------------

Preparation
~~~~~~~~~~~

Starting from Ubuntu (18.04), set up the MySQL database. Then clone this repository and install the dependencies: conda (e.g. Miniconda, https://docs.conda.io/en/latest/miniconda.html), ``libmysqlclient-dev`` and ``build-essential`` (``apt-get update && apt-get install -y libmysqlclient-dev build-essential``).

Setup virtual environment

.. code-block:: console

  conda create --name ticclat-web
  conda activate ticclat-web
  conda install pip

From the ticclat directory, install the package:

.. code-block:: console

  pip install -e .

Create a .env file with the following:

.. code-block:: console

  DATABASE_URL=mysql://[user]:[pass]@[host]:[port]/[db_name]?charset=utf8mb4&local_infile=1

  FLASK_APP=ticclat.flask_app.py
  FLASK_ENV=production
  FLASK_DEBUG=0

  # for DEV:
  FLASK_ENV=development
  FLASK_DEBUG=1
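
A quick sanity check of the ``.env`` file can catch missing variables before the server starts. The sketch below parses ``KEY=VALUE`` lines by hand and does not use any dotenv library. The names ``read_env_text`` and ``REQUIRED`` are invented for this example, and the choice of required variables is an assumption.

```python
# Variables assumed to be required for this example.
REQUIRED = ("DATABASE_URL", "FLASK_APP")


def read_env_text(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are ignored."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()  # later entries override earlier ones
    return values


if __name__ == "__main__":
    example = "FLASK_APP=ticclat.flask_app.py\nFLASK_ENV=production\nFLASK_DEBUG=0\n"
    env = read_env_text(example)
    missing = [name for name in REQUIRED if name not in env]
    print("missing:", missing)
```

To check a real file, pass the contents of ``.env`` (for example via ``open(".env").read()``) to ``read_env_text``.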

You can now run a development server using ``flask run``.

Or a production server:

.. code-block:: console

  export $(cat .env | xargs)
  gunicorn ticclat.flask_app.wsgi:app --bind 0.0.0.0:8000 --max-requests 100 --workers 2 --timeout 30

where the last three options may not be necessary, but can be tweaked for stability and performance.

Debugger
--------

If the debugger in e.g. PyCharm isn't working correctly, this might be because test coverage is enabled. Disable it temporarily by commenting out the ``addopts`` line in ``setup.cfg``:

.. code-block::

  [tool:pytest]
  testpaths = tests
  # addopts = --cov --cov-report xml --cov-report term --cov-report html

Documentation
-------------

.. _README:

The full documentation is available at https://ticclat.readthedocs.io/en/latest/.

Contributing
------------

If you want to contribute to the development of ticclat, have a look at the `contribution guidelines <CONTRIBUTING.rst>`_.

License
-------

Copyright (c) 2019, Netherlands eScience Center and Meertens Instituut

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Credits
-------

This package was created with `Cookiecutter <https://github.com/audreyr/cookiecutter>`_ and the `NLeSC/python-template <https://github.com/NLeSC/python-template>`_.