Bicleaner AI (`bicleaner-ai-classify`) is a Python tool that detects noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near 1) or not (with a value near 0). Sentence pairs considered very noisy are scored with 0.
Although a training tool (`bicleaner-ai-train`) is provided, you may want to use the available ready-to-use language packages. Use `bicleaner-ai-download` to download the latest language packages, or visit the GitHub releases for lite models and the Hugging Face Hub for full models (since v2.0).
Visit our docs for a detailed example on Bicleaner training.
If you find Bicleaner AI useful, please consider citing us.
New improved multilingual models for zero-shot classification.
Bicleaner AI is written in Python and can be installed using `pip`. It also requires the KenLM Python bindings with support for 7-gram language models. Hardrules uses FastSpell, which requires `cyhunspell` to be installed manually.

You can install all the requirements by running the following commands:

```
pip install bicleaner-ai git+https://github.com/MSeal/cython_hunspell@2.0.3
pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip
```
After installation, three binaries (`bicleaner-ai-train`, `bicleaner-ai-classify`, `bicleaner-ai-download`) will be located in your `python/installation/prefix/bin` directory. This is usually `$HOME/.local/bin` or `/usr/local/bin/`.
TensorFlow 2 will be installed as a dependency, and GPU support is required for training. `pip` will install the latest supported TensorFlow version, but older versions (`>=2.6.5`) are also supported and can be installed if your machine does not meet the TensorFlow CUDA requirements. See this table for CUDA and TensorFlow version compatibility. If you want a different TensorFlow version, you can downgrade using:

```
pip install tensorflow==2.6.5
```
TensorFlow logging messages are suppressed by default. If you want to see them, you have to explicitly set the `TF_CPP_MIN_LOG_LEVEL` environment variable. For example:

```
TF_CPP_MIN_LOG_LEVEL=0 bicleaner-ai-classify
```
WARNING: If you are experiencing slowdowns because Bicleaner AI is not running on the GPU, check those logs to see whether TensorFlow is loading all the libraries correctly.
For Serbo-Croatian languages, models work better with transliteration. To be able to score transliterated text, install the optional dependency:

```
pip install bicleaner-ai[transliterate]
```

Note that this won't transliterate the output text; transliteration is used only for scoring.
`bicleaner-ai-classify` aims at detecting noisy sentence pairs in a parallel corpus. It indicates the likelihood of a pair of sentences being mutual translations (with a value near 1) or not (with a value near 0). Sentence pairs considered very noisy are scored with 0.
By default, the input file (the parallel corpus to be classified) is expected to contain at least four columns (source URL, target URL, source sentence, and target sentence), but the source and target sentence column indexes can be customized with the `--scol` and `--tcol` flags. URLs are not mandatory.
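As a sketch of that layout, here is how a minimal four-column input file could be written (the URLs and sentences are illustrative placeholders, not required values):

```python
# Build a minimal tab-separated input file for bicleaner-ai-classify.
# Columns: source URL, target URL, source sentence, target sentence
# (the URL columns are not mandatory and may be left empty).
rows = [
    ("http://example.com/en", "http://example.com/fr",
     "Hello world.", "Bonjour le monde."),
]
with open("corpus.en-fr.tsv", "w", encoding="utf-8") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")
```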
The generated output file will contain the same lines and columns as the original input file, plus an extra column containing the Bicleaner AI classifier score.
Bicleaner AI has two types of models: full and lite. Full models are recommended, as they provide much higher quality. If speed is a hard constraint for you, lite models may be an option (take a look at the speed comparison).
See available full models here and available lite models here.
You can download a full model with:

```
bicleaner-ai-download en fr full
```

This will download the `bitextor/bicleaner-ai-full-en-fr` model from Hugging Face and store it in the cache directory.
Or you can download a lite model with:

```
bicleaner-ai-download en fr lite ./bicleaner-models
```

This will download and store the en-fr lite model at `./bicleaner-models/en-fr`.
Since version 2.3.0, full models also accept a local download path instead of the HF cache directory. In that case, to use the model, provide the local path instead of the HF identifier. To learn more about how the HF cache works, please read the official documentation.
To classify a tab-separated file containing English sentences in the first column and French sentences in the second column, use:

```
bicleaner-ai-classify \
    --scol 1 --tcol 2 \
    corpus.en-fr.tsv \
    corpus.en-fr.classified.tsv \
    bitextor/bicleaner-ai-full-en-fr
```
where `--scol` and `--tcol` indicate the locations of the source and target sentences, `corpus.en-fr.tsv` is the input file, `corpus.en-fr.classified.tsv` is the output file, and `bitextor/bicleaner-ai-full-en-fr` is the Hugging Face model name.
Each line of the new file will contain the same content as the input file, adding a column with the score given by the Bicleaner AI classifier.
Note that, to use a lite model, you need to provide a model path in your local file system instead of a Hugging Face model name.
Multilingual full models are also available. They can potentially work with any language that XLM-R supports (currently only paired with English). For a further explanation of how to train a multilingual model and how our models perform, take a look here and here.
WARNING: multilingual models will disable hardrules that expect a language parameter. You can, however, override the language code in the model configuration with the `-s`/`--source_lang` or `-t`/`--target_lang` options during classification. For example, when scoring English-Icelandic data, use:
```
bicleaner-ai-classify \
    --scol 1 --tcol 2 \
    -t is \
    corpus.en-is.tsv \
    corpus.en-is.classified.tsv \
    bitextor/bicleaner-ai-full-en-xx
```
Bicleaner AI provides a command-line tool to train your own model, in case available models do not fit your needs. Please go to our training documentation for a quick start and further details.
The `--processes` option for setting the maximum number of threads/processes used during training or classification is no longer available. Instead, set the `BICLEANER_AI_THREADS` environment variable to the desired value. For example:

```
BICLEANER_AI_THREADS=12 bicleaner-ai-classify ...
```
If the variable is not set, the program will use all the available CPU cores.
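That fallback behavior can be mimicked with a few lines of Python (a sketch of the documented behavior, not the tool's actual code):

```python
import os

# Resolve the thread count the way the docs describe: use
# BICLEANER_AI_THREADS if it is set, otherwise all available CPU cores.
def resolve_threads():
    value = os.environ.get("BICLEANER_AI_THREADS")
    if value is not None:
        return int(value)
    return os.cpu_count()

os.environ["BICLEANER_AI_THREADS"] = "12"
print(resolve_threads())  # → 12
```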
A comparison of speed, in sentences per second, between the model types on different hardware:

| model | speed CPUx1 | speed GPUx1 |
|---|---|---|
| full | 1.78 rows/sec | 200 rows/sec |
| lite | 600 rows/sec | 10,000 rows/sec |
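These figures make it easy to estimate wall-clock time for a given corpus size; for example, using the numbers from the table above:

```python
# Estimated classification time for a 10-million-pair corpus,
# using the rows/sec figures from the speed comparison table.
speeds = {
    ("full", "cpu"): 1.78,
    ("full", "gpu"): 200,
    ("lite", "cpu"): 600,
    ("lite", "gpu"): 10_000,
}
corpus_size = 10_000_000
hours = corpus_size / speeds[("lite", "gpu")] / 3600
print(round(hours, 2))  # → 0.28
```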
J. Zaragoza-Bernabeu, G. Ramírez-Sánchez, M. Bañón, S. Ortiz-Rojas, "Bicleaner AI: Bicleaner Goes Neural", in Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, June 2022.
```
@inproceedings{zaragoza-bernabeu-etal-2022-bicleaner,
    title = "Bicleaner {AI}: Bicleaner Goes Neural",
    author = "Zaragoza-Bernabeu, Jaume and
      Ram{\'\i}rez-S{\'a}nchez, Gema and
      Ba{\~n}{\'o}n, Marta and
      Ortiz Rojas, Sergio",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.87",
    pages = "824--831",
    abstract = "This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora. The tool, which now implements a new neural classifier, uses state-of-the-art techniques based on pre-trained transformer-based language models fine-tuned on a binary classification task. After that, parallel corpus filtering is performed, discarding the sentences that have lower probability of being mutual translations. Our experiments, based on the training of neural machine translation (NMT) with corpora filtered using Bicleaner AI for two different scenarios, show significant improvements in translation quality compared to the previous version of the tool which implemented a classifier based on Extremely Randomized Trees.",
}
```
All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.