Tool to fix bitexts and tag near-duplicates for removal.
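By default Bifixer applies a set of fixes. Each one can be deactivated or tuned with the flags below:
--ignore_characters: skips fixing mojibake, orthography and other character issues
--ignore_normalization: skips normalization
--ignore_html: skips removing HTML tags
--ignore_orthography: skips orthography fixing
--ignore_empty: keeps sentences with empty source or target
--ignore_detokenization: skips fixing common tokenization issues
--ignore_duplicates: skips computing the hashes of parallel sentences
--aggressive_dedup flag: treats similar sentences as duplicates (marking them with the same hash)
--segmenter: segmenter module (default is NLTK)
--words_before_segmenting: maximum number of words allowed in one side of a parallel sentence before trying to segment it. Set it to 1 to try to segment all sentences.
--ignore_segmentation: skips changing the segmentation of long sentences
For monolingual text, use monofixer.py instead.
If you find Bifixer useful, please consider citing the following paper: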
Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón and Sergio Ortiz Rojas. "Bifixer and Bicleaner: two open-source tools to clean your parallel data." In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation. Lisboa, Portugal: European Association for Machine Translation, November 2020.
@InProceedings{prompsit:2020:EAMT,
author = {Gema Ram\'{i}rez-S\'{a}nchez and Jaume Zaragoza-Bernabeu and Marta Ba{\~n}\'{o}n and Sergio Ortiz-Rojas},
title = {Bifixer and Bicleaner: two open-source tools to clean your parallel data.},
booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation},
pages = {291--298},
isbn = {978-989-33-0589-8},
year = {2020},
month = {November},
address = {Lisboa, Portugal},
publisher = {European Association for Machine Translation}
}
Install from source:
git clone https://github.com/bitextor/bifixer
cd bifixer
pip install .
Automatic tests are included to check that everything works as expected in Bifixer:
cd bifixer
pytest
Or install without manually downloading the repo:
pip install "bifixer @ git+https://github.com/bitextor/bifixer.git"
Or even easier, install directly from PyPI:
pip install bifixer
Also, you can install the conda package:
conda install -c bitextor bifixer
After installing, two executables (bifixer and monofixer) will be available to be run.
Please note that, in order to use the optional loomchild segmenter module in Java, it has to be specified as an optional dependency during installation:
pip install bifixer[loomchild]
In case you are not using Java 8 as default, download it and overwrite the 'JAVA_HOME' variable before installing, for example:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
usage: bifixer.py [-h] [--header] [--scol SCOL] [--tcol TCOL]
[--sdeferredcol SDEFERREDCOL] [--tdeferredcol TDEFERREDCOL]
[--ignore_characters] [--ignore_empty] [--ignore_long]
[--ignore_orthography] [--ignore_detokenization]
[--ignore_duplicates] [--aggressive_dedup]
[--ignore_segmentation] [--ignore_html]
[--words_before_segmenting WORDS_BEFORE_SEGMENTING]
[--segmenter {nltk,loomchild}] [--annotated_output] [--tmp_dir TMP_DIR] [-q]
[--debug] [--logfile LOGFILE] [-v]
input output srclang trglang
positional arguments:
input Tab-separated files to be bifixed
output Fixed corpus
srclang Source language (SL) of the input
trglang Target language (TL) of the input
optional arguments:
-h, --help show this help message and exit
Optional:
--header Input file will have header (default: False)
--scol SCOL Source sentence column (starting in 1). The name of
the field is expected instead of the position if
--header is set (default: 3)
--tcol TCOL Target sentence column (starting in 1). The name of
the field is expected instead of the position if
--header is set (default: 4)
--sdeferredcol SDEFERREDCOL
Source deferred standoff annotation column (starting
in 1). The name of the field is expected instead of
the position if --header is set (default: None)
--tdeferredcol TDEFERREDCOL
Target deferred standoff annotation column (starting
in 1). The name of the field is expected instead of
the position if --header is set (default: None)
--ignore_characters Doesn't fix mojibake, orthography, or other character
issues (default: False)
--ignore_empty Doesn't remove sentences with empty source or target
(default: False)
--ignore_long Doesn't ignore too long sentences (default: False)
--ignore_orthography Doesn't apply orthography fixing (default: False)
--ignore_html Doesn't remove HTML tags (default: False)
--ignore_detokenization
Doesn't fix common tokenization issues (default:
False)
--ignore_duplicates Doesn't obtain the hashes of parallel sentences
(default: False)
--aggressive_dedup Treats similar sentences as duplicates (marking them
with the same hash) (default: False)
--ignore_segmentation
Doesn't change segmentation of long sentences
(default: False)
--words_before_segmenting WORDS_BEFORE_SEGMENTING
Max words allowed in one side of a parallel sentence
before trying to segmentate it. Set to 0 to applicate
segmentation on everything. (default: 15)
--segmenter {nltk,loomchild}
Segmenter module. (default: nltk)
--annotated_output Adds an extra column indicating if the sentence pair was modified
('Yes' if it was modified, otherwise 'No') (default: False)
--tmp_dir TMP_DIR Temporary directory where creating the temporary files
of this program (default: /tmp)
Logging:
-q, --quiet Silent logging mode (default: False)
--debug Debug logging mode (default: False)
--logfile LOGFILE Store log to a file (default: <_io.TextIOWrapper
name='<stderr>' mode='w' encoding='UTF-8'>)
-v, --version show version of this script and exit
--scol SCOL: source sentence column (starting in 1). If --header is set, the expected value will be the name of the field. Default: 3 if --header is not set else src_text
--tcol TCOL: target sentence column (starting in 1). If --header is set, the expected value will be the name of the field. Default: 4 if --header is not set else trg_text
--segmenter: segmenter module (nltk or loomchild). Default: nltk
python3.7 bifixer/monofixer.py --help
usage: monofixer.py [-h]
[--scol SCOL] [--sdeferredcol SDEFERREDCOL]
[--ignore_characters] [--ignore_long]
[--ignore_orthography] [--ignore_detokenization]
[--ignore_duplicates] [--aggressive_dedup]
[--ignore_segmentation] [--ignore_html]
[--words_before_segmenting WORDS_BEFORE_SEGMENTING]
[--segmenter {nltk,loomchild}] [--annotated_output] [--tmp_dir TMP_DIR] [-q]
[--debug] [--logfile LOGFILE] [-v]
input output lang
positional arguments:
input Tab-separated file to be fixed
output Fixed corpus
lang Language of the input
optional arguments:
-h, --help show this help message and exit
Optional:
--header Input file will have header (default: False)
--scol SCOL Sentence column (starting in 1). The name of the
field is expected instead of the position if --header
is set (default: 2)
--sdeferredcol SDEFERREDCOL
Source deferred standoff annotation column (starting
in 1). The name of the field is expected instead of
the position if --header is set (default: None)
--ignore_characters Doesn't fix mojibake, orthography, or other character
issues (default: False)
--ignore_long Doesn't ignore too long sentences (default: False)
--ignore_orthography Doesn't apply orthography fixing (default: False)
--ignore_detokenization
Doesn't fix common tokenization issues (default:
False)
--ignore_html Doesn't remove HTML tags (default: False)
--ignore_duplicates Doesn't obtain the hashes of sentences (default:
False)
--aggressive_dedup Treats similar sentences as duplicates (marking them
with the same hash) (default: False)
--ignore_segmentation
Doesn't change segmentation of long sentences
(default: False)
--words_before_segmenting WORDS_BEFORE_SEGMENTING
Max words allowed in a parallel sentence before trying
to segmentate it. Set to 0 to applicate segmentation
on everything. (default: 15)
--segmenter {nltk,loomchild}
Segmenter module. (default: nltk)
--annotated_output Adds an extra column indicating if the sentence was
modified ('Yes' if it was modified, otherwise 'No')
(default: False)
--tmp_dir TMP_DIR Temporary directory where creating the temporary files
of this program (default: /tmp)
Logging:
-q, --quiet Silent logging mode (default: False)
--debug Debug logging mode (default: False)
--logfile LOGFILE Store log to a file (default: <_io.TextIOWrapper
name='<stderr>' mode='w' encoding='UTF-8'>)
-v, --version show version of this script and exit
--scol SCOL: sentence column (starting in 1). If --header is set, the expected value will be the name of the field. Default: 2 if --header is not set else src_text
--segmenter: segmenter module (nltk or loomchild). Default: nltk
bifixer input-corpus.en-es output-corpus.en-es en es
bifixer can be parallelized by using your favourite method (for example, GNU parallel).
Suggested usage:
cat input-corpus.en-es \
| parallel -j 25 --pipe -k -l 30000 bifixer -q - - en es \
> output-corpus.en-es
where the two '-' arguments mean read from stdin and write to stdout, and -q tells bifixer to be quiet, which avoids logging a large number of informational messages.
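The same stdin/stdout convention can also be used from a script. The following is a minimal sketch, not part of Bifixer: the helper names bifixer_command and fix_lines are hypothetical, and it only uses the arguments shown in the suggested usage above (-q, the two '-' placeholders, and the language codes). It assumes the bifixer executable is on the PATH.

```python
import shutil
import subprocess


def bifixer_command(srclang, trglang, quiet=True):
    """Build the argument list for a bifixer run that reads stdin and writes stdout.

    Hypothetical helper: mirrors `bifixer -q - - en es` from the suggested usage.
    """
    cmd = ["bifixer"]
    if quiet:
        cmd.append("-q")
    # The two "-" arguments stand for "read from stdin" and "write to stdout".
    cmd += ["-", "-", srclang, trglang]
    return cmd


def fix_lines(lines, srclang, trglang):
    """Pipe tab-separated lines through bifixer and return the output lines.

    Hypothetical helper: assumes `bifixer` is installed and on the PATH.
    """
    if shutil.which("bifixer") is None:
        raise RuntimeError("bifixer executable not found on PATH")
    result = subprocess.run(
        bifixer_command(srclang, trglang),
        input="".join(lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines(keepends=True)
```

This sketch runs a single process. Splitting the input into blocks and preserving their order, as the GNU parallel command does, is left to the caller.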
In order to ease the later removal of duplicated or near-duplicated parallel sentences, Bifixer appends two new fields to each parallel sentence: hash and ranking.
The hash is obtained by using the XXHash algorithm, applied after fixing source and target sentences (fixed_source+"\t"+fixed_target). Sentences that are identical at this step (see example below) will get the same hash.
When using the --aggressive_dedup feature, fixed parallel sentences are also normalized (ignoring casing, accents and diacritics) before their hash is computed. As a result, sentences that are near-duplicates (i.e. they only differ in casing or accents) will also get the same hash. Normalization is only used internally: the output sentences will not be normalized after Bifixer is applied.
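To illustrate the idea, here is a minimal sketch of hashing with and without normalization. It is not Bifixer's code: the function names normalize and pair_hash are hypothetical, the normalization (case folding plus removal of combining marks) is one assumed way of ignoring casing, accents and diacritics, and BLAKE2b from the standard library is swapped in for XXHash so the example has no external dependencies.

```python
import hashlib
import unicodedata


def normalize(text):
    """Ignore casing, accents and diacritics (assumed normalization, for illustration)."""
    # Case-fold, then decompose so accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Drop the combining characters, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pair_hash(fixed_source, fixed_target, aggressive=False):
    """Hash a fixed sentence pair, optionally normalizing both sides first."""
    if aggressive:
        fixed_source = normalize(fixed_source)
        fixed_target = normalize(fixed_target)
    key = fixed_source + "\t" + fixed_target
    # Stand-in hash: BLAKE2b with an 8-byte digest. Bifixer itself uses XXHash.
    return hashlib.blake2b(key.encode("utf-8"), digest_size=8).hexdigest()
```

With aggressive=True, pairs that differ only in casing or accents map to the same hash. With the default they map to different hashes. The sentences themselves are never modified, only the key that is hashed.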
A ranking column is added at the end of each line. When not using the --aggressive_dedup feature, the number is set to 1 by default. When using the --aggressive_dedup feature, a float number is provided. This number (the higher the better) will be used at a later step to help the deduplication algorithm choose the best sentence from those sharing the same hash. If the ranking number is exactly the same for a group of sentences sharing the same hash, only a random one should be kept. Otherwise, the one with the highest ranking number should be kept.
Input file:
http://www.ehyz.com/2.html.tmp http://www.ehyz.com/2.html.tmp 1 year ago NuVid Hace 1 año NuVid
http://pandafoundation.com/index.php?page=7 http://pandafoundation.com/index.php?page=26 ©2007 Chengdu Research Base of Giant Panda Breeding ! All Rights Reserved ©2017 Fundación para la Investigación de Cría del Panda Gigante de Chengdu/ ¡Todos los derechos reservados!
http://www.boliviamall.com/4520.html http://www.boliviamall.com/4520.html Welcome Guest 1! Would you like to log in ? Bienvenido Invitado 1! ¿Le gustaria entrar ?
http://pandafoundation.com/index.php?page=157 http://pandafoundation.com/index.php?page=76 ©2007 Chengdu Research Base of Giant Panda Breeding ! All Rights Reserved ©2017 Fundación para la Investigación de Cría del Panda Gigante de Chengdu/ ¡Todos los derechos reservados!
http://www.ehyz.com/6.html.tmp http://www.ehyz.com/6.html.tmp 1 year ago NuVid Hace 1 año NuVid
http://www.boliviamall.com/4305.html http://www.boliviamall.com/4305.html Welcome Guest 12! Would you like to log in? ¡Bienvenido invitado 12! ¿Le gustaria entrar?
Output file (using the '--aggressive_dedup' feature, otherwise ranking number would be 1 in all cases):
http://www.ehyz.com/2.html.tmp http://www.ehyz.com/2.html.tmp 1 year ago NuVid Hace 1 año NuVid 9f1f7c6fc775a23a 88.25
http://pandafoundation.com/index.php?page=7 http://pandafoundation.com/index.php?page=26 ©2007 Chengdu Research Base of Giant Panda Breeding ! All Rights Reserved ©2017 Fundación para la Investigación de Cría del Panda Gigante de Chengdu/ ¡Todos los derechos reservados! d0278d1279f06823 91.93
http://www.boliviamall.com/4520.html http://www.boliviamall.com/4520.html Welcome Guest 1! Would you like to log in ? Bienvenido Invitado 1! ¿Le gustaría entrar ? e8f129b1624b9f5d 91.22
http://pandafoundation.com/index.php?page=157 http://pandafoundation.com/index.php?page=76 ©2007 Chengdu Research Base of Giant Panda Breeding ! All Rights Reserved ©2017 Fundación para la Investigación de Cría del Panda Gigante de Chengdu/ ¡Todos los derechos reservados! d0278d1279f06823 91.93
http://www.ehyz.com/6.html.tmp http://www.ehyz.com/6.html.tmp 1 year ago NuVid Hace 1 año NuVid 9f1f7c6fc775a23a 88.25
http://www.boliviamall.com/4305.html http://www.boliviamall.com/4305.html Welcome Guest 12! Would you like to log in? ¡Bienvenido invitado 12! ¿Le gustaría entrar? 422aeefd8f056b30 92.78
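The deduplication rule described above (keep the highest ranking per hash, pick a random line on ties) can be sketched as follows. This is an illustrative sketch rather than a tool shipped with Bifixer: the function name deduplicate is hypothetical, and it assumes tab-separated lines whose last two fields are the hash and the ranking, that is, output produced without --annotated_output.

```python
import random


def deduplicate(lines, rng=random):
    """Keep one line per hash: the highest ranking wins, ties are broken at random."""
    best = {}  # hash -> (best ranking seen so far, lines that share that ranking)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: hash and ranking are the last two tab-separated fields.
        pair_hash, ranking = fields[-2], float(fields[-1])
        if pair_hash not in best or ranking > best[pair_hash][0]:
            best[pair_hash] = (ranking, [line])
        elif ranking == best[pair_hash][0]:
            best[pair_hash][1].append(line)
    # One surviving line per hash, in order of first appearance of each hash.
    return [rng.choice(candidates) for _, candidates in best.values()]
```

Applied to the output file above, the two NuVid lines (hash 9f1f7c6fc775a23a) and the two Chengdu lines (hash d0278d1279f06823) would each collapse to a single line, while the two "Welcome Guest" lines have different hashes and would both be kept.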
All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.