bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal
GNU General Public License v3.0

Obtain a clean output #3

Closed by jgcb00 3 years ago

jgcb00 commented 3 years ago

Hi, I would like to suggest an improvement. It would be great to have an option that produces a clean output, that is, an output without the duplicated sentences and with only the best sentences. Regards

ZJaume commented 3 years ago

Hi @jgcb00,

The features you mention were left out on purpose, both to leave that responsibility to the user and because it fits better in very large pipelines, like ours in Paracrawl. If you don't know how to do it properly, this is the pipeline I use:

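# Assumes tab-separated columns with the hash in column 3 and the score in column 4
cat corpus.bifixed \
    | LC_ALL=C sort -t$'\t' -k3,3 -k4,4nr \
    | LC_ALL=C sort -t$'\t' -k3,3 -u \
    > corpus.dedup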

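The first sort groups duplicated sentences (those with the same value in the third column) together, with the highest-scoring one at the top of each group. The second sort deduplicates on that column and keeps the first sentence of each group, which is the one with the highest score.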
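As a rough illustration of the two-step sort, here is a minimal sketch on a made-up three-line sample. The function name `dedup_best` and the sample rows are hypothetical. It assumes tab-separated input with the hash in column 3 and the score in column 4, and a `sort` that, like GNU sort, keeps the first line of each run of equal keys when `-u` is given.

```shell
# Hypothetical helper wrapping the two sorts described above.
# Assumes tab-separated input: source, target, hash, score.
dedup_best() {
    TAB=$(printf '\t')
    # 1) group rows by hash (column 3), highest score (column 4) first
    # 2) keep one row per hash; GNU sort -u keeps the first row of each group
    LC_ALL=C sort -t "$TAB" -k3,3 -k4,4nr \
        | LC_ALL=C sort -t "$TAB" -k3,3 -u
}

# Made-up sample: two rows share hashA, one row has hashB.
printf 'Hello\tHola\thashA\t50.1\nBye\tAdios\thashB\t10.0\nHello!\tHola!\thashA\t72.3\n' \
    | dedup_best
```

With this sample, the `hashA` row scored 72.3 should survive and the one scored 50.1 should be dropped. Note that the output is ordered by hash, not by the original line order.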

jgcb00 commented 3 years ago

Hi, thank you very much. I used this instead, to keep the original line order:

cat -n dataset_en_nl.sans_seg.0.7 \
    | sort -t$'\t' -k7,7 -k8,8nr \
    | awk -F'\t' '!seen[$7]++' \
    | sort -n \
    | cut -f2- \
    > dataset_en_nl.sans_seg.dedup.0.7

Thanks for your help!
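For readers who want to try the order-preserving variant on a small sample, the following is a sketch only. The function name `dedup_keep_order` and its column argument are hypothetical, and it assumes the score sits in the column right after the hash. `LC_ALL=C` is added on the first sort so that scores like `72.3` are parsed the same way regardless of locale.

```shell
# Hypothetical helper: deduplicate by hash, keep the best-scored row,
# and restore the original line order.
# $1: 1-based column of the hash in the input; score assumed in the next column.
dedup_keep_order() {
    h=$(( $1 + 1 ))   # cat -n prepends a line-number column, shifting columns by one
    s=$(( h + 1 ))
    TAB=$(printf '\t')
    cat -n \
        | LC_ALL=C sort -t "$TAB" -k"$h","$h" -k"$s","$s"nr \
        | awk -F'\t' -v h="$h" '!seen[$h]++' \
        | sort -n \
        | cut -f2-
}

# Made-up sample with the hash in column 3.
printf 'Hello\tHola\thashA\t50.1\nBye\tAdios\thashB\t10.0\nHello!\tHola!\thashA\t72.3\n' \
    | dedup_keep_order 3
```

Here `awk` keeps the first row seen for each hash (the best-scored one after the sort), `sort -n` puts the survivors back in their original positions using the line numbers from `cat -n`, and `cut -f2-` strips those numbers again.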