Closed jgcb00 closed 3 years ago
Hi @jgcb00,
The features you mention were left out on purpose: this leaves that responsibility to the user, and it also fits better into very large pipelines, like ours in Paracrawl. If you are not sure how to do it properly, this is the pipeline I use:
cat corpus.bifixed \
| LC_ALL=C sort -t $'\t' -k3,3 -k4,4nr \
| LC_ALL=C sort -t $'\t' -k3,3 -u \
> corpus.dedup
The first sort groups duplicated sentences together (column 3 is the hash), with the highest score (column 4) at the top of each group. The second sort then deduplicates on the hash, keeping the highest-scoring sentence of each group.
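To make the two passes concrete, here is a minimal sketch on a made-up four-line file. It assumes tab-separated input with the hash in column 3 and the score in column 4; the file names `toy.bifixed` and `toy.dedup`, and all sentences, hashes, and scores, are invented for illustration.

```shell
# Toy input (hypothetical data): source, target, hash, score, tab-separated.
printf 'Hello\tHallo\th1\t0.5\n'    > toy.bifixed
printf 'Hello!\tHallo!\th1\t0.9\n' >> toy.bifixed
printf 'Bye\tDoei\th2\t0.7\n'      >> toy.bifixed
printf 'Bye.\tDoei.\th2\t0.2\n'    >> toy.bifixed

# A literal tab, written portably (works in sh as well as bash).
TAB="$(printf '\t')"

# Pass 1: group lines by hash, highest score first within each group.
# Pass 2: keep one line per hash; GNU sort -u outputs the first line
#         of each run of equal keys, i.e. the highest-scoring one.
LC_ALL=C sort -t "$TAB" -k3,3 -k4,4nr toy.bifixed \
  | LC_ALL=C sort -t "$TAB" -k3,3 -u \
  > toy.dedup

cat toy.dedup
```

With this input, `toy.dedup` should contain only the `Hello!` line (score 0.9) and the `Bye` line (score 0.7), one per hash.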
Hi, thank you very much. I used this variant, which keeps the original line order:
cat -n dataset_en_nl.sans_seg.0.7 | sort -t$'\t' -k7,7 -k8,8nr | awk -F'\t' '!seen[$7]++' |sort -n | cut -f2- > dataset_en_nl.sans_seg.dedup.0.7
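The order-preserving command above can be read step by step. The following is a sketch on invented data, assuming a four-column file (hash in column 3, score in column 4), so that after `cat -n` prepends a line number the hash and score move to columns 4 and 5. The file names `toy2.bifixed` and `toy2.dedup` are hypothetical.

```shell
# Toy input (hypothetical data): source, target, hash, score, tab-separated.
printf 'Hello\tHallo\th1\t0.5\n'    > toy2.bifixed
printf 'Bye\tDoei\th2\t0.7\n'      >> toy2.bifixed
printf 'Hello!\tHallo!\th1\t0.9\n' >> toy2.bifixed

TAB="$(printf '\t')"

# 1. cat -n   : prepend the original line number as a new first column.
# 2. sort     : group by hash (now col 4), best score (now col 5) first.
# 3. awk      : keep only the first line seen for each hash.
# 4. sort -n  : restore the original order using the line number.
# 5. cut -f2- : drop the line-number column again.
cat -n toy2.bifixed \
  | LC_ALL=C sort -t "$TAB" -k4,4 -k5,5nr \
  | awk -F'\t' '!seen[$4]++' \
  | sort -n \
  | cut -f2- \
  > toy2.dedup

cat toy2.dedup
```

Here the surviving lines should be `Bye` followed by `Hello!`, the same relative order they had in the input, whereas the plain two-sort pipeline would return them ordered by hash.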
Thanks for your help!
Hi, I would like to submit an improvement idea. It would be great to have an option that produces a clean output. By clean output, I mean one without duplicated sentences, keeping only the best sentence of each set of duplicates. Regards