bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal
GNU General Public License v3.0

Added new options for deferred crawling standoff annotation #6

Closed by lpla 3 years ago

lpla commented 3 years ago

These options are needed for correct reconstruction, because bifixer can split a long input sentence into several shorter ones. This PR adds a modifier to the checksum annotation (usually in columns 6 and 7) that indicates where each subsentence is located in the sentence splitter's output, provided the same sentence splitter is used in the Bitextor production process and at reconstruction. At reconstruction time, if this numeric indicator is present, only the specified subsentence from the splitter output is written.
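The reconstruction step described above could look roughly like the sketch below. It is only an illustration: the `checksum#N` annotation syntax, the 1-based index, and the names `parse_annotation`, `reconstruct` and `naive_splitter` are assumptions made for this example, not the format or functions this PR actually defines. The regex-based splitter is a stand-in for whichever sentence splitter Bitextor uses.

```python
import re
from typing import Callable, List, Optional, Tuple


def naive_splitter(text: str) -> List[str]:
    """Stand-in splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def parse_annotation(annotation: str) -> Tuple[str, Optional[int]]:
    """Split a hypothetical 'checksum#N' annotation into (checksum, N).

    N is None when the annotation carries no numeric modifier.
    """
    checksum, sep, index = annotation.partition("#")
    if sep and index.isdigit():
        return checksum, int(index)
    return annotation, None


def reconstruct(
    annotation: str,
    full_sentence: str,
    splitter: Callable[[str], List[str]] = naive_splitter,
) -> str:
    """Return the text to write to the output for one annotated column."""
    _, index = parse_annotation(annotation)
    if index is None:
        # No modifier: the whole crawled sentence is written unchanged.
        return full_sentence
    parts = splitter(full_sentence)
    if not 1 <= index <= len(parts):
        raise ValueError(f"subsentence {index} not found in splitter output")
    # Modifier present: write only the indicated subsentence (1-based here).
    return parts[index - 1]


if __name__ == "__main__":
    text = "First part. Second part! Third part?"
    print(reconstruct("abc123#2", text))
    print(reconstruct("abc123", text))
```

The index is only meaningful if the splitter used here produces the same segmentation as the one used in production, which is why the PR description requires the same sentence splitter on both sides.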

This pull request relates to https://github.com/bitextor/bitextor/pull/211