bitextor / bifixer

Tool to fix bitexts and tag near-duplicates for removal
GNU General Public License v3.0
29 stars 3 forks source link

Add sentence enumeration to paragraph identification when sentences are split #12

Closed cgr71ii closed 2 years ago

cgr71ii commented 2 years ago

If paragraph identification data is provided in the input file and sentences are split, we would like to keep the sentence enumeration in order to reconstruct the split sentences.

The new behaviour is very similar to deferred (flags --sdeferredcol and --tdeferredcol). So, flags --sparagraphid and --tparagraphid are introduced in order to specify the columns of the source and target paragraph identification data.

If sentences are split, the value #{no. sentence} will be added to source and target paragraph identifiers, like it is done with deferred.

ZJaume commented 2 years ago

Is it expected to add two ids?

echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a  long. Sentence that is going to be splitted ? \tp0s1" | monofixer --scol 1 --sparagraphid 2 - - en 
2022-02-14 17:39:19,039 - INFO - Arguments processed.
2022-02-14 17:39:19,040 - INFO - Executing main program...
2022-02-14 17:39:19,040 - INFO - Starting fixing text
hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long.      p0s1#1  02cc36dcd3f171e8        1
Sentence that is going to be splitted?  p0s1#1#2        9cc706c938726cf6        1
2022-02-14 17:39:19,049 - INFO - Text fixing finished
2022-02-14 17:39:19,049 - INFO - Finished
2022-02-14 17:39:19,049 - INFO - Input lines: 1 rows
2022-02-14 17:39:19,049 - INFO - Output lines: 2 rows
2022-02-14 17:39:19,049 - INFO - Elapsed time 0.01 s
2022-02-14 17:39:19,049 - INFO - Troughput: 108 rows/s
2022-02-14 17:39:19,049 - INFO - Output file: <stdout>
2022-02-14 17:39:19,049 - INFO - Program finished
cgr71ii commented 2 years ago

No, it is not expected. I just trusted the deferred implementation since the behaviour for the paragraphs identification is very similar. I though that line https://github.com/bitextor/bifixer/blob/276cc2c4850ee7857a63972ddc0e0a1b8fa546db/bifixer/monofixer.py#L207 was due to the deferred reconstructor, but now I'm thinking that the reason for that line might be to avoid the problem that you are saying (i.e. add multiple identifiers). The problem is that, if you execute

echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a  long. Sentence that is going to be splitted ? \tp0s1\t0123456789" | monofixer --scol 1 --sparagraphid 2 --sdeferredcol 3 - - en

you can see that the output will be:

hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long.      p0s1#1  0123456789#1    02cc36dcd3f171e8 1

The output skips the output of the following split lines for the deferred, but I didn't added the mentioned line since I though it was due to the deferred reconstructor as I have said. Maybe @lpla might say if this is intended or this shouldn't happen this way.

cgr71ii commented 2 years ago

The problem now should be fixed, at least for the provided example. But the problem with the deferred still remains. I've checked the code a bit and I don't know if line https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/monofixer.py#L201 is intended, since it is a reference, which throughout the code is being modified, so when a line is split and modified due to, at least, deferred or paragraph identification, the next time which should be modified, it will already contain the previous change, what leaded to the initial problem.

cgr71ii commented 2 years ago

I've checked out that the same problem applies to Bifixer as well. In the case of the deferred, only the first line is printed when resplit, and multiple identifiers are added in the case of the paragraphs. Still I really think that line https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/bifixer.py#L228 it is intended to work with the deferred reconstructor (I would need feedback from @lpla to be sure). In that case, these bugs might be fixed using copy.copy or copy.deepcopy, depending on the elements, at line https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/monofixer.py#L201 but @ZJaume should clarify if this array it is intended to be a reference or doesn't matter if it is a copy (I think that doesn't matter to be a copy if no processing is applied through the resplit segments loop)

ZJaume commented 2 years ago

I think that https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/monofixer.py#L201 wasn't intended to be a reference, but it didn't gave any issue because parts was not being modified. If you really need to query the parts variable as it was prior to modifications, please feel free to do deepcopy.

cgr71ii commented 2 years ago

The commented issues should be solved now with the previous commits: