Closed cgr71ii closed 2 years ago
Is it expected to add two ids?
echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long. Sentence that is going to be splitted ? \tp0s1" | monofixer --scol 1 --sparagraphid 2 - - en
2022-02-14 17:39:19,039 - INFO - Arguments processed.
2022-02-14 17:39:19,040 - INFO - Executing main program...
2022-02-14 17:39:19,040 - INFO - Starting fixing text
hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long. p0s1#1 02cc36dcd3f171e8 1
Sentence that is going to be splitted? p0s1#1#2 9cc706c938726cf6 1
2022-02-14 17:39:19,049 - INFO - Text fixing finished
2022-02-14 17:39:19,049 - INFO - Finished
2022-02-14 17:39:19,049 - INFO - Input lines: 1 rows
2022-02-14 17:39:19,049 - INFO - Output lines: 2 rows
2022-02-14 17:39:19,049 - INFO - Elapsed time 0.01 s
2022-02-14 17:39:19,049 - INFO - Troughput: 108 rows/s
2022-02-14 17:39:19,049 - INFO - Output file: <stdout>
2022-02-14 17:39:19,049 - INFO - Program finished
No, it is not expected. I just trusted the deferred implementation since the behaviour for the paragraphs identification is very similar. I though that line https://github.com/bitextor/bifixer/blob/276cc2c4850ee7857a63972ddc0e0a1b8fa546db/bifixer/monofixer.py#L207 was due to the deferred reconstructor, but now I'm thinking that the reason for that line might be to avoid the problem that you are saying (i.e. add multiple identifiers). The problem is that, if you execute
echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long. Sentence that is going to be splitted ? \tp0s1\t0123456789" | monofixer --scol 1 --sparagraphid 2 --sdeferredcol 3 - - en
you can see that the output will be:
hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long. p0s1#1 0123456789#1 02cc36dcd3f171e8 1
The output skips the output of the following split lines for the deferred, but I didn't added the mentioned line since I though it was due to the deferred reconstructor as I have said. Maybe @lpla might say if this is intended or this shouldn't happen this way.
The problem now should be fixed, at least for the provided example. But the problem with the deferred still remains. I've checked the code a bit and I don't know if line https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/monofixer.py#L201 is intended, since it is a reference, which throughout the code is being modified, so when a line is split and modified due to, at least, deferred or paragraph identification, the next time which should be modified, it will already contain the previous change, what leaded to the initial problem.
I've checked out that the same problem applies to Bifixer as well. In the case of the deferred, only the first line is printed when resplit, and multiple identifiers are added in the case of the paragraphs. Still I really think that line https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/bifixer.py#L228 it is intended to work with the deferred reconstructor (I would need feedback from @lpla to be sure). In that case, these bugs might be fixed using copy.copy
or copy.deepcopy
, depending on the elements, at line https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/monofixer.py#L201
but @ZJaume should clarify if this array it is intended to be a reference or doesn't matter if it is a copy (I think that doesn't matter to be a copy if no processing is applied through the resplit segments loop)
I think that
https://github.com/bitextor/bifixer/blob/0c168cca9d9dee26c9ba3a82430d0ec4379dde47/bifixer/monofixer.py#L201
wasn't intended to be a reference, but it didn't gave any issue because parts
was not being modified. If you really need to query the parts
variable as it was prior to modifications, please feel free to do deepcopy
.
The commented issues should be solved now with the previous commits:
If paragraph identification data is provided in the input file and sentences are split, we would like to keep the sentence enumeration in order to reconstruct the split sentences.
The new behaviour is very similar to deferred (flags
--sdeferredcol
and--tdeferredcol
). So, flags--sparagraphid
and--tparagraphid
are introduced in order to specify the columns of the source and target paragraph identification data.If sentences are split, the value
#{no. sentence}
will be added to source and target paragraph identifiers, like it is done with deferred.