Closed anne17 closed 4 years ago
I checked the reason why this happens, and it is a tricky case.
There are many different cases of contractions. Some are pretty obvious, such as: don't -> do+not, that's -> that+is.
But some of them are more twisted: ya'll -> you+all, won't -> will+not, we'd've -> we+would+have; or in Spanish: desque -> desde+que; Galician: nestoutra -> en+esta+outra; French: auxquelles -> à+lesquelles.
Finding a simple algorithm that handles every case correctly in every language is not easy, and it did not seem worth the effort, since the main goal of FreeLing is converting text to structured data for massive text applications. Keeping track of original positions is a lower-priority feature (it may be needed for visualization, but not much more than that).
So, for these contractions, the length of the original token is simply distributed evenly among all the words resulting from the expansion.
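To make the even-distribution idea concrete, here is a minimal Python sketch. It is not FreeLing's actual code: the function name `distribute_span`, the half-open span convention, and the rounding rule (leftover characters go to the first words) are all assumptions made for illustration, and FreeLing's own rounding may differ, as the 0-2 / 3-6 offsets reported in this issue suggest.

```python
def distribute_span(begin, end, n_words):
    """Split the half-open character span [begin, end) into n_words
    consecutive pieces of (nearly) equal length.

    Hypothetical helper for illustration only, not a FreeLing API.
    """
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    base, extra = divmod(end - begin, n_words)
    spans = []
    pos = begin
    for i in range(n_words):
        # Assumption: the first `extra` words get one extra character each.
        size = base + (1 if i < extra else 0)
        spans.append((pos, pos + size))
        pos += size
    return spans


text = "That's okay."
# "That's" occupies characters 0..6 and expands into two words.
for (b, e), word in zip(distribute_span(0, 6, 2), ["that", "is"]):
    print(word, b, e, repr(text[b:e]))
```

For "That's" this yields the spans (0, 3) and (3, 6), so the split falls in the middle of "That" rather than before the apostrophe. That is the kind of mismatch an even split produces whenever the expanded words do not share the original length equally.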
However, it definitely could be better. I added some fixes that will improve the output in the most obvious cases (such as "that's"). You'll find them in the master branch.
thanks!
Sounds good, thanks! Do you have any estimate on when there will be a new release? :)
You can get it right away from the master branch. With some luck, there will be a new release this summer, although it is not certain that it will be available for Windows or Mac yet.
Thanks! We're looking forward to your next release! I guess this issue should be considered fixed, so let's close it.
The position indices in the JSON output in the following example seem wrong. This analysis was done using FreeLing 4.1 with the English standard config file. According to the JSON output, the first token starts at position 0 and ends at 2, which corresponds to the string "Th", while the second token starts at 3 and ends at 6, giving the string "t's". This looks like an error.
Call:
analyze --output json -f en.cfg
Input: That's okay.
(Shortened) output: