TALP-UPC / FreeLing

FreeLing project source code

Wrong position indices in json output #102

Closed anne17 closed 4 years ago

anne17 commented 4 years ago

The position indices in the JSON output in the following example seem wrong. This analysis was done using FreeLing 4.1 with the standard English config file. According to the JSON output, the first token starts at position 0 and ends at 2, which corresponds to the string "Th", while the second token starts at 3 and ends at 6, giving the string "t's". This looks like an error.

Call: analyze --output json -f en.cfg

Input: That's okay.

(Shortened) output:


{
    "sentences": [
        {
            "id": "1",
            "tokens": [
                {
                    "id": "t1.1",
                    "begin": "0",
                    "end": "2",
                    "form": "That",
                    "lemma": "that"
                },
                {
                    "id": "t1.2",
                    "begin": "3",
                    "end": "6",
                    "form": "'s",
                    "lemma": "be"
                },
                {
                    "id": "t1.3",
                    "begin": "7",
                    "end": "11",
                    "form": "okay",
                    "lemma": "okay"
                },
                {
                    "id": "t1.4",
                    "begin": "11",
                    "end": "12",
                    "form": ".",
                    "lemma": "."
                }
            ]
        }
    ]
}
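
For anyone who wants to reproduce the mismatch, here is a minimal Python sketch that slices the input string with the reported offsets and compares each slice to the token's form. The helper name `check_offsets` is made up for illustration, and treating `end` as exclusive is an assumption (it is the reading under which "okay" and "." line up correctly).

```python
text = "That's okay."

# Offsets and forms copied from the JSON output above.
tokens = [
    {"id": "t1.1", "begin": "0", "end": "2", "form": "That"},
    {"id": "t1.2", "begin": "3", "end": "6", "form": "'s"},
    {"id": "t1.3", "begin": "7", "end": "11", "form": "okay"},
    {"id": "t1.4", "begin": "11", "end": "12", "form": "."},
]


def check_offsets(text, tokens):
    """Return (id, form, covered_text, matches) per token.

    Assumption: "end" is exclusive, so the covered text is text[begin:end].
    """
    rows = []
    for tok in tokens:
        begin, end = int(tok["begin"]), int(tok["end"])
        covered = text[begin:end]
        rows.append((tok["id"], tok["form"], covered, covered == tok["form"]))
    return rows


report = check_offsets(text, tokens)
for row in report:
    print(row)
```

Under that assumption, the slices for the two tokens coming from "That's" are "Th" and "t's", while the last two tokens match their forms.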

lluisp commented 4 years ago

I checked the reason why this happens, and it is a tricky case.

There are many different cases of contractions. Some are pretty obvious, such as:

don't -> do+not
that's -> that+is

But some of them are more twisted:

ya'll -> you+all
won't -> will+not
we'd've -> we+would+have

or in Spanish: desque -> desde+que
Galician: nestoutra -> en+esta+outra
French: auxquelles -> à+lesquelles

Finding a simple algorithm that correctly handles every case in every language is not easy, and it did not seem worth the effort, since the main goal of FreeLing is converting text to structured data for massive text applications. Keeping track of original positions is a lower-priority feature (it may be needed for visualization purposes, but not much more than that).

So, what is done for these contractions is simply to distribute the length of the original token evenly among all the words resulting from the expansion.
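
To make the idea concrete, here is a rough Python sketch of an even split. This is not FreeLing's actual code: the function name `split_span_evenly`, the half-open span convention, and the rounding are all assumptions, so the exact boundaries may differ from what FreeLing reports.

```python
def split_span_evenly(begin, end, n_words):
    """Split the half-open span [begin, end) into n_words contiguous
    pieces of near-equal length (illustrative helper, not FreeLing code)."""
    length = end - begin
    # Boundary i sits at i/n_words of the way through the span (rounded down).
    bounds = [begin + (length * i) // n_words for i in range(n_words + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


text = "That's okay."

# "That's" covers characters 0..5 and expands to two words (that + be).
spans = split_span_evenly(0, 6, 2)
pieces = [text[b:e] for b, e in spans]
print(spans, pieces)
```

Each expanded word gets an equal share of the original token's characters, regardless of where the real boundary between the words lies. That is why the reported positions cannot line up with "That" and "'s".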

However, it definitely could be better. I added some fixes that will improve the output in the most obvious cases (such as "that's"). You'll find them in the master branch.

thanks!

anne17 commented 4 years ago

Sounds good, thanks! Do you have any estimate on when there will be a new release? :)

lluisp commented 4 years ago

You can get it right away from the master branch. With some luck, there will be a new release this summer, although it is not certain that it will be available for Windows or Mac yet.

anne17 commented 4 years ago

Thanks! We're looking forward to your next release! I guess this issue should be considered fixed, so let's close it.