TALP-UPC / FreeLing

FreeLing project source code

Error in "form" information in conll and json output? #103

Closed anne17 closed 4 years ago

anne17 commented 4 years ago

It is not clear from the documentation what exactly the "form" information in the FreeLing output represents, but I expected it to be the word form as it occurs in the input. This is not always the case, though. In the following French example, the strings "au" and "des" do not occur anywhere in the "form" information (the first column after the token index). This poses a problem because the output can no longer be mapped to the input. The analysis was done with FreeLing 4.1.

Call: analyze --output conll -f fr.cfg

Input: Elle est liée au développement des systèmes informatiques.

Output (with some columns removed for simplicity):


1  Elle          elle          PP3FS00 PP
2  est           être          VSIP3S0 VSI
3  liée          lier          VMP00SF VMP
4  à             à             SP      SP
5  le            le            DA0MS0  DA
6  développement développement NCMS000 NC
7  de            de            SP      SP
8  les           le            DA0CP0  DA
9  systèmes      système       NCMP000 NC
10 informatiques informatique  AQ0CP00 AQ
11 .             .             Fp      Fp
lluisp commented 4 years ago

The words "au" and "des" are contractions of "à le" and "de les", which are the forms you get in the analysis. If you want to relate the analysis to the original input, you can use the start/end offsets of each token and locate them in the original string. Alternatively, you can use the options "--nortkcon --nortk", which will prevent FreeLing from splitting those contractions.
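
The offset-based mapping suggested here could look roughly like the Python sketch below. It is only an illustration: the `(form, start, end)` tuple layout and the function name `surface_forms` are hypothetical, and it assumes that both tokens produced from one contraction carry the span of the contracted word. Check the offsets FreeLing actually reports for your data before relying on this.

```python
from typing import List, Tuple

# Hypothetical token layout: (form, start offset, end offset).
# This is NOT FreeLing's own data structure, just a stand-in for the
# values read from its output.
Token = Tuple[str, int, int]


def surface_forms(text: str, tokens: List[Token]) -> List[Tuple[str, str]]:
    """Pair each analysed form with the substring its offsets point to."""
    return [(form, text[start:end]) for form, start, end in tokens]


text = "Elle est liée au développement des systèmes informatiques."

# Offsets written by hand for this sentence. Assumption: both halves of
# a split contraction share the span of the original contracted word.
tokens: List[Token] = [
    ("liée", 9, 13),
    ("à", 14, 16),
    ("le", 14, 16),
    ("développement", 17, 30),
    ("de", 31, 34),
    ("les", 31, 34),
]

for form, surface in surface_forms(text, tokens):
    print(f"{form}\t{surface}")
```

With offsets like these, "à" and "le" both map back to "au", and "de" and "les" both map back to "des", so the analysis can be aligned with the input even though the forms differ.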

anne17 commented 4 years ago

Thanks for explaining! So this is not a bug then. But to be honest I still don't really understand what you mean by word form. At my department when we talk about "word form" we usually refer to a string in the input (as opposed to anything that's returned by an analysis). This mostly seems to be the case for FreeLing as well, with a few exceptions like the one mentioned above, or names containing underscores ("Bill Gates" will receive "Bill_Gates" as word form).

Thanks for hinting about the flags, I will definitely try those! Will they work with any language?

lluisp commented 4 years ago

FreeLing is an engineering-oriented tool, and some of the linguistic concepts may be a bit stretched. Indeed, "word form" should be the input string, but as I said before, FreeLing cares more about producing structured, processable output than about providing a perfect mapping between the original string and its analysis. So, for some tokens (e.g. contractions), "word form" is the "string that would have been in the input if it had not been contracted" (e.g. "doesn't" is the same as "does not", and "won't" is the same as "will not"). For other tokens, such as proper names, dates, numbers, etc., the form is the original string with whitespace replaced by underscores, to avoid problems in column-based formats such as CoNLL.

Yes, those flags will work in any language.
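
For anyone post-processing the output, the two cases described in this comment can be told apart with a small check once the original substring of each token is known. The Python sketch below is an assumption-laden illustration, not part of FreeLing: `classify_form` and its three labels are made-up names, and the original substring is assumed to have been recovered separately (for example from the token offsets).

```python
def classify_form(form: str, surface: str) -> str:
    """Describe how an analysed form relates to the original input string.

    'identical' : the form is exactly the input string
    'multiword' : the form is the input string with whitespace replaced by "_"
    'rewritten' : anything else, e.g. one half of a split contraction
    """
    if form == surface:
        return "identical"
    # Normalise runs of whitespace in the input before comparing.
    if "_" in form and form.replace("_", " ") == " ".join(surface.split()):
        return "multiword"
    return "rewritten"


print(classify_form("Elle", "Elle"))
print(classify_form("Bill_Gates", "Bill Gates"))
print(classify_form("le", "au"))
```

Tokens labelled "rewritten" are the ones where the form column cannot be matched against the input text directly.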

anne17 commented 4 years ago

Thanks for explaining! I am closing this now since it's not an issue with FreeLing.