Closed by anne17 4 years ago
The words "au" and "des" are contractions of "à le" and "de les", which are the forms you get in the analysis. If you want to relate the analysis to the original input, you can use the start/end offsets of each token to locate it in the original string. Alternatively, you can use the options "--nortkcon --nortk", which will prevent FreeLing from splitting those contractions.
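To make the offset idea concrete, here is a minimal Python sketch. It does not call FreeLing. The `tokens` list is hand-written, hypothetical data standing in for what an analysis might return, and it assumes that both parts of a split contraction carry the character span of the original contraction. The helper name `align` is also just an illustration.

```python
from itertools import groupby

text = "Elle est liée au développement des systèmes informatiques."

# Hypothetical (form, start, end) triples, written by hand for illustration.
# Assumption: both halves of a split contraction share the contraction's span.
tokens = [
    ("Elle", 0, 4), ("est", 5, 8), ("liée", 9, 13),
    ("à", 14, 16), ("le", 14, 16),
    ("développement", 17, 30),
    ("de", 31, 34), ("les", 31, 34),
    ("systèmes", 35, 43), ("informatiques", 44, 57), (".", 57, 58),
]

def align(text, tokens):
    """Group consecutive tokens that share a span and pair each group
    with the substring of the input that the span covers."""
    aligned = []
    for (start, end), group in groupby(tokens, key=lambda t: (t[1], t[2])):
        forms = [form for form, _, _ in group]
        aligned.append((text[start:end], forms))
    return aligned

for surface, forms in align(text, tokens):
    print(surface, "<-", forms)
```

With this kind of grouping, the analysis forms can differ from the input and each token can still be traced back to the exact input substring it came from.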
Thanks for explaining! So this is not a bug then. But to be honest I still don't really understand what you mean by word form. At my department when we talk about "word form" we usually refer to a string in the input (as opposed to anything that's returned by an analysis). This mostly seems to be the case for FreeLing as well, with a few exceptions like the one mentioned above, or names containing underscores ("Bill Gates" will receive "Bill_Gates" as word form).
Thanks for hinting about the flags, I will definitely try those! Will they work with any language?
FreeLing is an engineering-oriented tool, and some of the linguistic concepts may be a bit stretched. Indeed, "word form" should be the input string, but as I said before, FreeLing cares more about producing structured, processable output than about providing a perfect mapping between the original string and its analysis. So, for some tokens (e.g. contractions), "word form" is the "string that would have been in the input if it had not been contracted" (e.g. "doesn't" is the same as "does not", and "won't" is the same as "will not"). For other tokens such as proper names, dates, numbers, etc., the form is the original string with whitespace replaced by underscores, to avoid problems in column-based formats such as CoNLL.
Yes, those flags will work in any language.
Thanks for explaining! I am closing this now since it's not an issue with FreeLing.
It is not clear from the documentation what exactly the "form" information in the FreeLing output represents, but I expected it to be the word form as it occurs in the input. This is not always the case, though. In the following French example, the strings "au" and "des" do not occur anywhere in the "form" information (first column). This is a problem because the output can no longer be mapped to the input. The analysis was done with FreeLing 4.1.
Call:
analyze --output conll -f fr.cfg
Input: Elle est liée au développement des systèmes informatiques.
Output (with some columns removed for simplicity):