TALP-UPC / FreeLing

FreeLing project source code

Problem with proper names when using dashes #36

Closed: morethanbooks closed this issue 7 years ago

morethanbooks commented 7 years ago

Hi, we are using FreeLing to annotate Spanish novels and we have found a bug. The POS analyser correctly analyses a sentence like: "-Estamos desorientados -murmuró el hombre tranquilamente-; nos hemos debido de perder."

In this case FreeLing says that "Estamos" is a verb. But if you use any kind of dash instead of a hyphen, it says that "Estamos" is a proper noun (and when using the NEC, it says that it is a person):

—Estamos desorientados —murmuró el hombre tranquilamente—; nos hemos debido de perder.

You find both hyphens and dashes at the beginning of direct speech in novels (although dashes are actually more correct). It would be great if FreeLing could treat the most frequent dashes (– and —) in the same way as hyphens. Is there a file in my installed FreeLing version where I can easily add the dashes as punctuation? Thanks!

lluisp commented 7 years ago

That is not a bug, it is intentional.

FreeLing assumes that punctuation is encoded in ASCII characters. For dashes, that is "-", "--", or even "---".

There are many other Unicode symbols for dashes, quotes, etc. (e.g. see http://www.fileformat.info/info/unicode/category/Pd/list.htm), and it would be a nightmare to try to recognize them all.

You can customize your FreeLing installation by adding the required symbols to the punctuation definition file. It is located in data/common/punct.dat (in the source tarball) or in /usr/local/share/freeling/common/punct.dat after installation. See the user manual section on the "punctuation" module to find out more about the format of the file (though it is quite straightforward, and you probably only need to copy the line for the ASCII dash and replace the dash with your own symbol).

If that does not work, you can always preprocess your texts, replacing Unicode dashes with ASCII dashes.
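
The preprocessing route could look something like the sketch below. It is only an illustration in Python, not part of FreeLing: the function name `normalize_dashes` and the choice of which code points to map (en dash U+2013, em dash U+2014, horizontal bar U+2015) are assumptions, so extend the set to whatever your texts actually contain.

```python
import re

# Assumption: these three code points cover the dashes used in the corpus.
# U+2013 = en dash, U+2014 = em dash, U+2015 = horizontal bar.
UNICODE_DASHES = re.compile("[\u2013\u2014\u2015]")


def normalize_dashes(text: str) -> str:
    """Replace each Unicode dash in `text` with a single ASCII hyphen-minus."""
    return UNICODE_DASHES.sub("-", text)


if __name__ == "__main__":
    sample = "\u2014Estamos desorientados \u2014murmur\u00f3 el hombre tranquilamente\u2014; nos hemos debido de perder."
    print(normalize_dashes(sample))
```

Run this over each text before passing it to the analyzer. Other characters, including accented letters, are left untouched.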

morethanbooks commented 7 years ago

Hi, perfect. I have edited the file (it was exactly there), tested it, and it works perfectly. Many thanks for the answer. It was a great decision to work with FreeLing: great tool. Best regards, José Calvo