LanguageMachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)
https://proycon.github.io/folia
GNU General Public License v3.0

Operation 'hemp' parameter in FoLiA-stats #29

Open martinreynaert opened 5 years ago

martinreynaert commented 5 years ago

The hemp parameter in FoLiA-stats collects spaced words. It currently breaks on ligatures (see the example below). It also fails to collect the last letter when that letter is followed by a punctuation mark, which happens often.

reynaert@black:/reddata/PILOTS/LEVITICUS$ grep 'F r a n' /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/levit.03.NoForeigns.folia.xml.txt
F r a n s c h zal Z. F r a n k r ij k. uitgeoefend. Z. F r a n k r ij k. F r a n k r ij k.
reynaert@black:/reddata/PILOTS/LEVITICUS$ cat TESTFRQ/TESTFRQFOLIAtagdiv.hemp |grep 'F_r_a_n'
F_r_a_n_k_r

1/ Ligatures should be seen as single characters.
2/ A final character with a trailing punctuation mark should also be collected.

Perhaps both little issues might be solved by allowing for the 'occasional' two character sequence, given repetitions of single characters in historically emphasised text.
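To make the suggestion above concrete, here is a minimal Python sketch of such a tolerant check. It is not how FoLiA-stats works; the function name `looks_emphasised` and the `max_two_char` limit are hypothetical, and the minimum of two tokens is an assumption.

```python
import string


def looks_emphasised(tokens, max_two_char=1):
    """Return True if the tokens look like one spaced-out (emphasised) word.

    Assumptions for this sketch: at least two tokens, trailing punctuation
    is tolerated on the final token only, and at most `max_two_char`
    tokens may be two characters long (e.g. 'ij').
    """
    if len(tokens) < 2:
        return False
    # Strip trailing punctuation from the final token only.
    core = list(tokens[:-1]) + [tokens[-1].rstrip(string.punctuation)]
    # Every remaining token must be one or two characters long.
    if any(len(t) not in (1, 2) for t in core):
        return False
    # Allow only the 'occasional' two-character token.
    return sum(len(t) == 2 for t in core) <= max_two_char


print(looks_emphasised("F r a n k r ij k.".split()))
print(looks_emphasised("in de la".split()))
```

With the default limit, the run 'F r a n k r ij k.' is accepted, while a run consisting only of two-letter words is rejected. Whether a limit of one is permissive enough for real historical text would have to be checked on the corpus.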

martinreynaert commented 5 years ago

I was notified that FoLiA-stats, as installed on the new server 'violet', should now be able to handle ligatures.

I tested this on 'violet'. Note this was the very first time I ran any FoLiA- or TICCL tool on this new machine.

It seemed very slow.

And it did not work, as can be seen from the output file:

reynaert@violet:/reddata$ grep 'F_r_a_n' /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp
F_r_a_n_k_r F_r_a_n_s_c_h

The command run was:

reynaert@violet:/reddata$ /exp/sloot/usr/local/bin/FoLiA-stats --max-ngram=3 --separator='_' --collect --tags=div -t max --hemp=/reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp -e folia.xml$ -o /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/

kosloot commented 5 years ago

Ok, closer examination of the provided data reveals that the 'ij' ISN'T a ligature but indeed just 2 separate characters. So the patch to handle multi-byte characters didn't work out. I assume the conversion to FoLiA already 'solved' the ligature.
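The difference between a real ligature and the two-letter sequence can be checked directly. The Python sketch below compares the Unicode ligature U+0133 with the plain letters 'ij'. That a compatibility normalization (NFKC) was applied during conversion is only an assumption used for illustration; the thread does not say which step replaced the ligature.

```python
import unicodedata

ligature = "\u0133"  # LATIN SMALL LIGATURE IJ: one code point
digraph = "ij"       # two ordinary letters: two code points

# Code points versus UTF-8 bytes for each form.
print(len(ligature), len(ligature.encode("utf-8")))  # 1 2
print(len(digraph), len(digraph.encode("utf-8")))    # 2 2

# NFKC normalization replaces the ligature by the two separate letters.
print(unicodedata.normalize("NFKC", ligature) == digraph)  # True
```

Both forms occupy two bytes in UTF-8, so a byte count alone cannot tell them apart; only the number of code points differs.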

We really need to be more lax here and accept 2-byte sequences too. This might well turn out to be too permissive, in which case we could add restrictions, like 'only certain 2-grams' and 'a punctuation mark, but only in the last position'.

kosloot commented 5 years ago

Ok, I improved 'hemp' detection. The bi-gram 'ij' is now always accepted, and bi-grams with a trailing punctuation mark too, but the latter are assumed to END the 'hemp'. @martinreynaert please test this, it is installed on violet.

kosloot commented 5 years ago

@martinreynaert I would like to improve and clarify 'hemp' detection a bit, especially since we are now using the same procedure in FoLiA-correct. I will use some corner cases to illustrate the difficulties.

Take the following examples:

  1. H E M P
  2. een H E M P dus
  3. een H E M P in een zin

I suppose the hemp to be detected in all three cases is H_E_M_P.

Some cases with a punctuated hemp:

  1. H E M P.
  2. een H E M P. dus
  3. een H E M P. in een zin
  4. een H E. M P. in een zin

Cases 1, 2 and 3 will give the hemp 'H_E_M_P.' (including the period). Case 4 will give 2 hemps, 'H_E.' and 'M_P.', as we consider a punctuated 2-gram to be a hemp-stopper. This may be questionable.

1-digit numbers can also be part of a hemp, as in: 1 2 3 yielding 1_2_3. But note that the punctuated variant 1 2. 3 does not yield any hemp. Probably 1_2. is desired, or even 1_2._3?

NOTE: as an exception the bi-gram 'ij' (and case variants) is also part of a hemp.
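The rules described in this comment can be written down as a small Python sketch. This is not the FoLiA-stats code: the names `classify` and `find_hemps` are hypothetical, the minimum of two parts per hemp is an assumption, and 'punctuation' is taken to mean ASCII punctuation only.

```python
import string

SEPARATOR = "_"


def classify(token):
    """Return 'part', 'end', or None for a whitespace-separated token."""
    # A single letter or a 1-digit number is part of a hemp.
    if len(token) == 1 and token.isalnum():
        return "part"
    # Exception: the bi-gram 'ij' (and case variants) is also part of a hemp.
    if token.lower() == "ij":
        return "part"
    # A punctuated 2-gram is accepted, but it ends the hemp.
    if len(token) == 2 and token[0].isalnum() and token[1] in string.punctuation:
        return "end"
    return None


def find_hemps(text, min_parts=2):
    """Collect hemps from text; `min_parts` is an assumed lower bound."""
    hemps, run = [], []

    def flush():
        if len(run) >= min_parts:
            hemps.append(SEPARATOR.join(run))
        run.clear()

    for token in text.split():
        kind = classify(token)
        if kind is None:
            flush()
        else:
            run.append(token)
            if kind == "end":
                flush()
    flush()
    return hemps


print(find_hemps("een H E M P in een zin"))
print(find_hemps("een H E. M P. in een zin"))
print(find_hemps("Z. F r a n k r ij k. uitgeoefend."))
print(find_hemps("1 2. 3"))
```

Under these assumptions, 'F r a n k r ij k.' is collected as a single hemp, and '1 2. 3' gives '1_2.' with the trailing '3' dropped. That is one of the candidate answers to the open question about digits, not the behaviour reported above.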

To summarize: We need a clear definition of a hemp :)

kosloot commented 4 years ago

still waiting for an answer