Open martinreynaert opened 5 years ago
I was notified that FoLiA-stats, as installed on the new server 'violet', should now be able to handle ligatures.
I tested this on 'violet'. Note this was the very first time I ran any FoLiA- or TICCL tool on this new machine.
It seemed very slow.
And it did not work, as can be seen from the output file:
reynaert@violet:/reddata$ grep 'F_r_a_n' /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp
F_r_a_n_k_r
F_r_a_n_s_c_h
The command run was:
reynaert@violet:/reddata$ /exp/sloot/usr/local/bin/FoLiA-stats --max-ngram=3 --separator='_' --collect --tags=div -t max --hemp=/reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW.hemp -e folia.xml$ -o /reddata/PILOTS/LEVITICUS/TESTFRQ/TESTFRQFOLIAtagdivNEW /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/
Ok, closer examination of the provided data reveals that the 'ij' ISN'T a ligature but indeed just 2 separate characters. So the patch to handle multi-byte characters didn't help here. I assume the conversion to FoLiA already 'solved' the ligature.
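To check whether a given 'ij' in the data is the ligature or two separate characters, one can list the code points. A minimal Python sketch, not part of FoLiA-stats; `describe_chars` is a hypothetical helper name:

```python
import unicodedata

def describe_chars(token):
    """Return (character, code point, Unicode name) for each character in token."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in token]

two_chars = "ij"      # plain 'i' followed by 'j'
ligature = "\u0133"   # LATIN SMALL LIGATURE IJ, a single code point

print(describe_chars(two_chars))
print(describe_chars(ligature))

# Both forms take 2 bytes in UTF-8, so byte length alone cannot tell them apart.
print(len(two_chars.encode("utf-8")), len(ligature.encode("utf-8")))
```

Only the second form is a single character, so a check based on "one character per token" will accept the ligature but reject the two-letter 'ij'.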
We really need to be more lax here and accept 2-byte sequences too. This might well turn out to be too permissive, in which case we could add restrictions, like 'only certain 2-grams' and 'a punctuation mark, but only in the last position'.
Ok, I improved 'hemp' detection. The bi-gram 'ij' is now always accepted, and bi-grams with a trailing punctuation mark too, but the latter are assumed to END the 'hemp'. @martinreynaert please test this, it is installed on violet.
@martinreynaert I would like to improve and clarify 'hemp' detection a bit, especially since we are now using the same procedure in FoLiA-correct. I will use some corner cases to illustrate the difficulties.
Take the following examples:
H E M P
een H E M P dus
een H E M P in een zin
I suppose the hemp to be detected is H_E_M_P
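The plain cases above can be sketched as follows. This is a minimal Python illustration of the idea, not the FoLiA-stats implementation; `find_hemps` is a hypothetical helper, and requiring at least two single-character tokens for a hemp is an assumption of the sketch.

```python
def find_hemps(line, separator="_"):
    """Join each run of single-character tokens into one hemp."""
    hemps, run = [], []
    for token in line.split():
        if len(token) == 1 and token.isalnum():
            run.append(token)          # single letter or digit: extend the run
            continue
        if len(run) > 1:               # any other token closes the run
            hemps.append(separator.join(run))
        run = []
    if len(run) > 1:                   # run that reaches the end of the line
        hemps.append(separator.join(run))
    return hemps

for example in ["H E M P", "een H E M P dus", "een H E M P in een zin"]:
    print(find_hemps(example))         # ['H_E_M_P'] for each of the three
```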
Some cases with a punctuated hemp:
1. H E M P.
2. een H E M P. dus
3. een H E M P. in een zin
4. een H E. M P. in een zin
Examples 1, 2 and 3 will give the hemp: H_E_M_P.
Example 4 will give 2 hemps: H_E. and M_P. as we consider a punctuated 2-gram as a hemp-stopper.
This may be questionable....
1-digit numbers can also be part of a hemp, as in: 1 2 3, yielding 1_2_3. But note that 1_2._3 is not detected as a hemp at all. Probably 1_2. is desired, or even 1_2._3?
NOTE: as an exception the bi-gram 'ij' (and case variants) is also part of a hemp.
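To make the rules discussed so far concrete, here is a minimal Python sketch under stated assumptions. It is not the FoLiA-stats code: `classify`, `find_hemps` and `BIGRAMS` are hypothetical names, "punctuation" is taken to mean ASCII punctuation, and a hemp is assumed to need at least two tokens.

```python
import string

PUNCT = set(string.punctuation)   # assumption: ASCII punctuation only
BIGRAMS = {"ij"}                  # accepted 2-grams, compared case-insensitively

def classify(token):
    """Return 'part', 'end' or 'break' for one whitespace-separated token."""
    if len(token) == 1 and token.isalnum():
        return "part"             # single letter or digit
    if token.lower() in BIGRAMS:
        return "part"             # the 'ij' exception
    if len(token) == 2 and token[0].isalnum() and token[1] in PUNCT:
        return "end"              # punctuated 2-gram: joins the hemp, then stops it
    return "break"

def find_hemps(line, separator="_"):
    hemps, run = [], []

    def flush():
        if len(run) > 1:          # a lone token is not a hemp
            hemps.append(separator.join(run))
        run.clear()

    for token in line.split():
        kind = classify(token)
        if kind == "break":
            flush()
            continue
        run.append(token)
        if kind == "end":
            flush()
    flush()
    return hemps

print(find_hemps("een H E. M P. in een zin"))   # ['H_E.', 'M_P.']
print(find_hemps("Z. F r a n k r ij k."))       # ['F_r_a_n_k_r_ij_k.']
print(find_hemps("1 2. 3"))                     # ['1_2.']
```

Under these rules the spaced input 1 2. 3 (one reading of the 1_2._3 case) gives 1_2. and drops the trailing 3. Getting 1_2._3 instead would mean that a punctuated 2-gram no longer stops the hemp, which would also turn example 4 into a single hemp.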
To summarize: We need a clear definition of a hemp :)
still waiting for an answer
The hemp parameter in FoLiA-stats collects spaced words. It currently breaks on ligatures (see example). It also fails to collect the last letter if this has a trailing punctuation mark, which happens often.
reynaert@black:/reddata/PILOTS/LEVITICUS$ grep 'F r a n' /reddata/PILOTS/LEVITICUS/FOLIA/NOFOREIGN/levit.03.NoForeigns.folia.xml.txt
F r a n s c h zal Z. F r a n k r ij k. uitgeoefend. Z. F r a n k r ij k. F r a n k r ij k.
reynaert@black:/reddata/PILOTS/LEVITICUS$ cat TESTFRQ/TESTFRQFOLIAtagdiv.hemp |grep 'F_r_a_n'
F_r_a_n_k_r
1/ Ligatures should be seen as single characters.
2/ A final character with a trailing punctuation mark should also be collected.
Perhaps both of these small issues could be solved by allowing for the 'occasional' two-character sequence within a run of single characters in historically emphasised text.