annotation / text-fabric

File format, model, API, and apps for manipulating text and its annotated features
MIT License

bug in defining relations between elements #61

Closed oliverglanz closed 2 years ago

oliverglanz commented 4 years ago

Problem: There is a bug in TF when it comes to the definition of relations between elements. The following query returns Gen 20:2 as a result because it treats the contents of c3 as identical to the contents of c1. This is illogical, however, since the relation between c3 and c2 is defined as c3 following c2 within 0-50 words. If c3 is to follow c2, and c2 is to follow c1, then c3 cannot occupy the same position in the clause sequence as c1:

Annotation 2020-09-26 125125

To clarify the relation between c1 and c3, and to prevent the two clauses from being (mistakenly) identified with each other, one could simply alter the query by defining the relation between c1 and c3 explicitly (c1 should be followed by c2, c2 should be followed by c3, and c1 should be followed by c3):

Annotation 2020-09-26 125254

This, however, yields no results.

Obviously, the search-engine is confused about the relations between the clauses. This confusion is not caused by the lines

speakerA .lex=lex. speakerB
addresseeA .lex=lex. addresseeB

Even without these lines the confusion remains.

Assumption: I assume that the search engine has a bug somewhere that prevents it from correctly processing complex substructures (c1 and c3 have elements within them) together with the explicit relation operators. Once the complex substructure is taken out, the clause relations are recognized correctly:

Annotation 2020-09-26 125930

With MQL these complex relations can be queried without problem: https://shebanq.ancient-data.org/hebrew/query?version=2017&id=491

dirkroorda commented 4 years ago

c1 <10: c2 means that c1 is immediately before c2 with a leeway of 10 in both directions.

So if c2 has slot number 100, c1 could have 99 + or - 10, so anything between 89 and 109, including 100, which is c1.
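
The arithmetic above can be sketched in a few lines of Python. This is a toy model, not Text-Fabric's own code: the function name `near_before` and the use of bare slot numbers are assumptions made for illustration.

```python
def near_before(c1_last_slot: int, c2_first_slot: int, k: int) -> bool:
    """Toy model of `c1 <k: c2`: c1 ends immediately before c2 starts,
    give or take k slots in either direction."""
    exact_position = c2_first_slot - 1  # where c1 would end if strictly adjacent
    return abs(c1_last_slot - exact_position) <= k


# c2 starts at slot 100 and the leeway is 10
allowed = [s for s in range(80, 121) if near_before(s, 100, 10)]
print(allowed[0], allowed[-1])  # 89 109
```

Because the leeway is symmetric, positions at or after the start of c2 are accepted as well, which is how a node that does not actually precede c2 can still satisfy the relation.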

So this is intentional. I remember earlier discussions about this point, I think with Cody, and yes, we could have defined it in another way, but that would cause other inconveniences.

dirkroorda commented 4 years ago

Now the rest of your remarks:

First I run

verse book=Genesis chapter=20
  c1:clause domain=N
  <3: clause domain=Q
  <50: c2:clause domain=N

c1 < c2

(a shorter version of your simplified query) and it also gives me 15 results (working on BHSA version c).

dirkroorda commented 4 years ago

Now let's see what happens if I run the full query against version c:

verse book=Genesis chapter=20
  c1:clause domain=N
    phrase function=Pred
      word lex=DBR[|QR>[|>MR[
    phrase function=Subj
      speakerA:word sp=subs|nmpr
    phrase function=Cmpl
      addresseeA:word sp=subs|nmpr
  <3: c2:clause domain=Q
  <50: c3:clause domain=N
    phrase function=Pred
      word lex=DBR[|QR>[|>MR[
    phrase function=Subj
      speakerB:word
    phrase function=Cmpl
      addresseeB:word

c1 < c3
speakerA .lex. speakerB
addresseeA .lex. addresseeB

I also get no results. It took me a while to understand the query, and now I understand why there are no results:

The query states that clauses c1, c2, c3 are all in the same verse! But clearly, when you allow c3 to be 50 words further, you do not expect it still to be in the same verse!
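
To see why the indentation matters, here is a toy sketch in Python. The slot numbers are invented and `inside` is a hypothetical helper; it only mimics the fact that an indented line in a search template must be embedded in its parent.

```python
from typing import Tuple

Span = Tuple[int, int]  # (first_slot, last_slot), a toy stand-in for a TF node


def inside(node: Span, container: Span) -> bool:
    """True if node lies completely within container."""
    return container[0] <= node[0] and node[1] <= container[1]


# Invented slot ranges, for illustration only
verse = (100, 120)
c2 = (110, 118)  # a clause inside the verse
c3 = (130, 140)  # starts 12 slots after c2 ends, well within the <50: leeway

print(inside(c2, verse))  # True
print(inside(c3, verse))  # False: indenting c3 under the verse rules it out
```

In this toy setup c3 satisfies the distance condition but fails the containment condition, so the combined query finds nothing.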

dirkroorda commented 4 years ago

If you postulate only c1 and c2 to be in the same verse, you have to write it like this

verse book=Genesis chapter=20
  c1:clause domain=N
    phrase function=Pred
      word lex=DBR[|QR>[|>MR[
    phrase function=Subj
      speakerA:word sp=subs|nmpr
    phrase function=Cmpl
      addresseeA:word sp=subs|nmpr
  <3: c2:clause domain=Q

c3:clause domain=N
  phrase function=Pred
    word lex=DBR[|QR>[|>MR[
  phrase function=Subj
    speakerB:word
  phrase function=Cmpl
    addresseeB:word

c2 <50: c3
c1 < c3
speakerA .lex. speakerB
addresseeA .lex. addresseeB

And that query gives me 1 result:

image image

dirkroorda commented 4 years ago

So, Oliver, I think the things you spotted are not bugs in TF after all.

But they are excellent examples of how writing queries requires quite a bit of teaching in order to avoid these pitfalls.

oliverglanz commented 4 years ago

Dirk, that's what it was! A too narrow top container (verse). My bad! Sorry to have wasted your time on this one.

But to clarify the matter more:

  1. If c1 <10: c2 could mean that c2 stands 10 monads before c1 (c2 could precede c1), then I always HAVE to ADD c1 < c2 if I only want the option of c2 FOLLOWING c1 within a range of 10 monads. Right?
  2. Does TF allow for defining distances between elements in the form of repetitions? For example in MQL I can define the distance between c2: clause domain=Q and c3:clause domain=N by expressing:
    [clause domain="Q"]*{1-5}
    [clause domain="N"]

    This finds all cases in which the first clause (domain="Q") is repeated up to 5 times before the second clause (domain="N") appears. In TF it seems that this option is not available. Relations between elements can only be defined by a range of monads. Is that correct?

dirkroorda commented 4 years ago

Yes, TF does not have the Kleene star operation and its friends.

Yes, you are right, you have to add c1 < c2 to c1 <10: c2 if you want to make sure that c2 comes after c1.
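
A minimal sketch of how the two conditions combine, in plain Python rather than in a TF query. The clause spans are invented, and `before` simplifies TF's `<` to a comparison of first slots, which is an assumption that only holds for non-overlapping clauses.

```python
from typing import List, Tuple

Clause = Tuple[int, int]  # (first_slot, last_slot), a toy stand-in for a TF node


def near_before(c1: Clause, c2: Clause, k: int) -> bool:
    """Toy `c1 <k: c2`: c1 ends right before c2 starts, give or take k slots."""
    return abs(c1[1] - (c2[0] - 1)) <= k


def before(c1: Clause, c2: Clause) -> bool:
    """Toy `c1 < c2`, simplified to comparing the first slots."""
    return c1[0] < c2[0]


c2 = (100, 105)
candidates: List[Clause] = [(90, 95), (106, 108)]

leeway_only = [c1 for c1 in candidates if near_before(c1, c2, 10)]
leeway_and_order = [c1 for c1 in leeway_only if before(c1, c2)]

print(leeway_only)       # [(90, 95), (106, 108)]
print(leeway_and_order)  # [(90, 95)]
```

The clause at slots 106-108 passes the leeway test on its own even though it follows c2; only the extra ordering condition removes it.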

It is tempting for me to change the definition so that the leeway always counts in the direction of the <, but that has disadvantages:

1) What should I do with the operators :k= and =k: ? Probably there the leeway should count in both directions.
2) What if a user wants the leeway in the other direction? I need a new operator for that, or something with a minus: c1 <-10: c2.
3) What if a user wants the leeway in both directions? I need something like c1 <-10,10: c2

With hindsight, these might have been better options. I could try to implement them, but it should be done in a backward compatible way.

Like

c1 <k: c2 means leeway of k in both directions (as before)

And the new ones:

c1 <+k: c2 means leeway of k in forward direction

c1 <-k: c2 means leeway of k in backward direction

c1 :+k> c2 means leeway of k in backward direction (because <: works in the other direction)

c1 :-k> c2 means leeway of k in forward direction (because <: works in the other direction)

c1 <-k+m: c2 means leeway of k in backward direction and leeway of m in forward direction

Likewise for

c1 :-k+m= c2

c1 =-k+m: c2

Could a user build the leeway in both directions out of the unidirectional leeways? No, because there is no OR between relational conditions.

dirkroorda commented 4 years ago

It will be no rocket science to implement this, but I have to be very careful. It affects parsing and semantics of queries. When I find the time, I'll definitely do this, if you think it is useful in this form.

dirkroorda commented 4 years ago

Just FYI: it involves modifying a bunch of functions like this: https://github.com/annotation/text-fabric/blob/47a9e4bcb9ab307d975d52d6e7955f26231f0605/tf/search/relations.py#L545, where the k is the leeway. So instead of passing it a single k, it gets an h and a k: h for backward leeway and k for forward leeway.

-h+k => function(h, k)
+k => function(0, k)
-h => function(h, 0)
k => function(k, k) {this is the old behaviour}
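
The mapping above could be sketched roughly as follows. This is not code from Text-Fabric: `parse_leeway` is a hypothetical name, and the accepted spellings are only those listed in the mapping.

```python
import re
from typing import Tuple


def parse_leeway(spec: str) -> Tuple[int, int]:
    """Turn a leeway spec into (h, k): h = backward leeway, k = forward leeway.

    Hypothetical helper illustrating the proposed mapping only.
    """
    # Plain number: the old behaviour, same leeway in both directions
    if re.fullmatch(r"\d+", spec):
        n = int(spec)
        return (n, n)
    # Signed form: optional -h followed by optional +k, at least one present
    match = re.fullmatch(r"(?:-(\d+))?(?:\+(\d+))?", spec)
    if not spec or match is None:
        raise ValueError(f"unrecognised leeway spec: {spec!r}")
    h, k = match.groups()
    return (int(h or 0), int(k or 0))


for spec in ("-3+5", "+5", "-3", "10"):
    print(spec, "=>", parse_leeway(spec))
# -3+5 => (3, 5)
# +5 => (0, 5)
# -3 => (3, 0)
# 10 => (10, 10)
```

Keeping the unsigned form as a separate branch is what would make such a change backward compatible: existing queries with a bare k would keep their symmetric meaning.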

dirkroorda commented 4 years ago

You see, I'm already anticipating coding it.

oliverglanz commented 4 years ago

Dirk, you keep fascinating us with your listening to the community, seeking to understand their operations, and trying to respond to their needs. For my own processes, I am fine with adding a further relational definition (c1 <20: c2 AND c1 < c2) to get what I want. Rather than refining the coding of relational definitions, I would love to have Kleene Star & Friends implemented. But, like you said elsewhere, the researcher might have to learn some hand-coding instead of demanding too much from TF's search function.

dirkroorda commented 2 years ago

To be honest, I have not gotten around to implementing this. It seems that TF has reached some optimum here between expressive power and coding effort. I'd rather leave it as it is for now.