computerline1z / okapi

Automatically exported from code.google.com/p/okapi

TextFragment.getCodedText() problem #321

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
While testing my application with JUnit, I have a unit test that manipulates a 
TextFragment. [The goal is to create a tree from a TextFragment, manipulate it, 
and regenerate a new TextFragment from this tree.]

I found a problem when creating a deeply nested tree of elements.
I want to create a TextFragment corresponding to this markup:

   <b>
      <b>
         <b>
            <b>
               <b>
                  <b>
                     <b>
                        <b>
                           <b>
                              <b>
                                 Content
                              </b>
                           </b>
                        </b>
                     </b>
                  </b>
               </b>
            </b>
         </b>
      </b>
   </b>

To create it, I create a new TextFragment, append 10 OPENING codes 
"<b>", append the text "Content", and then append 10 CLOSING codes "</b>".

With the Eclipse debugger, I can inspect the TextFragment structure, and every 
code is as expected (cf. attachment "sc1.png"): 10 opening codes and 10 closing 
codes.

Now I want to use the coded-text string.
I call the method TextFragment.getCodedText() and inspect the resulting String.

Since each code is encoded as two characters in this String, I should get 20 
characters for the 10 opening tags, then the text "Content", then 20 characters 
for the closing tags. In each pair, the first char is the marker giving the 
code type.
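The layout described above can be sketched without Okapi on the classpath. This is only an illustration under stated assumptions: the opening and isolated marker values (57601 and 57603) are the ones JUnit reports for this issue, while the closing value 0xE102 and the index base 0xE110 are assumed, and `CodedTextScan`, `markerNames`, and `nested` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class CodedTextScan {
    // OPENING and ISOLATED match the values seen in this report;
    // CLOSING and INDEX_BASE are assumptions made for the sketch.
    static final char OPENING = 0xE101;    // 57601
    static final char CLOSING = 0xE102;    // assumed
    static final char ISOLATED = 0xE103;   // 57603
    static final char INDEX_BASE = 0xE110; // assumed

    /** Returns the marker type of every code found in the string, in order. */
    static List<String> markerNames(String codedText) {
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < codedText.length()) {
            char c = codedText.charAt(i);
            if (c == OPENING || c == CLOSING || c == ISOLATED) {
                names.add(c == OPENING ? "OPENING"
                        : c == CLOSING ? "CLOSING" : "ISOLATED");
                i += 2; // skip the marker char and the index char after it
            } else {
                i++;    // ordinary text character
            }
        }
        return names;
    }

    /** Builds a coded string by hand: n opening codes, the text, n closing codes. */
    static String nested(int n, String text) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n; k++) {
            sb.append(OPENING).append((char) (INDEX_BASE + k));
        }
        sb.append(text);
        for (int k = 0; k < n; k++) {
            sb.append(CLOSING).append((char) (INDEX_BASE + n + k));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ct = nested(10, "Content");
        System.out.println(ct.length());     // 20 + 7 + 20 characters
        System.out.println(markerNames(ct)); // marker type of each code
    }
}
```

With this layout, code number k (counting from 0) has its marker at char position 2*k as long as no text precedes it, which is why char 10 corresponds to code 5.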

With the Eclipse debugger, I check the TYPE character of each opening code, 
i.e. the chars at 0, 2, 4, 6...18. Their values are in the attachment "sc2.png": 
all are OPENING markers, except char 10 (i.e. code 5 in "sc1.png"), which is an 
ISOLATED marker here.

According to 
http://okapi.opentag.com/devguide/gettingstarted.html#readingDocument, an 
opening tag without a closing one is an isolated tag. Here, code 5 has a 
closing code, but it is treated as an isolated one.

Among the closing tags, the last one is also treated as isolated, while the 
others are CLOSING tags.

NB: I use version 0.19 of the Okapi Framework.

Original issue reported on code.google.com by aurelien...@gmail.com on 28 Mar 2013 at 12:54

Attachments:

GoogleCodeExporter commented 9 years ago
I tested nesting 11 tags and everything seems to work fine:

TextFragment tf = new TextFragment();
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append(TagType.OPENING, "b", "");
tf.append("Content");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
tf.append(TagType.CLOSING, "b", "");
assertEquals("Content", tf.toText());
assertEquals("<1><2><3><4><5><6><7><8><9><10><11>Content</11></10></9></8></7></6></5></4></3></2></1>", fmt.setContent(tf).toString());

What you see may be the result of how the tags were added: they must have 
matching types ("b" in the example above).
Maybe there is a typo in the test code?

If you can't find what is wrong, please provide the unit test so we can 
debug it.
Thanks,
-ys

Original comment by yves.sav...@gmail.com on 28 Mar 2013 at 3:41

GoogleCodeExporter commented 9 years ago
In fact, the problem is not the structure of the TextFragment, but the result 
of the method "getCodedText()".

When calling this method, each code is encoded as two characters: the first 
gives the type of the code ("opening code", "closing code" or "isolated code"), 
and the second is an ID-like value.

If you display the Unicode value of the 5th code in the generated String, it 
says the code is "isolated", while it is not. In a JUnit test, you can just add 
these assertions:

assertEquals(57601, (int) tf.getCodedText().charAt(0)); /* 57601 is the Unicode value of the opening marker */
assertEquals(57601, (int) tf.getCodedText().charAt(2));
assertEquals(57601, (int) tf.getCodedText().charAt(4));
...
assertEquals(57601, (int) tf.getCodedText().charAt(8));
assertEquals(57601, (int) tf.getCodedText().charAt(10)); /* Fails here: the value found is 57603, the isolated marker */
assertEquals(57601, (int) tf.getCodedText().charAt(12));
...

Original comment by aurelien...@gmail.com on 28 Mar 2013 at 4:05

GoogleCodeExporter commented 9 years ago
The toString() call from fmt (a GenericContent object) in the test uses 
getCodedText(). So if there were a placeholder instead of an opening code, we 
would see it.

In any case, if I add this to the test:

String ct = tf.getCodedText();
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(0));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(2));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(4));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(6));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(8));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(10));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(12));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(14));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(16));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(18));
assertEquals(TextFragment.MARKER_OPENING, ct.charAt(20));
// Content goes here
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(29));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(31));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(33));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(35));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(37));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(39));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(41));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(43));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(45));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(47));
assertEquals(TextFragment.MARKER_CLOSING, ct.charAt(49));

It passes.
I'm guessing something in your code causes one of the codes to be seen as a 
placeholder before you do the asserts.
It could be many things. One would need the full code to see what is wrong.

cheers,
-yves

Original comment by yves.sav...@gmail.com on 28 Mar 2013 at 4:57

GoogleCodeExporter commented 9 years ago
OK, I tried your example and it works correctly.
But if I construct the TextFragment the way my tree algorithm does, the 
construction can be written like this:

        TextFragment tf = new TextFragment();
        tf.append("Content"); // the deepest child
        for (int i=0; i<10; i++){
            TextFragment tf2 = new TextFragment();
            tf2.append(TagType.OPENING, "b", "");
            tf2.append(tf);
            tf2.append(TagType.CLOSING, "b", "");
            tf = tf2;
        }

There, JUnit displays:
Failed tests:   okapidDegub: expected:<57601> but was:<57603>

Original comment by aurelien...@gmail.com on 29 Mar 2013 at 8:32

GoogleCodeExporter commented 9 years ago
Thanks for the code. That explains a lot.
The tf2.append(tf) call triggers the insertion of one TF into another. That 
operation may cause the re-balancing of the codes: that is when the IDs of the 
closing codes are matched with their opening counterparts. And, as you can see, 
that happens when there is one more opening code than closing codes.
There is probably some side effect occurring then that prevents the proper 
matching. The documentation for TF.append(TagType, String, String) should 
probably mention that the auto-pairing of closing/opening codes needs to be 
done before any re-balancing is done.
Maybe there are ways to fix this. I'll try to look at it in the coming days.
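To make "re-balancing" concrete, here is a minimal stand-alone sketch that pairs each closing code with the nearest unmatched opening code of the same type. It is not Okapi's implementation: `BalanceSketch`, `Code`, `Kind`, `balance`, and `nested` are hypothetical names, and the rule that unmatched codes become isolated is an assumption based on the developer guide's description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class BalanceSketch {
    enum Kind { OPENING, CLOSING, ISOLATED }

    static final class Code {
        Kind kind;
        final String type;
        int id = -1; // -1 means "no ID assigned yet"
        Code(Kind kind, String type) { this.kind = kind; this.type = type; }
    }

    /** Gives each closing code the ID of the nearest unmatched opening code of the same type. */
    static void balance(List<Code> codes) {
        Deque<Code> open = new ArrayDeque<>(); // most recent opening code first
        int nextId = 1;
        for (Code c : codes) {
            if (c.kind == Kind.OPENING) {
                c.id = nextId++;
                open.push(c);
            } else if (c.kind == Kind.CLOSING) {
                Code match = null;
                for (Code o : open) { // iterates from most recent to oldest
                    if (o.type.equals(c.type)) { match = o; break; }
                }
                if (match != null) {
                    open.remove(match);
                    c.id = match.id;
                } else {
                    c.kind = Kind.ISOLATED; // closing code with no partner (assumed rule)
                    c.id = nextId++;
                }
            } else {
                c.id = nextId++;
            }
        }
        for (Code o : open) {
            o.kind = Kind.ISOLATED; // opening code with no partner (assumed rule)
        }
    }

    /** Builds n opening codes followed by n closing codes, all of the same type. */
    static List<Code> nested(int n, String type) {
        List<Code> codes = new ArrayList<>();
        for (int k = 0; k < n; k++) codes.add(new Code(Kind.OPENING, type));
        for (int k = 0; k < n; k++) codes.add(new Code(Kind.CLOSING, type));
        return codes;
    }

    public static void main(String[] args) {
        List<Code> codes = nested(10, "b");
        balance(codes);
        for (Code c : codes) System.out.println(c.kind + " " + c.id);
    }
}
```

In this sketch all codes are balanced in a single pass. The situation in this issue differs in that the fragments are balanced separately and then merged, so IDs assigned in one fragment can collide with IDs in the other.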
-ys

Original comment by yves.sav...@gmail.com on 29 Mar 2013 at 12:15

GoogleCodeExporter commented 9 years ago
This is a tricky situation. We could change the TF.insert() code so that the 
closing markers are set to -1 and re-balanced when doing the insert. But that 
would break the code in other places.
The issue here is that each TF has its own set of IDs, so when we append or 
insert two TFs with codes we have to somehow find a way to adjust the IDs if 
they overlap.
In your test code that happens when the inserted code equals the number of 
pairs divided by 2, plus 1.
There are several ways to work around the problem.

One is to force the IDs:

TextFragment tf = new TextFragment();
tf.append("Content"); // the deepest child
for ( int i=0; i<10; i++ ) {
    TextFragment tf2 = new TextFragment();
    tf2.append(TagType.OPENING, "b", "", 10-i);
    tf.insert(0, tf2);
    tf2 = new TextFragment();
    tf2.append(TagType.CLOSING, "b", "", 10-i);
    tf.insert(-1, tf2);
}
assertEquals("<1><2><3><4><5><6><7><8><9><10>Content</10></9></8></7></6></5></4></3></2></1>", fmt.setContent(tf).toString());

The other one is to assign a unique 'type' to each pair of codes:

TextFragment tf = new TextFragment();
tf.append("Content"); // the deepest child
for ( int i=0; i<10; i++ ) {
    TextFragment tf2 = new TextFragment();
    tf2.append(TagType.OPENING, "b"+i, "");
    tf2.append(tf);
    tf2.append(TagType.CLOSING, "b"+i, "");
    tf = tf2;
}
assertEquals("<1><2><3><4><5><6><7><8><9><10>Content</10></9></8></7></6></5></4></3></2></1>", fmt.setContent(tf).toString());

For now I don't think we can change TF.insert() to allow your code to work, 
because it would break several filters. But we'll try to see if we can improve 
this.

-ys

Original comment by yves.sav...@gmail.com on 29 Mar 2013 at 2:35

GoogleCodeExporter commented 9 years ago
One more thing: there was also a bug (the balancing was incorrectly reset when 
it should not have been).
I don't think it affected your example, but it was nice to catch.
-ys

Original comment by yves.sav...@gmail.com on 29 Mar 2013 at 2:36

GoogleCodeExporter commented 9 years ago
Thanks a lot for the investigation!
In my case I think it will be easier to use a unique 'type'. Thanks for 
the tips and for your work,

cheers,
Aurelien

Original comment by aurelien...@gmail.com on 29 Mar 2013 at 2:52

GoogleCodeExporter commented 9 years ago
I'm keeping this issue open (with a lower priority), 
as it would be nice to allow the following to work:

TextFragment tf = new TextFragment();
tf.append("Content"); // the deepest child
for ( int i=0; i<10; i++ ) {
    TextFragment tf2 = new TextFragment();
    tf2.append(TagType.OPENING, "b", "");
    tf2.append(tf);
    tf2.append(TagType.CLOSING, "b", "");
    tf = tf2;
}

Original comment by yves.sav...@gmail.com on 15 Apr 2013 at 12:22