Expose low level line breaking API

filodej commented 10 years ago

Hi, first of all I'd like to thank you for a really helpful piece of software.

To give you a context - I am implementing a text shaping engine in c++ and I was trying to implement a line breaking iterator (conceptually similar to one implemented in ICU).

When I was trying to utilize your library I faced a problem that it performs the line breaking algorithm all at once - in a single call while my iterator needs to process the input in small chunks (by individual code points).

So this pull request contains my approach expose the low level API (three character level functions) to client:

- lb_init_break_context() 
- lb_handle_first_char() 
- lb_handle_next_char()

... and re-implement the set_linebreaks() function on top of that.

Please, feel free to reject this request if my approach is too intrusive or simply does not fit to your taste. (I'd be by no means offended and would be quite happy to stay on my fork if necessary). Also I am a C++ programmer so my C skills may not be very good, but I tried hard to stick with your coding style.

Kind Regards Filodej

adah1972 commented 10 years ago

Hi Filodej,

I actually like this idea, but your code seems to have some flaws, as it fails the showbreak test: http://vimgadgets.cvs.sourceforge.net/viewvc/vimgadgets/common/tools/showbreak/

Pay attention to the (wrong) output that begins with "(".

Debugging others' code seems to take me more efforts than refactoring from my original code, so I'll leave the fix to you. I suggest you first refactor with Extract Method Object (taking LineBreakContext as an object). You also need to consider how the final functional calls should look like and refactor towards that AFTER the first step of refactoring. I think this is what set_linebreaks should look like finally, more or less:

void set_linebreaks( const void s, size_t len, const char lang, char *brks, get_next_char_t get_next_char) { utf32_t ch; struct LineBreakContext lbc; size_t posCur = 0;

lb_init_context(&lbc, lang, brks);

do
{
    ch = get_next_char(s, len, &posCur);
    if (ch == EOS)
       break;
    lb_process_next_char(&lbc, ch, posCur);
}

lb_finalize(&lbc, len, posCur);

}

(The logic of "first char" seems wrong to me, as each character after the new line should be treated as the first character. I think you need to introduce a boolean variable, if you want to eliminate my goto, as is necessary for the current purpose.)

BTW, pay attention to the coding style too. The current code use spaces and parentheses a little differently from yours. I use "switch (…)", "for (…)", and "func(…)", and never put spaces inside parentheses. There are also a few trailing spaces, but those are pretty easy to fix.

Happy hacking!

Yongwei

On 5 November 2013 22:19, filodej notifications@github.com wrote:

Hi, first of all I'd like to thank you for a really helpful piece of software.

To give you a context - I am implementing a text shaping engine in c++ and I was trying to implement a line breaking iterator (conceptually similar to one implemented in ICU).

When I was trying to utilize your library I faced a problem that it performs the line breaking algorithm all at once - in a single call while my iterator needs to process the input in small chunks (by individual code points).

So this pull request contains my approach expose the low level API (three character level functions) to client:

lb_init_break_context()

lb_handle_first_char()

lb_handle_next_char()

... and re-implement the set_linebreaks() function on top of that.

Please, feel free to reject this request if my approach is too intrusive or simply does not fit to your taste. (I'd be by no means offended and would be quite happy to stay on my fork if necessary). Also I am a C++ programmer so my C skills may not be very good, but I tried hard to stick with your coding style.

Kind Regards Filodej

You can merge this Pull Request by running

git pull https://github.com/filodej/libunibreak filodej/low-level-api

Or view, comment on, or merge it at:

https://github.com/adah1972/libunibreak/pull/2

Commit Summary

Refactor set_linebreaks() function Introduce LineBreakContext structure and further refactor set_linebreaks() function Publish low level linebreaking API usable for incremental linebreaking

File Changes

M src/linebreak.c (245) M src/linebreakdef.h (25)

Patch Links:

https://github.com/adah1972/libunibreak/pull/2.patch https://github.com/adah1972/libunibreak/pull/2.diff

Wu Yongwei URL: http://wyw.dcweb.cn/

filodej commented 10 years ago

Hi Yongwei, Thank you for valuable feedback!

I am really sorry for sending you bogus implementation as was the case last time (I did not know about the test you mentioned and my - higher level - formatter tests seemed to work properly, so I missed it).

Now I have rewritten the code following your advice and run the test successfully. Also I was making smaller commits (and compiled and run the test after each commit) in hope that it will make the code review easier for you.

I have few remarks regarding the current version:

1)

you wrote: lb_init_context(&lbc, lang, brks);

For now I did not include the brks variable to the LineBreakingContext as I feel that it actually does not represent linebreaker's minimal state. In my opinion it rather represents the inserter concept (along with posCur and posLast variables). For example in my usecase, there is no brks array at all. I am just implementing an interface traversing a string character-by-character and returning individual break opportunities to a client. The interface looks similar to this:

    class break_iterator : public ref_counted
    {
    public:
        ...
        /// returns UCS-4 decoded Unicode character (code point)
        virtual utf32_t utf32_char() const = 0;
        /// specification of a line break opportunity between current two Unicode codepoints
        virtual linebreak_t linebreak() const = 0;
        /**
         * @brief moves the iterator to the next unicode character (code point)
         *
         * @return the actual flow offset increment
         */
        virtual flow_offset_t next_char() = 0;
        ...
    };

So my current version of the set_linebreaks() function does not look quite like the one you proposed, but the inserter concept could be definitely factored out separately (As an LineBreakInserter struct containing brks, posCur and posLast would be introduced).

2)

You wrote: The logic of "first char" seems wrong to me, as each character after the new line should be treated as the first character.

The lb_handle_first_char()/lb_handle_next_char() function distinction was in the previous version due to the fact that it is not possible to return a break result for very first character (on the other hand for a character after a new line it is still possible to return LINEBREAK_MUSTBREAK). It would be probably possible to test for this special case in the lb_process_next_char() and return i.e. LINEBREAK_UNDEFINED but I decided to pass the very first character to the lb_init_break_context(). This allows me to initialize all context members to meaningful values.

I understand that the approach you proposed (including brks to the context) has no problem with this (it just inserts a break when it is known), but in my opinion it has drawbacks described above.

I hope you'll like the new version better than the previous one.

Kind regards, Filodej

adah1972 commented 10 years ago

Very nice approach. Thanks!

adah1972 / libunibreak

Expose low level line breaking API #2