adah1972 / libunibreak

The libunibreak library
zlib License
173 stars 38 forks source link

Update WordBreak algorithm to Unicode 7 #9

Closed tasn closed 9 years ago

tasn commented 9 years ago

There are a few additions that need to be taken care of. Wordbreak code doesn't work with Unicode 7 at the moment.

adah1972 commented 9 years ago

Hi Tom, you defined WBP_Hebrew in wordbreakdef.h. Should WBP_Hebrew_Letter replace that enumerator? I think it might also be a good idea to put WBP_Any always at the end.

I updated ChangeLog, but I did not include your changes to wordbreak.h. I will have a separate entry to update all the file headers.

tasn commented 9 years ago

Thanks for the suggestion. Yes, it's the same. You are right. I had the same mistake with single quote and double quote. Also fixed.

At first I didn't want to reorder because the header was shipped, but now that I don't ship the header anymore, it's not part of our API (never intended to be), and I can just change things.

Btw, I don't remember if I've made this comment already, but linebreakdef.h also leaks a lot of things that should be internal and we should probably split that one to a private one (linebreakdef.h) and a public one (unibreakdef.h).

adah1972 commented 9 years ago

Good idea. Which things you think should be moved to unibreakdef.h?

I think maybe the following:

EOS get_next_char_t lb_get_next_char_utf8 lb_get_next_char_utf16 lb_get_next_char_utf32

I also think we should replace LINEBREAK_VERSION/linebreak_version with UNI…/uni…, and move those stuff to unibreakdef.h too. And also utf…_t.

What do you think? I can make the changes once we agree on the items to move.

(I have made some other changes to the repository. Please have a look to see whether there are issues.)

tasn commented 9 years ago

Well, it's actually more complicated than that. I was talking about private (not installed) and public (installed), though there's another facet, which is shared within linebreak and wordbreak vs not.

As for public facing stuff: linebreakdef.h should not be installed. I just looked at it, there's nothing public about it. We need to remove it from the installed headers in Makefile.am and add it to EXTRA_DIST. I hope I didn't forget to do it for wordbreakdef.h too, can't check atm.

After that, we need unibreakdef.h, which is also not installed that will include those you mentioned.

That should be it for now.

adah1972 commented 9 years ago

Hi Tom, I know what you meant. Maybe UNIBREAK_VERSION is not a good example, as it should be public. Now I think it should not be in unibreakdef.h.

Even though normal users do not use *def.h, power users still may. Publishing them does no harm, but hiding them would at least make the creator of lb_init_break_context and lb_process_next_char unhappy.

tasn commented 9 years ago

Maybe lb_process_next_char() should be public. I dunno.

All I know is that the things in wordbreakdef.h are of no use even to power users, as the info there means nothing and doesn't provide any extra internal access. The whole thing it does is that it forces us to maintain that as API (if it's public...) which I'd like not to need to care about.

Luckily, because those are of no consequences, I could reorder and modify the enum as I pleased, but this level of freedom is not possible with public API.

adah1972 commented 9 years ago

If you do not want to make wordbreakdef.h public, I think that is fine. I do not think I want to put EOS and LineBreakContext in the linebreak.h, due to the name pollution possibility. I think it is OK for people to manually include that file and use it.

Even in the general case, we do not need to keep enumerators the same. When they change, we only need to change -version-info to ensure people cannot use the new dynamic libraries without a rebuild.

I will make some changes later, probably in the weekend.

tasn commented 9 years ago

Yes, we could change -version-info, but that sucks. Why should we force our users to rebuild everything if we don't have to? Furthermore, because it's unused, it won't really affect anyone (it's actually not a problem when it's not used to communicate with code inside the lib).

It all comes down to this: it's not useful for anyone, and it's possibly harmful.

adah1972 commented 9 years ago

If any function exposed (whether we have the header file or not) has any ABI change, we need to change version-info. On the contrary, if no one really uses your header file or using the header file does not affect the ABI, we do not need to change version-info. It is kind of independent of whether we distribute the header file or not. :-)

tasn commented 9 years ago

As I said in my last comment, it's not actually an ABI issue because it's not useful, so people can't use it in any ABI meaningful way. It's still an API promise even if we force people to recompile, as people may depend on the order. We are putting limitations on ourselves that we really shouldn't be putting. That's the issue of distributing internal implementation details.

As for EOS: it's not namespaced with our lib name so it can clash with other similar definitions in people's projects.

adah1972 commented 9 years ago

People should not depend on the enumerator order, except maybe the first and the last. That is why I think WBP_ANY should be the last. If the last enumerator looks special (like "ANY", "LAST", etc.), people may expect it to be always the last.

I am fully aware that EOS is not namespaced, and that is why it is not defined in linebreak.h. Prefixing is ugly. Even if I had made it LB_EOS, I would have had to rename it to UB_EOS today. And who knows whether somebody may happen to think UB is a good prefix for his project!

tasn commented 9 years ago

"Except maybe" is my point exactly. It's a source for errors, as all "maybe"s and "except"s are. When it comes to API/ABI you should always assume the worse, always assume there's a user that will use it in ways you don't expect/intend. Surely, sometimes you just have to say "well, you shouldn't have done it" and let the user bear the pain. I'm trying to limit the scope of that. As said before, the thing that stands out here the most, is the uselessness of the APIs in question. So I don't see why go through such a risk when there's no benefit.

ub/lb are bad namespaces (because they are short), though are still better than none. The better prefix would have been unibreak, or at least unib. Anyhow, even if someone ever namespaces with ub, the chance of a clash is miniscule because of the naming of the symbols themselves. As for the renaming comment, that's nonsense... Libraries almost never get renamed. Also, to be honest, we should have handled the rename a bit better. We should have deprecated all of the old APIs, replace them by macro wrappers around the new naming (so we don't expose any of the old symbols) and have a more consistent namespace all across the lib.

adah1972 commented 9 years ago

Names are for humans, not for machines. I want the name to be easy to use, instead of being confusing. If LB no longer makes sense, I will change it. Anyway, let me have a try first in the weekend.

adah1972 commented 9 years ago

BTW, do you mind me making wordbreakdata.c not be included? I think it is a little confusing, and I actually wrongly moved wordbreakdata.c from EXTRA_DIST to libunibreak_la_SOURCES today…. I think it would make better code separation, and also would accelerate compilation a little, though the latter is trivial on today's hardware.

adah1972 commented 9 years ago

I have made many clean-ups. Maybe still not satisfactory to you, but I think I have made things better at least.

tasn commented 9 years ago

While names are also for humans, they are also for machines. Names need to be unique for machines, so it's a balance. While it needs to be easy to use, it also has to not clash.

wordbreakdata.c: the reason why we ship that (I think we do the same with lineberakdata.c), is that the package can be compiled without an internet connection, and even if the latest wordbreak definitions are not yet supported by us. We can fix the latter by changing the link to a specific UAX data and not the latest like we do at the moment, but not the former.

adah1972 commented 9 years ago

Yes, it is a balance. It also depends on the likeliness of the clash :-). I won't hesitate to uglify the names, if there is some real/upcoming need.

Sorry if I was unclear about wordbreakdata.c. My use of "include" meant "#include" inside wordbreak.c. It is definitely not about leaving the file out.

tasn commented 9 years ago

Ah, I like it this way, there's no need to compile it into a separate object, however I don't care, do what you wish.

adah1972 commented 9 years ago

Actually, your reason should be that you used the length of wb_prop_default in wordbreak.c. We used different search algorithms in wordbreak.c and linebreak.c. That is not worth changing, I suppose.

tasn commented 9 years ago

OK, that's another reason then. I didn't even remember that one. :)