composewell / unicode-data

Access unicode character database
Apache License 2.0

Release a new version of `unicode-data` #123

Open adithyaov opened 4 months ago

adithyaov commented 4 months ago

unicode-data-0.4.0.1's test cases seem to break with newer GHCs, i.e. newer base versions (see https://github.com/composewell/unicode-data/issues/118). I can confirm that this is the case for GHC 9.10 and 9.8. The CI for the latest master is passing, though, so the problem seems to have been fixed there.

The version bounds for base on Hackage are incorrect: with base-4.20, the above-mentioned tests fail.

Can we release a newer version of unicode-data with the fix included? We can then update the dependent packages accordingly. Should we re-revise the version bounds on Hackage?

wismill commented 4 months ago

I am working on further improvements, but if you are in a hurry you can release a minor version.

adithyaov commented 4 months ago

I'll make a minor release for the time being. What do you suggest we do about the incorrect version bounds on Hackage for v0.4.0.1? Should we re-revise the version bounds or deprecate the version?

wismill commented 4 months ago

About the tests: they probably fail because you are comparing against base, which has a different Unicode version. I fixed these tests so that they pass when a character is unassigned or has changed General Category. They display a warning in such cases.
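A tolerant comparison of this kind can be sketched roughly as follows. This is not the actual unicode-data test suite: `Verdict` and `compareCategory` are hypothetical names used for illustration, and the two lookup functions are plain arguments. Only `GeneralCategory`, `NotAssigned`, and `generalCategory` come from base's `Data.Char`.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Outcome of comparing one character across two Unicode tables.
data Verdict
  = Same          -- ^ Both tables agree.
  | Warn String   -- ^ They differ, plausibly because of a Unicode version gap.
  deriving (Eq, Show)

-- | Compare the general category reported by two lookup functions,
-- for example one from base and one from another library.
-- Hypothetical helper: every disagreement becomes a warning instead of
-- a test failure, since it may stem from differing Unicode versions.
compareCategory
  :: (Char -> GeneralCategory)  -- ^ reference lookup
  -> (Char -> GeneralCategory)  -- ^ lookup under test
  -> Char
  -> Verdict
compareCategory ref lib c
  | r == l = Same
  | r == NotAssigned || l == NotAssigned =
      Warn (show c ++ ": assigned in only one table")
  | otherwise =
      Warn (show c ++ ": general category changed, " ++ show r ++ " vs " ++ show l)
  where
    r = ref c
    l = lib c

-- | Illustrative use: comparing base's table with itself always agrees.
selfCheck :: Verdict
selfCheck = compareCategory generalCategory generalCategory 'a'
```

In a real test suite the second argument would be the other library's lookup function, and the `Warn` messages would be printed rather than treated as failures.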

If you re-generate using ucd2haskell and bump Unicode to 15.1, the latest release, the tests should pass with base-4.20. So the release is not broken per se, only the test suite.

I am improving the lib before bumping to Unicode 15.1. Notably, I would like to reduce the Addr# blobs and to check the inlining pragmas.

adithyaov commented 4 months ago

> If you re-generate using ucd2haskell and bump Unicode to 15.1, the latest release, the tests should pass with base-4.20. So the release is not broken per se, only the test suite.

Gotcha, I'll make a minor release then. Should I deprecate unicode-data-0.4.0.1? The version bounds are too lax and might result in undefined behaviour if anyone uses Unicode primitives from both base and unicode-data simultaneously.
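To make the concern concrete, here is a minimal sketch of a sanity check that compares the Unicode version of base with the version another library was generated from. `GHC.Unicode.unicodeVersion` is exported by recent versions of base. `sameUnicodeVersion`, `report`, and `exampleReport` are hypothetical helpers, and the second version is passed in as a plain argument because the sketch does not assume any particular unicode-data API.

```haskell
import Data.Version (Version, makeVersion, showVersion)
import GHC.Unicode (unicodeVersion)

-- | True when the given Unicode version equals the one base was built with.
-- Note that the comparison is exact: [15, 1] and [15, 1, 0] are different.
sameUnicodeVersion :: Version -> Bool
sameUnicodeVersion other = other == unicodeVersion

-- | Human-readable summary of the comparison (hypothetical helper).
report :: Version -> String
report other
  | sameUnicodeVersion other = "in sync: Unicode " ++ showVersion other
  | otherwise =
      "mismatch: base uses Unicode " ++ showVersion unicodeVersion
        ++ ", other library uses Unicode " ++ showVersion other

-- | Illustrative call; the version number here is an assumption,
-- standing in for whatever the other library declares.
exampleReport :: String
exampleReport = report (makeVersion [15, 1, 0])
```

Whether `exampleReport` says "in sync" or "mismatch" depends on the GHC used to build it, which is exactly the situation the version bounds are meant to control.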

wismill commented 4 months ago

> So it makes sense to keep unicode-data completely in sync with base. We can possibly make the version bounds for the base dependency restrictive.

I am leaning towards this too, because this may trigger much trickier bugs in workflows. I added tracking of Unicode version in the README, because comments in the code are not very discoverable.

The thing is, text uses case mappings from Unicode 14.0, independently of the version of base. So there is precedent, although this is not a good situation.

Well, the solution would be for everyone to use unicode-data, obviously 😅. Part of unicode-data has been merged into base (now in ghc-internal). Now I am thinking we could move this out of ghc-internal to create unicode-data-core as a new boot/core GHC library. But we should make base depend on it, so that what decides the Unicode version is no longer base directly, but only unicode-data-core. Thus every package using base and unicode-data would share the same Unicode version. If we include complex case mappings, then text should depend on unicode-data-core as well.

That’s a huge change though, and it will have to go through the CLC. But since there are already bits of unicode-data in ghc-internal, and text is out of sync for case mappings, I guess there will be no strong objection.

We already planned to change the versioning scheme to closely follow that of Unicode. So I can see the following happening:

base, on the contrary, should have lax bounds on unicode-data-core. I do not expect the core API to change anytime soon, so something like unicode-data-core >= 15.0.0 may be enough.
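As a rough illustration of what such a lax bound means once package versions track Unicode versions, the check below accepts any version at or above a lower bound. It uses only `Data.Version` from base. `lowerBound` and `satisfiesLaxBound` are hypothetical names, and the sketch models the bound only, not how Cabal actually resolves dependencies.

```haskell
import Data.Version (Version, makeVersion)

-- | Hypothetical lower bound mirroring @unicode-data-core >= 15.0.0@.
lowerBound :: Version
lowerBound = makeVersion [15, 0, 0]

-- | A lax bound: any version at or above the lower bound is accepted,
-- with no upper limit.
satisfiesLaxBound :: Version -> Bool
satisfiesLaxBound v = v >= lowerBound
```

Under this scheme a hypothetical 15.1.0 or 16.0.0 release would satisfy the bound, while 14.0.0 would not.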

Finally, if we go down that road, unicode-data-core cannot depend on base anymore.


Will probably have to open a dedicated issue for this, sorry for the wall of text 😅

wismill commented 4 months ago

> Should I deprecate unicode-data-0.4.0.1? The version bounds are too lax and might result in undefined behaviour if anyone uses Unicode primitives from both base and unicode-data simultaneously.

I would just fix the version bounds for base. I am just restarting development of this lib after a long pause, so I am not sure it is in a state for a release. I mean, if you must, do it, but I am not satisfied with some changes I made a year ago.

adithyaov commented 4 months ago

@wismill Looks like I somehow managed to delete a comment I made.

Re-writing the essence of the comment for context:

The Unicode versions of base and unicode-data should be in sync, as using both packages at once might otherwise result in unexpected behaviour. The end user does not care about the Unicode version and would use primitives from both unicode-data and base.

Looks like there is already a lot of thought put into keeping packages in sync. Once we decide on how we want to do this, you can possibly offload some tasks to me.

> I would just fix the version bounds for base. I am just restarting development of this lib after a long pause, so I am not sure it is in a state for a release. I mean, if you must, do it, but I am not satisfied with some changes I made a year ago.

I will fix the version bounds for base in 0.4.0.1 and make a minor release, 0.4.0.2, branching off 0.4.0.1 and updating the Unicode version. The minor release is required for the time being, as we need to get streamly working with GHC > 9.4.

Again, thank you for the amazing work!

wismill commented 4 months ago

> updating the Unicode version

@adithyaov this is a breaking change. You should bump to 0.5 then.

Bodigrim commented 4 months ago

To unblock downstream developments I made a revision: https://hackage.haskell.org/package/unicode-data-0.4.0.1/revisions/

wismill commented 4 months ago

Released with Unicode 15.1:

wismill commented 4 months ago

I opened an issue with the CLC about a new core library, unicode-data-core.