Pull in updated version of SANY to support Unicode specs

ahelwer commented 1 month ago

Now that SANY supports parsing Unicode TLA+ specs, it should be fairly straightforward to support Unicode here by updating to a newer version of SANY. Possible bugs include if your translation code looks at actual string representation (for example, "\\E" instead of TLAplusParserConstants.EXISTS) which was the case in SANY itself.

konnov commented 1 month ago

Possible bugs include if your translation code looks at actual string representation (for example, "\\E" instead of TLAplusParserConstants.EXISTS) which was the case in SANY itself.

Oh, we have a bunch of those :)

https://github.com/apalache-mc/apalache/blob/34bdc615a26be8be71c6e9fab6c80ef2613aa456/tla-io/src/main/scala/at/forsyte/apalache/tla/imp/OpApplTranslator.scala#L627-L667

konnov commented 1 month ago

instead of TLAplusParserConstants.EXISTS

Do you know how stable this table of constants going to be? It's autogenerated. I assume that the names will not change most likely?

konnov commented 1 month ago

@ahelwer do I understand it right that the unicode operators have their own names such as op_infix_neq_uc? I wonder where to get the actual mapping from without guessing.

ahelwer commented 1 month ago

Do you know how stable this table of constants going to be? It's autogenerated. I assume that the names will not change most likely?

It has been stable for about 10 years as far as I know, no plans to change it in the future.

do I understand it right that the unicode operators have their own names such as op_infix_neq_uc

No, they do not. They are all wrapped up into the same TLAplusParserConstants token IDs.

ahelwer commented 1 month ago

If you want a quick shortcut you could also just add all the different text-based unicode variants, defined here: https://github.com/tlaplus/tlaplus-standard/blob/main/unicode/tla-unicode.csv

Although this might make it more work if any of these mappings are ever changed.

konnov commented 1 month ago

I find TLAplusParserConstants quite confusing. It contains an array called tokenImage:

https://github.com/tlaplus/tlaplus/blob/f021674983390022709009474d39021fe16e0ee1/tlatools/org.lamport.tlatools/src/tla2sany/parser/TLAplusParserConstants.java#L300

I wrote a simple Python script to do lookups in this table (see below). Surprisingly, the table contains codes both for /= and #. I guess, one of them is getting normalized somewhere in SANY?

The Python script find-token.py:

#!/usr/bin/env python

import sys

pattern = sys.argv[1]

all_tokens = ["<EOF>", "<BEGIN_MODULE>", "\"--->\"", "<_BM1>", "<token of kind 4>", "<CASE0>", "<CASE1>", "<CASE1b>", "<CASE1c>", "<CASE2>", "<CASE2b>", "<CASE2c>", "<CASE3>", "<CASE6>", "<CASE6b>", "<CASE6c>", "<CASEN>", "<LETTER>", "<DIGIT>", "<NUMBER>", "<_BM2>", "<token of kind 21>", "\" \"", "\"\\t\"", "\"\\n\"", "\"\\r\"", "<token of kind 26>", "\"\\\\*\"", "<token of kind 28>", "\"*)\"", "\"*)\"", "<token of kind 31>", "<token of kind 32>", "<token of kind 33>", "<_BM0>", "<SEPARATOR>", "<END_MODULE>", "<ACTION>", "\"ASSUME\"", "\"[]ASSUME\"", "<ASSUMPTION>", "\"CASE\"", "\"CHOOSE\"", "<CONSTANT>", "\"ELSE\"", "\"EXCEPT\"", "<EXISTS>", "\"EXTENDS\"", "<FORALL>", "\"IF\"", "\"INSTANCE\"", "\"LET\"", "\"IN\"", "\"LOCAL\"", "\"MODULE\"", "\"NEW\"", "\"OTHER\"", "<PROPOSITION>", "\"SF_\"", "\"\\\\EE\"", "\"\\\\AA\"", "\"THEN\"", "\"BY\"", "\"ONLY\"", "\"DEFINE\"", "<DF>", "\"THEOREM\"", "\"USE\"", "\"HIDE\"", "\"HAVE\"", "\"OBVIOUS\"", "\"OMITTED\"", "\"LAMBDA\"", "\"TAKE\"", "\"PROOF\"", "\"PROVE\"", "\"[]PROVE\"", "\"QED\"", "\"RECURSIVE\"", "\"STATE\"", "<TEMPORAL>", "\"PICK\"", "\"WITNESS\"", "\"SUFFICES\"", "<VARIABLE>", "\"WF_\"", "\"WITH\"", "\",\"", "\":\"", "\"::\"", "\".\"", "\"_\"", "\"==\"", "\"(\"", "\")\"", "\"-|-\"", "\"[\"", "\"]_\"", "\"]\"", "\"{|\"", "\"|}\"", "\"{\"", "\"}\"", "\"<<\"", "\">>_\"", "\">>\"", "\"!\"", "\"->\"", "\"<-\"", "\"|->\"", "<NUMBER_LITERAL>", "<STRING_LITERAL>", "<BAND>", "<BOR>", "\"\\'\"", "\"^+\"", "\"^*\"", "\"^#\"", "\"-.\"", "\"\\\\lnot\"", "\"\\\\neg\"", "\"~\"", "\"[]\"", "\"<>\"", "\"ENABLED\"", "\"UNCHANGED\"", "\"SUBSET\"", "\"UNION\"", "\"DOMAIN\"", "\"//\"", "\"/\\\\\"", "\"/=\"", "\"/\"", "\"\\\\/\"", "\"\\\\approx\"", "\"\\\\asymp\"", "\"\\\\bigcirc\"", "\"\\\\bullet\"", "\"\\\\cap\"", "\"\\\\cdot\"", "\"\\\\circ\"", "\"\\\\cong\"", "\"\\\\cup\"", "\"\\\\div\"", "\"\\\\doteq\"", "\"\\\\equiv\"", "\"\\\\geq\"", "\"\\\\gg\"", "\"\\\\in\"", "\"\\\\intersect\"", "\"\\\\union\"", "\"\\\\land\"", "\"\\\\leq\"", "\"\\\\ll\"", "\"\\\\lor\"", "\"\\\\o\"", "\"\\\\odot\"", "\"\\\\ominus\"", "\"\\\\oplus\"", "\"\\\\oslash\"", "\"\\\\otimes\"", "\"\\\\prec\"", "\"\\\\preceq\"", "\"\\\\propto\"", "\"\\\\sim\"", "\"\\\\simeq\"", "\"\\\\sqcap\"", "\"\\\\sqcup\"", "\"\\\\sqsubset\"", "\"\\\\sqsupset\"", "\"\\\\sqsubseteq\"", "\"\\\\sqsupseteq\"", "\"\\\\star\"", "\"\\\\subset\"", "\"\\\\subseteq\"", "\"\\\\succ\"", "\"\\\\succeq\"", "\"\\\\supset\"", "\"\\\\supseteq\"", "\"\\\\uplus\"", "\"\\\\wr\"", "\"\\\\\"", "\"~>\"", "\"=>\"", "\"=<\"", "\"=|\"", "\"=\"", "\"##\"", "\"#\"", "\"^^\"", "\"^\"", "\"--\"", "\"-|\"", "\"-+->\"", "\"-\"", "\"**\"", "\"*\"", "\"++\"", "\"+\"", "\"<=>\"", "\"<:\"", "\"<=\"", "\"<\"", "\">=\"", "\">\"", "\"...\"", "\"..\"", "\"||\"", "\"|\"", "\"|-\"", "\"|=\"", "\"&&\"", "\"&\"", "\"$$\"", "\"$\"", "\"??\"", "\"%%\"", "\"%\"", "\"@@\"", "\"!!\"", "\":>\"", "\":=\"", "\"::=\"", "\"(+)\"", "\"(-)\"", "\"(.)\"", "\"(/)\"", "\"(\\\\X)\"", "\"\\\\notin\"", "\"\\\\times\"", "\"\\\\X\"", "<IDENTIFIER>", "<ProofStepLexeme>", "<ProofImplicitStepLexeme>", "<ProofStepDotLexeme>", "<BareLevelLexeme>", "<UnnumberedStepLexeme>"]

for (position, token) in enumerate(all_tokens):
    if pattern in token:
        print("Position %d: token '%s'" % (position, token))

konnov commented 1 month ago

@ahelwer shall we use the version of tla2tools 1.8.0? On GitHub it appears like new commits are being added to the release that was introduced 4 years ago.

konnov commented 1 month ago

If you want a quick shortcut you could also just add all the different text-based unicode variants, defined here: https://github.com/tlaplus/tlaplus-standard/blob/main/unicode/tla-unicode.csv

Although this might make it more work if any of these mappings are ever changed.

This looks like the easiest path! Thanks!

konnov commented 1 month ago

Ugh, the latest package published on oss.sonatype.org goes back to 2020: https://oss.sonatype.org/content/repositories/snapshots/org/lamport/tla2tools/

There is no package on maven. So it seems there is no automatic way for us to consume a newer version of SANY via sbt or mvn. @lemmy any ideas how to proceed?

lemmy commented 1 month ago

I don't have time to resurrect publishing to maven central or Github's equivalent -- PR welcome. Can't you grab the tla2tools.jar from https://github.com/tlaplus/tlaplus/releases/tag/v1.8.0?

ahelwer commented 1 month ago

@konnov does Apalache pull in SANY using the maven package?

konnov commented 1 month ago

@konnov does Apalache pull in SANY using the maven package?

Apalache uses sbt, which by default pulls a package from maven. We are also using sonatype as a secondary repo:

https://github.com/apalache-mc/apalache/blob/169d142ad79c9f3553eabb949d78ec01f2116f6f/build.sbt#L26

I am not sure whether Github provides their own maven-compatible repo.

lemmy commented 1 month ago

https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-apache-maven-registry

ahelwer commented 1 month ago

Tricky thing is that I see maven has the ability to use sha1 as build version (see https://maven.apache.org/maven-ci-friendly.html) which we would need to use to distinguish the continually-updated 1.8.0 pre-release, but I can't find any details on whether various public maven repositories support that feature.

lemmy commented 1 month ago

https://maven.apache.org/guides/getting-started/index.html#what-is-a-snapshot-version

konnov commented 1 month ago

Tricky thing is that I see maven has the ability to use sha1 as build version (see https://maven.apache.org/maven-ci-friendly.html) which we would need to use to distinguish the continually-updated 1.8.0 pre-release, but I can't find any details on whether various public maven repositories support that feature.

I think this practice of updating a release file without increasing the patch number does not work well with open source software engineering practices we see on Github, Linux, etc. You could add the build number or the current date to distinguish between the builds. If you keep overwriting releases, it does not work with maven (since it caches the packages), nix, you name it. Some projects produce nightly builds and then garbage collect them, e.g., Foundry.

ahelwer commented 1 month ago

Well, we're operating in the Java packaging ecosystem here which I'm unfamiliar with - personally I'm a big fan of content-addressed versioning like with nix. I agree there are drawbacks to the SNAPSHOT way of doing things but it does map well to how we are currently just uploading tla2tools.jar builds to the 1.8.0 release and overwriting the current ones. What do you think we should do, @konnov?

I do think this is a problem we should solve because I like the idea of other projects being able to pull in the tla2tools.jar as a dependency. Currently there's Apalache and also tlaplus-formatter by @FedericoPonzi.

lemmy commented 1 month ago

Every tla2tools build is published to Github packages (the 1.7.0 build number is clearly outdate)
Pulling/Installing tla2tools.jar from Github packages with maven, sbt, ... requires authentication with a Github token
tla2tools releases used to be published at OSS sonatype
Sonatype's retention policy causes snapshot releases to be deleted periodically
The TLA+ project is favoring a rolling release model

I understand it's not cool, but the tla2tools.jar file is only 4MB in size (could be trimmed). Since it's a stable dependency for Apalache (with SANY being rarely updated), the simplest solution might be to commit the jar directly to the Apalache Git repository.

konnov commented 1 month ago

Well, we're operating in the Java packaging ecosystem here which I'm unfamiliar with - personally I'm a big fan of content-addressed versioning like with nix. I agree there are drawbacks to the SNAPSHOT way of doing things but it does map well to how we are currently just uploading tla2tools.jar builds to the 1.8.0 release and overwriting the current ones. What do you think we should do, @konnov?

I do think this is a problem we should solve because I like the idea of other projects being able to pull in the tla2tools.jar as a dependency. Currently there's Apalache and also tlaplus-formatter by @FedericoPonzi.

I would also be happy, if there was a simple way to pull tla2tools.jar as a Java dependency. I have not been working with pure Java for a long time, but afaik maven dependencies are still something that all build systems in Java understand. In this case, it would be great, if the TLA+ project could publish these jars. Actually, you don't have to do it every night. For instance, we cut releases on request with GitHub actions, see prepare-release.yml. In this case, it would be something like publishing a maven package on demand (the good thing is that you can keep sensitive data in the GitHub secrets). My main issue with contributing there is that the tlaplus project is too big. The last time I tried, I could not even compile it in the command line.

Alternatively, we could do something like z3-turnkey.

ahelwer commented 1 month ago

This would be even more work but in the future I think it would be nice to split out the SANY and PlusCal parsers into separate packages. They are self-contained with minimal dependencies, and projects consuming the tools as a dependency will probably be using them alone instead of pulling in the rest of the interpreter and model-checking code.

lemmy commented 1 month ago

Painful, but tla2tools.jar will from now on be published at https://oss.sonatype.org/content/repositories/snapshots/org/lamport/tla2tools/1.8.0-SNAPSHOT/

ahelwer commented 1 month ago

Thanks Markus! Wrangling the github CI is never a fun day.

lemmy commented 1 month ago

I recommend using the new snapshot dependency from Sonatype until the version of SANY with Unicode support is fully integrated into Apalache. Once this integration is complete, we can manually tag the TLA+ Git repository and create a permanent release on Sonatype.

konnov commented 1 month ago

Thanks a lot @lemmy!

ahelwer commented 1 month ago

tag the TLA+ Git repository and create a permanent release on Sonatype

Does this mean cutting a final release for 1.8.0 then doing a version bump for snapshot versions? If Unicode is going to be part of that release perhaps we should consider adding ℕ, ℤ, and ℝ to the standard modules. I will create an issue to consider the best way to do that.

lemmy commented 1 month ago

No, I didn't mean to suggest a full-scale tla2tools.jar and Toolbox release. What I had in mind was more like a monthly or quarterly code snapshot (not -SNAPSHOT) that would be permanently stored in the Maven repository.

apalache-mc / apalache

Pull in updated version of SANY to support Unicode specs #2995