Cannot parse non-English characters

danka74 commented 5 years ago

The parser chokes on characters such as "ö", "ü", "ð", and "é" (to pick characters from a few SNOMED I member country languages). Strangely, the first two are not in the grammar (at least the ANTLR one which I can read) and hence should raise an error, but the latter two are.

Tried various non-English alphabet characters in the ECL tests (src/test/java/org/snomed/langauges/ecl/ECLQueryBuilderTest.java, line 36-37) like this:

ExpressionConstraint query = eclQueryBuilder.createQuery( "< 404684003 |Clönical finding| :\n" +

mvn test ... Running org.snomed.langauges.ecl.ECLQueryBuilderTest line 1:15 token recognition error at: 'ö' line 1:59 no viable alternative at input '' Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.333 sec

This impacts snowstorm as well.

/Daniel

kaicode commented 5 years ago

Hi @danka74, thanks for looking into this and providing some examples for testing!

The simplest solution here would be to strip all terms from the ECL before parsing. The returned java object does not contain any terms to this would not result in any loss of information. I'll mark it as an enhancement and "help wanted" in case some kind soul is able to contribute a fix before it gets up the priority stack at Snomed International.

Kind regards, Kai

danka74 commented 5 years ago

Hi @kaicode, I think having correct ANTLR (ABNF) would be preferable, although for ECL execution text is not needed, other applications might require the text to be kept. I could have a try at this. Is there anything specific you have been doing when generating Java from ANTLR from ABNF? Also, is this the place to update the grammar, or is there another repository (I think I remeber another repo with all grammars, but now I cannot find it)? Thanks, Daniel

danka74 commented 5 years ago

Hi @kaicode , I've added parser generation to maven in my fork, branch character_fix, https://github.com/danka74/snomed-ecl-parser/tree/characters_fix There are missing characters in the grammar file, and that seems to be the problem. I added 'ö' to the grammar file the parser behaved as expected. The grammar seems to have an odd list of characters. I'd rather reuse something from an existing lexer, e.g. https://github.com/antlr/grammars-v4/blob/master/antlr4/LexBasic.g4 Do you know if there is a particular reason the letters in the current grammar is the way it is? Also, is it important to use ABNF? Thanks, Daniel

danka74 commented 5 years ago

Hi again @kaicode , I've now updated the grammar with lexer rules for allowing characters from non-English languages, borrowing the lexer rules from the ANTLR4 grammar. I have not yet made a pull request since it's rather ugly as I've added the rule on top of the existing rules as not to break (or minimally break) any relationship with the ABNF grammar. Would like advice on how to progress, i.e. whether to clean up the ANTLR4 grammar or the ABNF grammar (is there a conversion tool?). Who wrote the ABNF grammar?

kaicode commented 5 years ago

Hi @danka74, thanks for taking a look at this.

The ANTLR was generated from ABNF using an online tool. The ANTLR parser was generated using antlr v4.5.3 - we found that later versions did not work as expected. There are instructions for this in the SCG parser repo which used the same method: https://github.com/IHTSDO/snomed-scg-parser/tree/master/parser-generation

I am not familiar with the lexer rules, I hope you can figure out how to work this into the solution. Sounds like you are making progress!

danka74 commented 5 years ago

Hi @kaicode , I have a working solution in the branch https://github.com/danka74/snomed-ecl-parser/tree/characters_fix but it was made by editing the generated ANTLR4 file. I'm not familiar with ABNF but I could try if that's important (personally I prefer ANTLR4, and the ABNF grammar seems a bit hacky but I don't know if that's due to limitations of ABNF). BTW, there was ANTLR 4.5.1 in pom.xml.

Input:

        ExpressionConstraint query = eclQueryBuilder.createQuery(
                "< 404684003 |üåäöáðę 中国 ہیلو Қазақ finding| :\n" +
                        "    47429007 |שלום with| = " +
                        "        (< 404684003 |안녕하세요 finding| : \n" +
                        "            116676008 |بهاس ملايو morphology|  = <<  55641003 |český| )");

Output:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.snomed.langauges.ecl.ECLQueryBuilderTest
line 1:59 no viable alternative at input '<EOF>'
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.378 sec

Results :

Tests run: 2, Failures: 0, Errors: 0, Skipped: 0

danka74 commented 5 years ago

ABNF is quite annoying... I edited the ABNF and used the online translation tool to generate ANTLR4. However, the tool generates 57000 lines of ANTLR4 taking ages to process, generating bad, slow Java code or, using direct translation, it creates incorrect ANTLR4 as it mixes parser and lexer rules. When I edited the ABNF to allow generation of correct ANTLR4 code, it was no longer in sync with your Java code as refactoring was needed. As I said above, it works fine by just adding a few lines to the generated ANTLR4 but it breaks traceability to the original ABNF. Please advice...

kaicode commented 5 years ago

I'm happy with breaking the traceability to the original ABNF if you are able to write a note about how to repeat the process. That way we won't lose this enhancement when the ECL spec changes.

I love your test case 😸 All the characters!

danka74 commented 5 years ago

Would you move version to 1.0.2 so that the changed parser can be loaded into snowstorm? Or should I?

danka74 commented 5 years ago

Not good, I tried the new ECL parser on snowstorm develop and there are failures, one example out of 13:

[ERROR] Failures: [ERROR] ECLQueryServiceStatedAxiomTest.attributeGroupDisjunction:167 Match clinical finding with at least one grouped finding site attributes. expected:<[999204306007, 204306007]> but was:<[297968009, 999204306007, 204306007]>

This seems very unrelated to the character updates. However, rerunning the tests with ECL parser 1.0.1 does not give these failures. Did you do anything else to the generated ANTLR4 grammar? ...and also, I think the resulting set is actually correct in this case.

kaicode commented 5 years ago

I've moved this feature to a characters_fix branch. I've added a unit test to develop and the new branch. The unit test checks that disjunction sub-refinements are parsed correctly. This is not working in the feature branch. This is unexpected. I'm pretty sure we didn't change the generated code.. I've tried changing this project to Antlr version 4.5.1 but no joy. Will try to debug.

danka74 commented 5 years ago

@kaicode , could you push your snowstorm changes? I could have a stab as well.

Sorry, I misunderstood... please ignore...

kaicode commented 5 years ago

Definitely related to the antlr grammar file changes ... if the develop branch grammar file is pasted over the top of ECL.g4 then the parseDisjunctionSubRefinement test starts passing, although the parseNestedAttributeValue test then fails of course.

danka74 commented 5 years ago

Hi @kaicode , I created a new pull request. It seems to have been a problem with the UTF8 rules from ABNF interacting with my additions. The UTF8 rules from the ABNF are from RFC 3629 which describes the (variable) byte structure of UTF-8 (1-, 2-, 3-, 4-byte UTF-8), but as ANTLR parsers (and I guess ABNF parsers as well) are based on character streams and not byte streams, those rules should probably not have been there. I tried it with snowstorm develop branch and all tests are ok, I even added a few 'ü's and 'ö's in the ECL. The grammar could definitely be cleaned up further... but that's for another day. Cheers, Daniel

kaicode commented 5 years ago

I've merged your changes into develop and removed the temp branch now that all the unit tests are passing. I have also run the Snowstorm tests against the new version of this library and it looks good.

Is now a good time for me to create a release of this library with these changes?

danka74 commented 5 years ago

Hi @kaicode ,

Yes, please release! The grammar is still kind-of ugly, but if we clear the ABNF dependency issue with Linda (or is it Linda? Rory?) we can go ahead and do some more invasive clean up.

/Daniel

danka74 commented 5 years ago

/Daniel

kaicode commented 5 years ago

Thank you @danka74 for raising and resolving this issue!

IHTSDO / snomed-ecl-parser

Cannot parse non-English characters #1