ammar / regexp_parser

A regular expression parser library for Ruby
MIT License

Unicode blocks #9

Closed: gjtorikian closed this 9 years ago

gjtorikian commented 9 years ago

Closes https://github.com/ammar/regexp_parser/issues/8.

Some of the tests are failing: all of the ones with a - character in the name, like InMiscellaneous_Mathematical_Symbols-A. I suspect this is because \p{...} does not expect a character like - to exist, possibly only alphanumerics and _. I'm not familiar with Ragel at all (more of a Bison guy myself :wink:), so I'm not sure where in the code this should change. Everything else appears to be working, though.

/cc @ammar

ammar commented 9 years ago

@gjtorikian Thank you very much for taking the time to do this. It looks good.

The tests seem to be failing for two reasons:

  1. The first property name still has underscores in its name (inalphabetic_presentation_forms, property.rl line 582)
  2. The definition of property_script in property.rl (line 45) needs to be updated to match digits and -:

property_script = (alnum | space | '_' | '-')+;
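For readers more at home in Ruby than Ragel, the rule above can be approximated with a plain Ruby regular expression. This is only a rough sketch: PROPERTY_NAME and property_name? are hypothetical names for illustration, not part of regexp_parser, and it assumes Ragel's alnum and space builtins cover ASCII letters, digits, and whitespace.

```ruby
# Rough Ruby approximation of the Ragel rule (alnum | space | '_' | '-')+.
# PROPERTY_NAME and property_name? are hypothetical, for illustration only.
PROPERTY_NAME = /\A[A-Za-z0-9\s_\-]+\z/

# True when text consists solely of characters the rule accepts.
def property_name?(text)
  PROPERTY_NAME.match?(text)
end

puts property_name?("InMiscellaneous_Mathematical_Symbols-A") # => true
puts property_name?("In Basic Latin")                         # => true
puts property_name?("In{Bad}")                                # => false
```

The hyphenated block name from the failing tests passes because - is part of the character class, while braces are still rejected.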

Ruby accepts any variation of -, _, and spaces in property names. It just strips them out before lookup, and the scanner does the same.
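A minimal sketch of that normalization step, in plain Ruby rather than the scanner's Ragel: normalize_property_name is a hypothetical helper, not the library's actual code, and the case folding is an added assumption (the comment above only mentions stripping the separators).

```ruby
# Hypothetical helper: drop spaces, underscores, and hyphens before lookup.
# Lowercasing is an assumption added here so that spellings differing only
# in case map to the same key.
def normalize_property_name(name)
  name.downcase.gsub(/[\s_\-]/, "")
end

names = [
  "InMiscellaneous_Mathematical_Symbols-A",
  "In Miscellaneous Mathematical Symbols A",
  "inmiscellaneousmathematicalsymbolsa",
]

# All three spellings collapse to one lookup key.
p names.map { |n| normalize_property_name(n) }.uniq
# => ["inmiscellaneousmathematicalsymbolsa"]
```

With a step like this in front of the lookup, a table of property names only needs one canonical spelling per entry.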

Should probably find a better name for property_script too.

I will have some time to merge this and add a class for the Blocks over the coming weekend.

Thanks again and Cheers!

ammar commented 9 years ago

@gjtorikian, FYI, I just released a new version (0.3.0) with the Unicode Block support. Thanks again.

gjtorikian commented 9 years ago

Thanks for the quick release!