asmagill commented 4 years ago

I'm working on a module hs.text which I hope will address most of the ideas and concerns described in #1452 and others... the idea is that the userdata objects for the module will contain the text as raw data with an encoding type (specified or detected). The string and utf8 methods will be implemented in a way that means that most of the time you don't need to know what a specific objects encoding type is, and there will be a method with which you can get the raw bytes of the text converted to any encoding you choose (with options for lossless and lossy conversions).

Constructors will allow you to supply raw bytes you've received or created yourself, and I'm planning on one that will read a file in (bypassing the need to temporarily store it in Lua wasting time/memory). I'm thinking longer term (i.e. it may not be in the first version I hope to have up for testing in a couple of days), I'll also need to see about making hs.http aware of the module so HTTP requests can be dumped directly into an hs.text object as well.

I'm sure I'll have more questions as I go, hence creating this new issue, but some initial implementation questions:

the object will be a Lua userdata object so it can remain native as much as possible -- this reduces time/memory in moving the raw data back and forth between the Lua instance. Should the __tostring method return a value similar to existing userdata objects from our modules (e.g. hs.screen: Color LCD (0x6040014fd078)). I'm thinking, since the intent of the module is that everything it represents actually is a valid string in some encoding, that it should return the NSString version of the data so things like print(object) do what you'd expect for any variable representing a string/text value.
when comparing objects directly (e.g. __eq, __lt, and __le), should the raw data be compared or should the strings be compared for equivalence (e.g. if objA is in UTF8 and objB is in UTF16, the raw data may differ even when the strings represent the same text). Or should there be a separate equivalentTo method?
should comparisons (built in or by a method) consider Unicode compositions as equal to the cases where a single unicode entity exists for the character? (e.g. is (U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS) equal to or different from (U+004F LATIN CAPITAL LETTER O)(U+0308 COMBINING DIAERESIS)?) (Note that the default NSString method isEqualToString: considers these different)
additional modules that will need to be made hs.text aware? hs.http is already on my list, once the core module is out for review/testing.

latenitefilms commented 4 years ago

AMAZING!!

I’ll let @randomeizer reply to your questions, as he’s far smarter than I.

Thanks, as always, for all your incredible help and support!

randomeizer commented 4 years ago

Hi there! Sounds good - that was basically the gist of cp.text, but in Lua-land.

Regarding your questions:

Should the __tostring method return a value similar to existing userdata objects from our modules (e.g. hs.screen: Color LCD (0x6040014fd078)).

As you suggested, having it return a NSString with just the lua-friendly text seems the most logical here.

when comparing objects directly (e.g. __eq, __lt, and __le), should the raw data be compared or should the strings be compared for equivalence (e.g. if objA is in UTF8 and objB is in UTF16, the raw data may differ even when the strings represent the same text). Or should there be a separate equivalentTo method?

__lt and __le don't really make much sense in UTF land unless you are comparing the strings as code points, imho. Given that the objective is to make the actual encoding generally a hidden thing, my inclination would be to make the meta methods work as you would expect regardless of the actual encoding of the two strings.

should comparisons (built in or by a method) consider Unicode compositions as equal to the cases where a single unicode entity exists for the character? (e.g. is (U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS) equal to or different from (U+004F LATIN CAPITAL LETTER O)(U+0308 COMBINING DIAERESIS)?) (Note that the default NSString method isEqualToString: considers these different)

Hmm, tricky. My inclination would be to have the default __eq to consider the composed and single characters to be equivalent (same as isEqual), since that's what I would generally expect, and a separate isSame (or something) function which works like isEqualToString if you want the more efficient version that doesn't require decomposition/composition of the characters.

additional modules that will need to be made hs.text aware? hs.http is already on my list, once the core module is out for review/testing.

Maybe hs.javascript, hs.json, hs.plist, hs.utf8?

Hopefully that's helpful.

randomeizer commented 4 years ago

Possibly relevant: https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

asmagill commented 4 years ago

Right now, what __tostring does depends upon if an encoding was specified when the object was created:

if an encoding was specified, the raw data is converted to an NSString using the specified encoding and is then pushed onto the stack which is returned to lua -- what this means is that whatever the "official" encoding of the data is, what is pushed onto the stack is actually a UTF8 version that things like print can display in the console.
if no encoding was specified when the object was created, then the raw data is pushed onto the stack -- it may be valid UTF8 or it may not (and thus trigger the console's inclusion of the bulleted question mark for bogus characters)

__le and __lt usually aren't implemented in our modules because comparisons don't really make sense... in this case, it could be argued that since the object represents text of some sort, we might want to be able to compare them for sorting, etc. The argument "for" is simple -- a simple way to determine sort order. The arguments against, however are (at least in my mind) a little more compelling the more I think about them, so I solicit your opinions:

sort order can differ depending upon your locale (or the locale the text was created in)
the objects may or may not be in the same encoding -- how does an Ascii encoded object compare against a Simplified Chinese (GB 2312) encoded one?
even limited to Unicode, we run into the issues specified in @randomeizer's Normalization link above (which I think will likely end up a submodule of this after I get the basics working)

I'm leaning now towards __eq just dealing with the equality of the actual userdata object, __le and __lt being skipped as we usually do, and a separate comparison method or methods which take options to address the concerns listed above should be implemented.

asmagill commented 4 years ago

Another thought just crossed my mind, since I'm also thinking about implementing the __concat metamethod for easy concatenation -- it is possible to write the comparison and concatenation metamethods so that they only work if the objects are both of the same encoding type -- it is valid lua behavior to throw an error for one of these methods if the "types" don't match (e.g. 1 < "a")

asmagill commented 4 years ago

(should have just written a longer single entry)

Basically, the ideas behind the metamethods is to make things easier for the user (in this case, the programmers, and as the impetus for this module, that initially means @latenitefilms and @randomeizer); no userdata object requires any of them. So...

For your current and expected use cases, does the ability to concatenate two non-UTF8 encoded objects come up often enough that you'd like the short cut of obj1 .. obj2 or would a method like obj1:concat(obj2, [options ...]) be sufficient?

Likewise, do you compare them often enough that you'd like to be able to do things like obj1 < obj2, etc. or is obj1:compare(obj2, [options...]) sufficient?

Note that Implementing any of the metamethods does not preclude more powerful methods -- in fact I can almost guarantee that the compare one will be added in any case because of the variety of special cases that come up (case insensitive, etc.)

latenitefilms commented 4 years ago

I'll leave this to @randomeizer again, as I feel like this discussion is well and truly above my pay grade.

randomeizer commented 4 years ago

So basically, it looks like the main place we use the UTF-16 code is when we load localized strings files. I don't believe we do any real string manipulation currently, including concatenation.

That said, let me respond to your questions/comments.

the raw data is converted to an NSString using the specified encoding and is then pushed onto the stack which is returned to lua...what is pushed onto the stack is actually a UTF8 version

I'm not super-familiar with what happens to objects on the stack, but to my knowledge, NSString stores the values as 16-bit codepoints internally, not UTF-8? Is there some auto-conversion of NSString to UTF-8 "Lua String" that happens between Objective-C and Lua land? Or are you saying that you export a UTF-8 'raw data' version from the NSString instance before pushing it onto the stack?

__le and __lt usually aren't implemented in our modules because comparisons don't really make sense

It would definitely be complicated, particularly with locales. At the base level, you could just sort on 'codepoint' value, but a) doesn't seem like you're talking about preserving that and b) not sure how meaningful that would be anyway. I'd be comfortable with the library providing some 'comparison function' generators that could be passed into table.sort or whatever, which are based on encoding and locale.

In our cp.text library, I implemented __eq, but not the comparator functions, so we're not using them anyway.

the __concat metamethod for easy concatenation

Could be useful, although if it's only half-implementing string behaviour, not sure if anyone would expect it. And I imagine it would get tricky/impossible when you have mixed encodings. And as I mentioned, we're not really using it currently anyway.

Some questions for you:

If the returned value is essentially UTF-8 string data, why wrap it in an objectdata at all? Why not just return the string, decoded and re-encoded as UTF-8 instead? If it is an object that has some other useful properties, I think the simplest option would be to store characters as 16-bit codepoints, which seems to be what NSString does anyway, in which case, could we just wrap NSString instead?

I guess I don't have a clear idea of what the API you're proposing to build is yet.

latenitefilms commented 4 years ago

FWIW... I think maybe this encoding stuff is one of the reasons why I always had a hard time decoding base64 data from a binary property list. I'd always get unexpected results using hs.base64 and instead used openssl via hs.execute() to get things to work.

asmagill commented 4 years ago

For those watching at home, I had to take a break this week from working on this, but I've uploaded the current progress and notes to https://github.com/asmagill/hammerspoon_asm/tree/master/text

It's not really useful yet, other than for identifying possible valid encodings for a string of bytes and converting between them, but I figured I should show my progress thus far.

latenitefilms commented 4 years ago

Amazing! Thanks heaps!!

randomeizer commented 4 years ago

Hey there, had a brief look over the code, but more over the NOTES, so thought I'd throw in my $0.02 here on your comments:

should integer key to __index return "char" at that position (e.g. `hs.text:sub(key,key)`?
should _len return the same thing as `len` below with no options? or not be implemented?

To me, when dealing with text, it only makes sense to treat items as characters. Individual bytes are not super useful. Not sure what options you're considering for len, but for me, I just want characters, not bytes.

    if __index above implemented, then this should too

Agreed.

> inspect1(string)
{
// Done
  lower = <function 9>,
  upper = <function 17>

// Planned
  find = <function 4>,      // support for patterns uncertain

Find is fun to implement. You can see my version in the cp.text.match class, but I think it could be done better in native code. It might be possible to essentially parse the lua find syntax into standard regex for processing.

What would be great is if you could create something similar to cp.text.match where you can actually save parsed patterns in an object for reuse, which this function then uses.

And/or just allowing full regex support in lua...

  len = <function 8>,       // `hs.text:len` will likely combine this and `utf8.len` possibly with additional options

I'd be interested to know what options you're considering.

  match = <function 10>,    // support for patterns uncertain
  reverse = <function 14>,

I guess reverse should reverse the character order, not necessarily the byte order. Of course, that depends on knowing what that is, and if there is any other encoding weirdness for a particular encoding.

  sub = <function 15>,

This is fairly important, although I guess it could be done to the exported UTF-8 string instead.

// Uncertain
  gmatch = <function 6>,
  gsub = <function 7>,
  rep = <function 13>,

Yeah, dependent on whether matching is supported. These functions took the most time to write in cp.text.

// Probably Not
  byte = <function 1>,      // specific to single byte encodings; use `tostring(hs.text object):byte([i],[j])`

Yeah, makes sense.

  char = <function 2>,      // use `hs.text.new(string.char(...))`

Is this what index/len would provide above? Returning characters? Is it returning a string? With UTF-8 encoding? Or does it return a 32-bit integer?

  format = <function 5>,    // use `hs.text.new(string.format(...))` to create formatted string in required encoding

Agreed.

// No
  dump = <function 3>,      // binary representation of lua functions -- encoding would destroy
  pack = <function 11>,     // binary encoding of data in portable string -- encoding would destroy
  packsize = <function 12>, // binary encoding of data in portable string -- encoding would destroy
  unpack = <function 16>,   // binary encoding of data in portable string -- encoding would destroy
}

Agreed.

> inspect1(utf8)
{
// Planned
  codepoint = <function 2>, // can this work with *all* encodings or just limited to unicode/ascii/simple?

What other encodings are possible?

  codes = <function 3>,     // can this work with *all* encodings or just limited to unicode/ascii/simple?

Yeah, not sure about some of the more esoteric encodings. I would probably expect the code to be whatever a particular character's Unicode code point is, so not sure how to map something like Japanese ECU Encoding to Unicode, unless it's already built in.

All that said, not sure how often you would encounter some of those more obscure encodings these days. May depend on where you come from.

  offset = <function 5>     // make more generic -- n'th char of encoding, return byte position in the rawData

I guess? How are bytes accessed?

// Probably Not
  char = <function 1>,      // see `codepointToUTF8` below (a "safer" version of this that doesn't barf on invalid codepoints)
  len = <function 4>,       // `hs.text:len` will likely combine this and `string.len` possibly with additional options

// No
  charpattern = "....",     // used to iterate UTF8 in 8bit world; better to use module len and sub

Agreed.

}

> inspect1(hs.utf8)
{
// Uncertain
  codepointToUTF8 = <function 2>,   // use `hs.text.new(hs.utf8.codepointToUTF8(...))`; maybe replicate as `hs.text` constructor?
                                    // need to research what surrogate region used for and if current implementation covers all
                                    // possibilities in all unicode variants (8, 16, 32, be, le) before making generic constructor.

// No
  asciiOnly = <function 1>,         // use `hs.utf8.asciiOnly(tostring(hs.text object))` or `hs.utf8.asciiOnly(hs.text:rawData())`
  fixUTF8 = <function 3>,           // use `hs.text:asEncoding(#, true)`
  hexDump = <function 4>,           // use `hs.utf8.hexDump(tostring(hs.text object))` or `hs.utf8.hexDump(hs.text:rawData())`
  registerCodepoint = <function 5>, // n/a
  registeredKeys = {...},           // n/a
  registeredLabels = {...}          // n/a
}

I'm not too familiary with hs.utf8, but the assumptions here seem reasonable.

asmagill commented 4 years ago

@randomeizer, some thoughts re your comments:

First, unless specifically stated elsewhere (when I get around to writing the docs), assume everything is in terms of characters -- if you want to deal in bytes, there is a method hs.text:rawData() which will return a lua string containing the raw data un-encoded by Objective-C. (Ok, not 100% true -- you can explicitly specify an encoding type of 0 to the constructor which means "no encoding what-so-ever", but the main reason I included this is so that you can then use hs.text:validEncodings() and hs.text:guessEncoding() on data that you know nothing about -- it's expected that for most cases, data won't stay type 0 for long.)

A little background on how hs.text is doing things:

For an hs.text object, the raw bytes are stored as anNSData` object and an encoding type is specified -- this simplifies converting between encodings or seeing what possible encodings a given sequence of bytes could be valid for, determining if the assumed/expected/specified encoding is valid for the data (which could indicate errors in transmission or corruption), etc. All manipulation is done by converting the NSData object to an NSString object with the specified encoding and since, internally, NSString objects are UTF16 based, I'm working under the assumption that every valid character in every encoding has a UTF16 representation (possibly composed) -- otherwise NSString couldn't contain the character.

Comments concerning things in your post above:

Looking closer at len, the only real question about options is do we want to replicate the indices of utf8.len? From the docs:

utf8.len (s [, i [, j]])

Returns the number of UTF-8 characters in string s that start between positions i and j (both inclusive). The default for i is 1 and for j is -1. If it finds any invalid byte sequence, returns a false value plus the position of the first invalid byte.

While it's a little unclear, i and j specify bytes in the lua string, so it is possible to specify indices which split valid UTF8 characters... I'm leaning towards not adding this, as you can easily do utf8.len(tostring(hs.text object), i, j), but this does involve moving the data into Lua-land, so... since one of the impetuses behind this module is to reduce time/memory moving data back and forth, do you think you'd take advantage of this "feature" of the utf8 module or is it an edge case not worth worrying about right now?

I see __index with an integer key returning an hs.text object for the character at that position with the same encoding as the parent object. This may be a little overkill for a single character since this will be at most 4 bytes, but simplifies concatenation constructors and will let me leverage the sub method.

string.char returns a string composed of the ascii codes specified as arguments, so hs.text.new(string.char(...)) seems to be the hs.text equivalent... I don't know that this would benefit from being fully native, since you're still having to pass in the numbers, but I guess a syntatic-sugar constructor written in the lua portion of the module's code is a one-liner anyways, so I'll probably add it.

As for possible encodings, on my machine I see (there are a handful of duplicate numbers as I explicitly add entries for the encodings which have Objective-C defined constants with "simpler" names -- hs.text.encodingTypes.UTF16 is easier to type/read than hs.text.encodingTypes["Unicode (UTF-16)"] when specifying the machine native UTF16 encoding to a constructor or conversion method):

> t = require("hs.text")

> t.encodingTypes
ASCII                              1
Arabic (DOS)                       2147484697
Arabic (ISO 8859-6)                2147484166
Arabic (Mac OS)                    2147483652
Arabic (Windows)                   2147484934
Baltic (DOS)                       2147484678
Baltic (ISO Latin 7)               2147484173
Baltic (Windows)                   2147484935
Canadian French (DOS)              2147484696
Celtic (ISO Latin 8)               2147484174
Celtic (Mac OS)                    2147483687
Central European (DOS Latin 2)     2147484690
Central European (ISO Latin 2)     9
Central European (ISO Latin 4)     2147484164
Central European (Mac OS)          2147483677
Central European (Windows Latin 2) 15
Chinese (GB 18030)                 2147485234
Chinese (GBK)                      2147485233
Chinese (ISO 2022-CN)              2147485744
Croatian (Mac OS)                  2147483684
Cyrillic (DOS)                     2147484691
Cyrillic (ISO 8859-5)              2147484165
Cyrillic (KOI8-R)                  2147486210
Cyrillic (Mac OS Ukrainian)        2147483800
Cyrillic (Mac OS)                  2147483655
Cyrillic (Windows)                 11
Devanagari (Mac OS)                2147483657
Dingbats (Mac OS)                  2147483682
Farsi (Mac OS)                     2147483788
Gaelic (Mac OS)                    2147483688
Greek (DOS Greek 1)                2147484689
Greek (DOS Greek 2)                2147484700
Greek (DOS)                        2147484677
Greek (ISO 8859-7)                 2147484167
Greek (Mac OS)                     2147483654
Greek (Windows)                    13
Gujarati (Mac OS)                  2147483659
Gurmukhi (Mac OS)                  2147483658
Hebrew (DOS)                       2147484695
Hebrew (ISO 8859-8)                2147484168
Hebrew (Mac OS)                    2147483653
Hebrew (Windows)                   2147484933
ISO2022JP                          21
ISOLatin1                          5
ISOLatin2                          9
Icelandic (DOS)                    2147484694
Icelandic (Mac OS)                 2147483685
Inuit (Mac OS)                     2147483884
Japanese (EUC)                     3
Japanese (ISO 2022-JP)             21
Japanese (ISO 2022-JP-1)           2147485730
Japanese (ISO 2022-JP-2)           2147485729
Japanese (Mac OS)                  2147483649
Japanese (Shift JIS X0213)         2147485224
Japanese (Shift JIS)               2147486209
Japanese (Windows, DOS)            8
JapaneseEUC                        3
Keyboard Symbols (Mac OS)          2147483689
Korean (EUC)                       2147486016
Korean (ISO 2022-KR)               2147485760
Korean (Mac OS)                    2147483651
Korean (Windows, DOS)              2147484706
Latin-US (DOS)                     2147484672
MacOSRoman                         30
NEXTSTEP                           2
Non-lossy ASCII                    7
NonLossyASCII                      7
Nordic (DOS)                       2147484698
Nordic (ISO Latin 6)               2147484170
Portuguese (DOS)                   2147484693
Romanian (ISO Latin 10)            2147484176
Romanian (Mac OS)                  2147483686
Russian (DOS)                      2147484699
ShiftJIS                           8
Simplified Chinese (GB 2312)       2147486000
Simplified Chinese (HZ GB 2312)    2147486213
Simplified Chinese (Mac OS)        2147483673
Simplified Chinese (Windows, DOS)  2147484705
Symbol                             6
Symbol (Mac OS)                    6
Thai (ISO 8859-11)                 2147484171
Thai (Mac OS)                      2147483669
Thai (Windows, DOS)                2147484701
Tibetan (Mac OS)                   2147483674
Traditional Chinese (Big 5 HKSCS)  2147486214
Traditional Chinese (Big 5)        2147486211
Traditional Chinese (Big 5-E)      2147486217
Traditional Chinese (EUC)          2147486001
Traditional Chinese (Mac OS)       2147483650
Traditional Chinese (Windows, DOS) 2147484707
Turkish (DOS)                      2147484692
Turkish (ISO Latin 5)              2147484169
Turkish (Mac OS)                   2147483683
Turkish (Windows Latin 5)          14
UTF16                              10
UTF16BigEndian                     2415919360
UTF16LittleEndian                  2483028224
UTF32                              2348810496
UTF32BigEndian                     2550137088
UTF32LittleEndian                  2617245952
UTF8                               4
Ukrainian (KOI8-U)                 2147486216
Unicode                            10
Unicode (UTF-16)                   10
Unicode (UTF-16BE)                 2415919360
Unicode (UTF-16LE)                 2483028224
Unicode (UTF-32)                   2348810496
Unicode (UTF-32BE)                 2550137088
Unicode (UTF-32LE)                 2617245952
Unicode (UTF-7)                    2214592768
Unicode (UTF-8)                    4
Vietnamese (Windows)               2147484936
Western (ASCII)                    1
Western (DOS Latin 1)              2147484688
Western (EBCDIC Latin 1)           2147486722
Western (EBCDIC Latin Core)        2147486721
Western (ISO Latin 1)              5
Western (ISO Latin 3)              2147484163
Western (ISO Latin 9)              2147484175
Western (Mac Mail)                 2147486212
Western (Mac OS Roman)             30
Western (NextStep)                 2
Western (Windows Latin 1)          12
WindowsCP1250                      15
WindowsCP1251                      11
WindowsCP1252                      12
WindowsCP1253                      13
WindowsCP1254                      14

Other countries may have more/less, not sure.

Re utf8.codepoint, utf8.codes, and utf8.offset, I guess I answered my own question above in describing how NSString works... basically, I have to assume that yes, all "characters" are representable as UTF16 (and thus 8 or 32) or NSString couldn't work with them. As such, the real question is whether there is a better name for the methods that is less Unicode-centric.

My tentative road-map at present is to finish a few more of the low hanging fruit methods (substring, length, etc.), then look at your code for Command Post and see what to replicate from there, then think about implementing find, compare, etc. and determine how best to handle patterns. Then normalization.

And I suppose I should add documentation strings at some point along the way so you can start trying this out and giving input as to whether the module is actually helpful or needs to change direction :-)

I expect to have some time to dedicate to this in a few days, rather than just the odd moment here and there I've had this last week, so hopefully this weekend I'll have something you can actually start to use.

randomeizer commented 4 years ago

Hey there, great stuff.

One main question: Why store the value as NSData + Encoding, rather than just wrapping an NSString, essentially? I guess it means you could write the data out again in its original format?

For our usage, we are generally trying to read in text from a variety of UTF formats (which we probably don't know exactly which in advance), and sometimes writing UTF-8. Having dug into NSString a bit now, our cp.text library is essentially a poor-man's NSString in that regard - it loads text and stores it internally as a list of codepoints. In our case, we're stuck with using Lua numbers, which I believe are 32-bit (or are they 64-bit now?), whereas NSString can be smarter about the size of the numbers based on the content of the string. NSString obviously has better support for loading various encodings, etc, and has been battle-tested for decades...

asmagill commented 4 years ago

Re keeping it as NSData -- a design choice. I'm thinking more broadly about where the data can come from, once other modules are made aware/compatible with it, and I can imagine situations where we can't always know what the encoding is up front (or trust the source to be correct), or that the data might have been corrupted along the way -- by keeping it as NSData, we're not actually changing the original in any way -- even the methods which return the object encoded differently actually return a new object (though of course if you do something like obj = obj:asEncoding(hs.text.encodingTypes.UTF32) you'll replace the original object, but that's your choice, not something imposed by the module).

randomeizer commented 4 years ago

I see where you're coming from, but it certainly does make things like implementing __len, __index, etc more difficult, since you essentially have to write your own implementation of each possible encoding to figure it out, right?

asmagill commented 4 years ago

2280 prompts me to ask a question re this module -- for `match`, `find`, `gmatch`, and `gsub`, I'm leveraging the macOS `NSRegularExpression` class which, as you might expect, uses formal regular expressions and not lua syntax... I've been going back in forth in my mind about whether or not to add code to optionally convert lua style pattern matching symbols to their regular expression equivalent... thoughts? (While I haven't looked at the full spec yet, IIRC, the lua pattern matching syntax is functionally a subset of RE with some syntax changes -- e.g. `%` instead of `\`, and a couple of others -- so it should be do-able.)

FWIW, in the current code pushed to the repository identified above, I've implemented find and match (but you have to use the regular expression style for pattern matching) but am still working on gsub and gmatch.

Also, for those interested, I've separated it into hs.text and hs.text.utf16 -- hs.text handles encoding related stuff (and will include functions to read/write files and interface with hs.http at some point, but not yet...) while mimicking the string and utf8 libraries has been moved into hs.text.utf16 because implementing them in an encoding independent manner was proving to be a royal pain in the A%$# (as a professional programmer friend once told me, "call that type of problem 'non-trivial' when talking about it to others")... plus I really didn't want to have to really dig into how each encoding differed to handle their quirks.

My reasoning is that since macOS treats all NSString objects (and variants) as UTF16 internally, for all practical purposes that implies that if the mac can handle it, then UTF16 can fully represent it. The process will be: read/get the data in whatever encoding you find, convert it to UTF16, use the submodule to manipulate it, convert it back to the necessary encoding, and then write/push it back.

latenitefilms commented 4 years ago

FWIW - I think as long as it's well documented, it doesn't really matter if regular expression or Lua patterns are used throughout Hammerspoon.

cmsj commented 4 years ago

If it helps, I much prefer REs over Lua patterns :)

randomeizer commented 4 years ago

Ugh, I would love to have REs in Lua. I know that Lua patterns are way simpler, but they are definitely restrictive.

Regarding UTF-16, agree that UTF-8 and UTF-16 will cover all our use cases in CP at the present time. I don't foresee other random encodings popping up, but I think we could use hs.text to decode them to UTF-8 anyway, right?

latenitefilms commented 4 years ago

@randomeizer - This might be of interest?

https://github.com/mah0x211/lua-regex

randomeizer commented 4 years ago

Sure, although I note two things:

Last update 3 years ago
"NOTE: this module is under heavy development."

But in general, having the ability to compile a RegEx into a reusable pattern would be great. All the Lua Pattern stuff is "use once, throw away", which is less than ideal as well.

This does seem like a separate feature request though.

asmagill commented 4 years ago

@randomeizer, re UTF8, there are two approaches you can take -- if you want to keep the text as an Objective-C object (i.e. remove the overhead of moving it into/out of lua memory), then you can use hs.text to convert between UTF8 and UTF16; if you want to use the actual lua string and utf8 libraries, then you can use tostring(<hs.text.urf16 object>). This will be described more fully in the documentation for the module when I start on it, hopefully this weekend.

In general I also prefer regular expressions (so much so that once this module is completed, I'm considering adding a version that works directly on lua strings as well); the only annoyance is that because lua uses \ to escape characters in strings, you either have to create the regular expressions with bracket quotes (e.g. [[\d+]]) or by escaping the backslash (e.g. "\\d+").

randomeizer commented 4 years ago

Regarding conversion of hs.text, sounds good.

I'm used to doing regexes in Java/JavaScript, so the double backslash is a known quantity. Having [[...]] syntax is in Lua nice though. I guess that's why Lua patterns use '%' instead.

And yes, a more general hs.regex package that works with standard Lua strings would be awesome.

asmagill commented 4 years ago

Ok, for those of you interested, I've finally got the module in what I hope is close to its final state. It's been moved to it's own repo at https://github.com/asmagill/hs.text.

hs.text has functions for reading and writing files and converting between encodings
hs.text.utf16 contains working versions of all of the string and utf8 functions that seem appropriate
hs.text.http has hs.text aware versions of hs.http functions.

There are a few more things that NSString offers that I may add over time, but it does approach parity with the lua libraries and what I think CommandPost's libraries require.

The biggest things to remember when using hs.text.utf16 are:

everything is in terms of UTF16 characters (16bit) rather than bytes. There are some functions/methods that are surrogate and composed character aware (and are documented as such), but things like length, etc. are not in terms of bytes.
Lua pattern matching is not supported for the gmatch, gsub, etc. equivalents -- you must use regular expressions. Apple's docs refer to http://userguide.icu-project.org/strings/regexp as a reference, and so far I haven't noticed any discrepancies.

If you use the inline help in the Hammerspoon console or associated webview, loading the module does incorporate the documentation as hs.text.

Hammer at it, review the documentation and offer suggestions, I'd like to move this to core at some point in the near future when I'm reasonably comfortable that it won't blow up anything.

latenitefilms commented 4 years ago

Amazing!! Thank you!

To be honest, all this text encoding stuff is way above my pay grade, so I look forward to @randomeizer having a proper play at some point.

Thanks again!

randomeizer commented 4 years ago

Looks great! That will definitely exceed our use cases at present.

Regarding the note in code regarding whether to support/require "%" as the escape modifier in the various match/gsub functions, I would say no. Just make them regular, official regex syntax only. Using [[xxxxx]] as a way of avoiding double-escaping the \ is acceptable.

If I was being greedy, I'd request a Lua-friendly wrapper for NSRegularExpression (hs.text.regex maybe?) which would accept either Lua string or hs.text.utf16 values, and provided methods similar to the standard Lua string pattern matcher functions (match, gsub, etc). The advantage being that I can store the pattern for later reuse, as opposed to using the string/hs.text.utf16 methods, which create then throw away the NSRegularExpression object every time you call it.

This would be pretty similar to the implementation you have for those methods in hs.text.utf16, except that you pass in the string being match or gsubd, rather than the reverse. Eg:

local regex = require("hs.text.regex")
local emailPattern = regex.new [[(.+)@(.+)]]
local username, server = emailPattern:match("foobar@gmail.com")
print(username, server) -- "foobar     gmail.com"

asmagill commented 4 years ago

I'd been thinking about adding a submodule that replicated the utf16 methods that use regular expressions for lua strings, but hadn't fully made up my mind since it would basically replicate the string library but with a more powerful syntax that 90% of users would probably never take advantage of (and if you truly do need it for utf8, you can always do something like hs.text.utf16.new(utf8String):gsub(pattern, replacement) anyways)... making it a true regular expression submodule where you pass strings/utf16 objects to the expression, rather than the other way around sounds better... let me mull it a bit and see when I might be able to add it.

randomeizer commented 4 years ago

Potentially you could migrate the existing code you have in hs.text.utf16:match(...) etc into the external module and just call it from inside the utf16 methods instead, rather than duplicating the code.

asmagill commented 4 years ago

My initial thoughts are that match might work well being replaced, but the others (gmatch, gsub, find) may be little too lua specific in their approach and semantics; however, let me start replicating the `NSRegularExpression methods and enumerators and see...

I do want to keep the utf16 submodule as close to the stiring and utf8 libraries as possible so that someone who is familiar with lua syntax, but happens to need to work with an encoding other than utf8 finds a (mostly) familiar interface... that said, it shouldn't be difficult to make the pattern parameter for the relevant methods in utf16 accept an hs.text.regex object as an alternative to the current string/utf16 it expects. For performance purposes, it should also be possible to bypass the creation of the expression if you do use an hs.text.regex object since there are no prohibitions to subclassing NSRegularExpression.

randomeizer commented 4 years ago

I'm not sure I would want NSRegularExpression replicated exactly - it has a lot of fiddly extra options as requirements that I honestly don't usually care about. NSRanges are annoying to deal with, etc. About 97% of my uses of Lua patterns involve calling match, 2% find < 1% gsub/gmatch, but you've basically implemented those options so why not include them?

asmagill commented 3 years ago

@latenitefilms not sure if you're using this module in your production code or not... if you are, I just pushed the first release with an official hs.text.regex sub-module.

I haven't implemented the full gamut of possible methods with callbacks yet, but it does replicate match, find, and gmatch support for both strings and hs.text.utf16 objects -- in fact the hs.text.utf16 versions of those methods are now just wrappers for the hs.text.regex versions, so let me know if you see any change in behavior... you shouldn't, but we all know how that goes....

If you are using this in your code, I highly recommend upgrading, even if you're not ready to test the regex stuff yet -- I discovered that I had included both selfRef and selfRefCount approaches for managing userdata objects (not sure why as I've known for a long time not to mix them) which means the previous versions will leak memory.

Still working on gsub, but figured I'd get this out for testing.

@randomeizer I take your points, and for 95+% of cases, once I add gsub, it probably will be sufficient, but the benefits of adding the two primary block-enumeration methods of the class are callbacks with progress updates and background threading for particularly complex patterns... maybe only useful for 1% or so of cases, but if needed, very useful... extremely large text blocks (which is one of the reasons hs.text was conceived -- to keep such out of Lua-space unless absolutely required) and particularly complex regex patterns can be slow.

asmagill commented 3 years ago

And as usual, some syntax errors from inadequate or incomplete testing after hand merging branches... ok, let me troubleshoot a bit...

latenitefilms commented 3 years ago

Amazing! I think currently we're only using hs.text for making things upper and lower case, but will definitely swap it out with the latest and greatest once it's ready.

asmagill commented 3 years ago

Ok, fixed the hand-merge error, so now it's ready for testing!

latenitefilms commented 3 years ago

Swapped in... so far, so good. Will let you know if we spot any bugs as we start using it more. Thanks heaps!!

asmagill commented 3 years ago

hs.text.regex.gsubIn added and hs.text.utf16.gsub now uses it (and fixes some errors with table lookups that I missed the first time around).

While there are still a couple of "power user" methods I want to add, this is probably ready for a PR once I've hammered on it a bit more to make sure I haven't missed any obvious errors as it should cover 90+% of expected use cases as is. The other methods can be added at a later time.

@latenitefilms what portions of hs.text do you primarily use (this will help me know where to focus my banging as I can assume what you're using probably works or I would have heard about it by now 😜)

Hammerspoon / hammerspoon

Create `hs.text` module to handle different string encodings #2215