Culture revamp, cacheable resources, small tweaks and fixes

briantist commented 6 years ago

The big news here is the beginnings of adding support for multiple cultures and languages. Directory structure and file naming have been changed for that, and all methods referencing those files have as well.

To that end, I've removed "floating" variables that contain information that should be localized (namely the $nouns, $adjectives, and$verbs` variables), but to get back the performance benefits I've implemented a system whereby the contents of such files are cached when asked for so that they aren't read again (lots of options for overriding and controlling that behavior). Methods relying on that content have been updated to be localization aware.

Some methods that relied on hardcoded information have been changed to use localized files (like the space function, since space characters are not the same in every language and culture).

The randomdate function has been updated to be localization aware and to take parameters for min and max date. The only breaking change in this PR from my perspective is changing the defaults for these values. I've changed them to [DateTime]::MinValue and [DateTime]::MaxValue respectively. They can now be overridden i.e. ig '[randomdate 1/1/1970 12/31/2070] but if you feel strongly about your 1/1/1970 and [DateTime]::Now defaults I'll keep the params and change their values.

Added [colors] (American English, British English, and Japanese), but the lists are pretty short.

There's more work to be done in adding localization support, that I haven't gotten to yet (namely script support which would replace direct alphabet support, haven't fully figured out how to do that yet). Adding support for other number notations should be interesting too, especially in a way choosable by the user, as many cultures use several number systems.

I've also had a lot of trouble finding word lists for non-American English, especially lists that have part-of-speech so that nouns, adjectives, and verbs can be shown appropriately. Any help in finding good comprehensive word lists in any language is much appreciated.

Also need to add documentation for new methods and patterns.

Feedback welcome while I finish fleshing out some of the changes above.

dfinke commented 6 years ago

Excellent, looking forward to checking it out. Looks like the tests failed on Linux

briantist commented 6 years ago

The tests failed on linux going back a month though, was that from your test demo maybe? I remember some demonstrations about tests for different platforms. The failure is about Get-CimInstance which isn't really used in this module anyway.

briantist commented 6 years ago

One other slight change you might consider "breaking" about randomdate. The output format default was changed from MM/dd/yyyy to the culture-specific "ShortDateFormat". For en-US this is M/d/yyyy so the "new" default for users previously already in en-US culture no longer pads with 0 on the month and day. For international users they'll see whatever their usual short date format is.

But I did also forget to mention that the format is now a parameter and thus overridable as well, but you'd have to pick a min and max first to get at it when using Invoke-Generate template strings.

dfinke commented 6 years ago

My fault. I used the tests in a demo to show how it failed on Linux. You can comment that out. With these changes, probably going to push the version 2.0. Wonder if it's worth putting some basic tests in the current one to test simple values returned from each template?

cdhunt commented 6 years ago

Tests are always a good thing, but I think we'd need to refactor some more to separate the random data generation from formatting. We could also use tests for template parsing.

dfinke commented 6 years ago

Agreed. Figured since this PR is refactoring and probably unintentionally breaking things, wondering if it's worth to put the effort in for some "smoke" tests before merging the PR. Or just version this 2.0, put tests in now and know it is breaking, which it will be anyway.

cdhunt commented 6 years ago

I still need some PRs for Hacktober, but this month is terrible for me personal time wise. Testing random data generations for multiple cultures is going to be a challenge. We could do some simple tests like the test for the -count parameter.

Here are some simple ones off of the top of my head.

# should be an [int].
[adjective] [noun] should contain two words and not be "[adjective noun]"
[space] length should be 1

Looking at the code, I think the Caching logic is way overly complicated. The size of the data is so small, you could just load it all into memory statically. Then you just need one helper function that loads a given culture that runs with CurrentCulture by default when the module is loaded.

The culture support is awesome and the module is due for a refactor from the ad-hoc, proff-of-concept origins.

dfinke commented 6 years ago

Agreed, not sure how to test a random data generator. First thought was to mock the functions and just make sure they were called.

Simple test may be just use a template and see if it returns data.

Just to see that things still "work"

briantist commented 6 years ago

I did fix the tests already by removing the invalid one and adding a single culture-specific test, but I agree that there's a lot of refactoring to be done that would help in writing and performing tests.

I tend to try to solve everything at once so I'm trying to show some restraint in not changing absolutely everything at once.

I have a few ideas on how to generalize some of this module in a way that:

Better lends itself to localizaton
Allows for easier testing
Allows for easier and more standardized extensibility and contributions

But for now I wanted to at least start getting feedback on changes thus far.

I may yet still add some more tests; writing that "simple" test to ensure spaces re returned with culture in mind did actually surface a bug that I also fixed, it's amazing how even simple tests are so useful.

briantist commented 6 years ago

@cdhunt re:

Looking at the code, I think the Caching logic is way overly complicated. The size of the data is so small, you could just load it all into memory statically. Then you just need one helper function that loads a given culture that runs with CurrentCulture by default when the module is loaded.

Could you expand on "load it into memory statically"? Culture isn't a set-it-one-time thing, as you can call individual functions with individual cultures, and you can change your thread culture on the fly. While the data is small now, it may not stay that way, and multiplied by additional cultures it further grows. Even if the total size is "small enough" we would still need a way to organize it, which the caching does now.

The caching logic is meant give template function authors a method to load content from a file and handle keeping that data in memory once asked for, so that when you call ig '[color]' -Count 1000 you aren't reading from the disk 1000 times for no reason, and to provide this in a way that authors don't have to think too hard about it.

Some things could be changed, but overall it makes for (what I think is) a straightforward invocation for templates, while keeping the logic consistent (in one place).

The caching logic is separate from culture resolving logic; you can use the cache content functions with a direct path that has nothing to do with cultures.

You can also use culture logic (Resolve-LocalizedPath) to find a culture-specific filename and then not use it with cache functions.

But I'd love to get more specific feedback on your complexity concerns and what changes you'd like to see. My responses here are kind of guessing at what you're getting at so maybe I can address your concerns more directly.

cdhunt commented 6 years ago

Right now, all data for all available cultures is under 50 KB. I don't think that warrants any kind of memory management logic. Just load all of the data on module load. You can use a compound key like en-US,names with a simple [hashtable]. That should support 30ish cultures while still using under 1 MB.

Sorry if it sounds like I'm attacking your caching logic. It's good code, just way more than I think this module needs for a long time - if ever. I 100% support keeping it in memory and avoiding file access for every function call. A few line for-loop should be all that's needed to load the data and your lookup code will probably also be a little bit simplified replacing path checking for a $null value check. And since it's all in memory, you could add something like Show-Cultures so the user knows what is available.

Running some benchmarks with Measure-Command against old and new versions would also be useful information to include in Readme/changelog.

dfinke commented 6 years ago

@briantist Cool code. I'd prefer to wait on caching. I put this module together as a lark based on a TypeScript implementation I thought was cool. I always like using the file system to "index" into subsections of data rather than manage it in memory.

briantist commented 5 years ago

I think there's some confusion around what I've done here, and it's really my fault for not explaining any of it in a big-ish PR, and providing no context, comments, or documentation for it, so hopefully I can fix that!

@cdhunt I think you're correct about the size of the data, and that even in the future, the total size will never be a problem to have in memory all at once.

But that was not what led to this design anyway. And I want to stress that there is no "memory management logic", for that reason. I never bother taking anything out of memory unless asked by some invocation. At no point are statistics kept, or a certain memory size targeted, or hit counts evaluated, and I also don't think that will ever be needed.

I think perhaps I should not have used the term "caching" at all, as it seems to imply a lot that doesn't exist.

The main expense this saves isn't even disk access, it's processing time for anything that doesn't directly use the data that lives on disk (like Import-Csv converting to an object).

The design goal of this "caching" mechanism was a near-minimal way to do a simple "store it in memory and forget about it" that meets a few criteria:

The caller doesn't know or care whether it's currently in memory or not, it just asks for a file (re: @dfinke 's indexing via the file system)
The -Cacheable functions are basically drop-in replacements for existing functions, which makes them easy to use (or opt out of) for other contributors
The -Cacheable functions are separate so that other contributors are not obligated to use or provide caching for some other data type that isn't yet in use; for example if they want to load a .json file they don't need to implement Import-CacheableJson right now (or ever). We still get a contribution and someone else could even implement the cacheable part (again, or not!)
It allows for arbitrary transformations: this is helpful to either 1) use an existing dataset but change it on read for a limited purpose (I used that in the fortnite function) or 2) change it when you read it from the file and then write it to the cache so it only has to be done once. This idea hit me as I was looking for some data files in various languages and came upon the widely used hunspell format for spellcheckers. It lists words but also uses a suffix format at the end of some words. I thought these might be a good source of wordlists, even if we just chop off the suffix, and it would be great not to have to pre-process them before adding them to project (that would also make them more updatable) and this is a perfect application of doing something like -Transform { $_ -replace '/.*?$' } and then that's it.
Most importantly, the "extra" features are opt-in: they're there when you need them, ignore them when you don't and things just work like they do with Get-Content or Import-Csv or whatever.
I really made an effort to ensure that this was an easily consumable pattern that also is not forced onto anyone.

So all that being said, I want to directly address your suggestion to read all the files at once in the beginning, but first I have to talk about the culture-resolution logic, which is by-design completely independent.

I know I'm being a bit verbose here but if this code lands I figure this comment will be the basis for documentation of this anyway.

When I first started looking into how to do this, I read a bunch of resources on organizing localization resources, and one of those was Microsoft's document called Hierarchical organization of resources for localization. This was the inspiration for the file structure I chose (even though they're talking about in-process resources), and I'll briefly explain why:

Cultures are separated into language (neutral culture) and specific-culture, so something like en-US is English language in the United States, as opposed to en-CA (English in Canada).

The recommendation they have is taking advantage of this hierarchy so that you get good fallback when you don't have something culture-specific but you can provide their language. If someone is running on the en-CA culture we don't want to spit out an error saying "sorry we only have nouns in American English" (of course), and rather than duplicate the en-US files to every other possible en- culture (there are a lot of them) we should take advantage of the filesystem's hierarchical nature.

Brief note: a neutral-culture is also a valid culture, so setting your program's culture to just en is a definitely possibility.

This has some cool implications for us if we resolve culture at the file level, because it means (to start) supporting a more specific culture doesn't have to be all-or-nothing once we create the specific directory.

I demonstrated this with the colors.txt file. You'll note it's the only file in en-GB right now, and in fact the list is the same as the one in en, with the sole exception of the spelling of grey vs gray.

But the great thing about that is that you can do ig 'I once saw a [color] [noun]' and if you're in en-GB, you'll get the British colors and the American nouns because we don't have British nouns yet.

Further, because culture becomes a parameter to a template function, you can mix cultures regardless of what your app/thread is set to: ig 'I heard [color] is written as [color ja] in Japanese'

The downside of this hierarchy, is perhaps that this logic could be seen as "complex", but I would say, that's why it's wrapped in a Resolve-Path style call, again making it super easy for the implementer of a template function. Just 1) accept a culture parameter (copy paste from another), 2) ask for the file you want (don't think about paths), and 3) pass that culture right through. You get a culture-resolved full path that you then do whatever you want with. And it is again something that an implementer could ignore completely and read their own file just to quickly contribute something, if it were really a dealbreaker.

Some of this stuff will end up being really important in trying to support languages that don't use alphabets, I think, but again that part isn't fully fleshed out.

So, the reason I say all that is that your suggestion to read all the files at the beginning would break a few things on both the caching side (I don't really care too much), and the culture logic side (bigger deal).

Memory isn't a concern, and we could just gci -Recurse the whole cultures path and read them in, but we would then be making the following decisions:

The module would be enforcing which file types are acceptable (because we have to know on module load), so an author who wants to use a new format, has to put the data-reading logic in this initialization function, which I think is messier and more importantly, a barrier to entry.
Transforms, which should generally be specific to the template function, have to be pushed to initialization, introducing coupling.
Culture resolution, to support the already existing features has to be rewritten to index into the cache now; you suggested using the culture string like en-US as part of the key, but to support fallback language support, we have to do the same processing as before, so in essence we would just do the same path resolution, but instead of letting the caller do what they want with the path we're basically forced to just take it from cache. It's the same amount of work, pushed to a different place in the code path, but offers less flexibility (IMO).
Alternatively, we don't support language fallback, or we support a gimped version where any match that's not exact throws an error, or always gives US English (I didn't mention it before but any language that doesn't exist currently falls back en which we've (I've) decides is US English). I think that'd be a shame, as we'd lose both the "transparent" override ability (like the en-GB colors) and the ability to mix cultures or request a specific one in a specific call a la ig 'My favorite color in Japanese is [color ja]'.

So after all that, I'll try to summarize:

I think the "caching" I've put here is much simpler than it seems at first, and probably aligns more with @cdhunt 's ideas than it looks.
Although I think it would be unfortunate, scrapping everything cache-wise if @dfinke doesn't want it wouldn't be the end of the world and is almost completely a find/replace; easy to do.
Culture resolution would suffer most if we tried to cache everything at the beginning, and I'd really prefer not to abandon those features.

Sorry a lot of this was missing earlier, and I hope this sheds some light on the design decisions and makes them a bit less scary or confusing. Still happy to discuss further!

briantist commented 5 years ago

Also just want to say I love the Show-Cultures idea, maybe coupled with a Show-Languages since that will be more relevant to most people (or it would just be a parameter, whatever). That would be cool, and I could pretty easily implement it now as well.

briantist commented 5 years ago

Also I ran some benchmarks and to my surprise, even just reads with Get-Content really are a low slower:

C:\Scripts\NameIT [master]> 1..10 | % { Measure-Command -Expression { 1..1000 | % { $z = Get-Content -Path .\cultures\en\names.csv } } } | ft

Days Hours Minutes Seconds Milliseconds
---- ----- ------- ------- ------------
0    0     0       8       332
0    0     0       8       215
0    0     0       8       418
0    0     0       8       299
0    0     0       8       304
0    0     0       8       177
0    0     0       8       299
0    0     0       8       274
0    0     0       8       306
0    0     0       8       170

C:\Scripts\NameIT [master]> 1..10 | % { Measure-Command -Expression { 1..1000 | % { $z = Get-CacheableContent -Path .\cultures\en\names.csv } } } | ft

Days Hours Minutes Seconds Milliseconds
---- ----- ------- ------- ------------
0    0     0       0       364
0    0     0       0       296
0    0     0       0       298
0    0     0       0       318
0    0     0       0       310
0    0     0       0       304
0    0     0       0       303
0    0     0       0       335
0    0     0       0       321
0    0     0       0       321

Import-Csv vs Import-CacheableCsv were only slightly more dramatic (about 11-12 seconds vs ~300 ms).

Of course this is a lot more iterations than a normal use, so if the consensus is still "rip out cache" then so be it, I was just surprised there was a bigger difference between gc and ipcsv

dfinke / NameIT

Culture revamp, cacheable resources, small tweaks and fixes #20