Just to recap on the role played by the BOM in UTF-8 files, to ensure we're on the same wavelength here.
My argument in favour of always using a BOM for UTF-8 files is mostly due to the transition from ISO encoded ALAN files to the new UTF-8 support feature.
Because sources containing only ASCII chars are identical in ISO-8859-1 and UTF-8 (without BOM), the risk is that such files might get corrupted when non-ASCII chars are added to them — taking on either ISO or UTF-8 encoding, depending on the editor's settings and/or expectation for ALAN related files.
In the Sublime ALAN package, if I drop ISO encoding as the default encoding for all ALAN files (.alan, .i, .a3s, .a3t), I will expose myself to the risk of corrupting any files with only ASCII chars, for they would be considered as UTF-8 by default. Since I still need to work with both ISO and UTF-8 files, due to the fact that most projects are still in ISO, I'm relying on the presence of a BOM to enforce UTF-8 encoding, and will consider as ISO encoded any files that don't have a BOM or a UTF-8 encoded character sequence.
I recently had to deal with similar problems in a big project containing .txt documentation files in English, French and German, where the expected encoding was ISO-8859-1. Most files became corrupted when users started to contribute changes: some were being committed as UTF-8, others as Windows-1252, and others with some other ISO encoding, depending on the contributors' editor settings and native OS locale.
When we realized the presence and size of the problem, it was too late: the documentation would produce corrupted chars when converted (HTML, PDF, CHM), and fixing the problem with iconv was no longer possible, because the same files had been re-encoded multiple times with different encodings, until they no longer matched any known encoding.
Since the great bulk of ALAN's legacy is all in ISO encoding, and the transition to UTF-8 is going to happen mostly in our own repositories and projects (surely, all the ALAN sources in the wild are not going to change, e.g. in the IF Archive, IFDB, etc.), we're most likely going to face a very long transition where both encodings will have to be handled in our editors.
Hence my emphasis on using the BOM, and at least ensuring that it can be safely used with any ALAN source file and script. Transcripts might be the exception, since they are not sources but generated files, so technically speaking no one should be manually editing them. But IMO, having an option to enforce the BOM even on generated transcripts might be useful in some special contexts.
BOM detection on solution/script files has not been implemented in the interpreter. I think I actually forgot about that in my list of steps, so thanks for this reminder.
And, yes, there is no BOM generated in interpreter output files. Actually not for auxiliary output files from the compiler either, like the compilation listing file (which might actually be the only one).
I think you make a good point about the transition period and how easy it is to slip up when you are jumping from one setting to another. So I agree that more failsafes are something good, and even though the BOM is often not present in environments where it is not "needed", it would hopefully not hurt to have it.
I'm guessing that some old editors might have problems, but there are many that support it that we can point to, if someone has an issue with that.
I hadn't thought of that, as I don't recall using any Unicode aware editor that failed handling the BOM, but it's indeed possible.
Notepad++ is a nice and very popular FOSS editor that handles encodings and BOMs well; it could be a good candidate to suggest to non-programmer end users who need a general purpose text editor that's not too hard to learn to use, yet has enough features to provide a neat editing experience. Also, it should be fairly easy to create a simple ALAN plug-in for Npp, providing some basic highlighting, file extension associations and a simple build system. I'll look into it.
There is a fix for BOM handling in transcripts and solution files upcoming in the running CI build.
If -u is used, generated transcript and solution files will have a BOM. No need for a special BOM option.
For solution files that are read using the @ meta command, the interpreter will look for a BOM and if found (possibly) switch to UTF encoding for the extent of that file and of course skip over the BOM.
Note however, that this does not apply to solution files with a BOM that is piped directly as input to the interpreter. Looking for a BOM in this situation does not match up with how the code is structured and I haven't figured out a way to do that without some substantial restructuring of the code handling command input yet.
OK. This shouldn't be a problem, I can always use the dedicated ARun switch to feed the solution files; what really matters is being able to use redirection for the generated transcripts, since I need to bypass the time-stamp being added to the transcript file (since they are tracked by the test suites, as well as documentation repositories that need them for inclusion in the AsciiDoc sources).
since I need to bypass the time-stamp being added to the transcript file
You do know about the -r option? (https://alan-if.github.io/alan-docs/manual-beta/manual.html#_interpreter_switches)
Ok, so if a file is piped into it, the interpreter would not know about it. It can't differentiate between a human and a piped file.
But if it was a UTF BOM file then the first input line would contain the BOM. It is unlikely that a human would enter those characters, especially on the first command.
Also, as the input reaches the end of file the game quits. So there is no need to know when the file ended and to switch back to "human" mode.
I think looking for a BOM as the first three characters of input and switching to UTF-8 would be doable. I'll make an attempt at that.
You do know about the -r option?
I wasn't aware it also removes time stamps! I actually always use that option in my toolchains and test suites.
I think looking for a BOM as the first three characters of input and switching to UTF-8 would be doable. I'll make an attempt at that.
That would be a great addition. Although your reasoning that humans wouldn't type a BOM is sound, many end users might be feeding solutions with a BOM via pipes and redirections for various reasons, e.g. third party tools that emulate or automate game sessions (e.g. a tool like Inform 7's Skein), if they rely on files which have a BOM.
many end users might be feeding solutions with a BOM via pipes and redirections for various reasons, e.g. third party tools that emulate or automate game sessions (e.g. a tool like Inform 7's Skein), if they rely on files which have a BOM.
I'm not sure I understand what you mean here.
Do you mean that the sequence of tools would introduce a BOM in the flow? If so, they shouldn't do that unless it is a UTF-8 flow, and then it doesn't matter since the interpreter will still switch to UTF-8 for that input.
Do you mean that the sequence of tools would introduce a BOM in the flow?
Possibly, it depends on the language being used, and how it handles (or doesn't handle) BOMs found in external files (i.e. whether its native file I/O is able to autodetect and strip a BOM, or whether it just passes it along). The point is that there's always a possibility that solution files will be piped/redirected as they are (BOM included) by a toolchain.
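Ruby is a good example of this (just as an illustration, with a hypothetical file name): the BOM is passed along as part of the string unless you explicitly ask for it to be consumed while reading.

# Small illustration (hypothetical file name): by default the BOM ends up in
# the string; it is only consumed if explicitly requested via 'bom|utf-8'.
text = File.read('solution.a3s', mode: 'r:utf-8')
text.start_with?("\uFEFF")   # => true when the file begins with a BOM

text = File.read('solution.a3s', mode: 'r:bom|utf-8')
text.start_with?("\uFEFF")   # => false, the BOM was consumed while reading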
If so, they shouldn't do that unless it is a UTF-8 flow
Not necessarily, I've worked with languages that store strings internally as UCS-2, and they don't strip out a BOM from external UTF-8 files unless you explicitly invoke a native function to do so (i.e. if you don't, the BOM will show up as odd chars in the final string). Also, the BOM stripping function was added to the language at a later time, although reading UTF-8 files was already supported via additional parameters.
It's hard to tell how different languages deal with encodings. Surely, the big languages tend to handle encodings fairly well (but not so well when it comes to legacy encodings, for they usually treat ISO-8859-1 and Windows-1252 as being identical, although they are not — largely due to a recent encodings library from the Node JS world, which has gained traction and was ported to other languages as well).
Truth is, encodings and EOLs are a mess, and you simply can't expect that any tool or language handles them properly. Especially EOLs: there are so many cross-platform tools, developed on Linux and cross-compiled to Windows, which contain edge-case bugs for CRLF. Often legacy encoding and EOL features are added for the sake of completeness, but rarely used by their developers, which is why bugs often slip through and are not covered by test suites.
Right. Yes, other tools might do wrong things, but I have no intention of supporting every combination of random tools that have bugs in them ;-) If someone puts together such a tool-chain and it does not work, I'll try to make sure that the Alan tools do what they are supposed to do.
And in this particular case the interpreter is accepting human input, or assuming that the file read by @ or piped as standard input follows the text file conventions of the platform.
If a toolchain creates something that does not match that prerequisite/assumption then there is not much the interpreter can do.
But I suppose that indicates that we need to be very clear about those assumptions in the documentation. I'll make a note of that.
-r Switch Is Not Used in Test Suites
I've tested the -l and -r switches, and now I remembered why I don't/can't use them in most repositories' toolchains and test suites.
The problem is that I have multiple solution files for each adventure (one adventure, multiple individual tests), and the generated transcript filenames must match that of the solution file, not the compiled adventure. The -l option doesn't accept a filename to enforce the name of the generated transcript, so using this switch will make all test transcripts overwrite each other, since they'll all be named like the .a3c file.
OK, I've tested the new Build 2218 and the first command now works, but the problem is that the BOM still makes its way into the generated transcript (in the middle of the transcript, where the first command is injected):
without -u:
> x table
I don't know the word 'ï'.
using -u:
> <0xfeff>x table
I don't know the word 'ï'.
both with and without -u:
> <0xfeff>x table
It's an old wooden table. On the table you can see a cake.
As you can see, the BOM is found preceding the first command of the solution file — i.e. it's stripped (or ignored) by the parser but not from the transcript string.
Do you have example files that I can use to re-create this problem? Or more explicit re-creation instructions?
Here's the folder with the ALAN source, command scripts and all:
https://github.com/alan-if/Alan-Testbed/tree/master/utf8/run-ascii
That doesn't happen to me. It looks like CI missed a few pushes. May I ask you to retry with the latest build again (now 2220)?
I've downloaded build 2220, same results though.
Finally, I could reproduce this. I tried fresh builds on Linux and Cygwin, downloaded the 2220 arun and used that in Cygwin, Msys and Git Bash terminals, but didn't see this problem.
But after re-running the exact same command many, many times it appeared once, and after a while it appears more often.
So it's a random problem, probably an uninitialized variable. So now I know where to look. Also, it only happens with the arun cross-compiled on Linux in the CI environment, so that makes it a bit more cumbersome to work with.
I'll get back to you when I think I have solved it.
Sneaky! The tests.sh script does not create transcripts, it captures the terminal output using pipes. Here's the output of a session with some extra printouts:
$ arun-ci -r -l kitchen-ascii < input_utf8-bom.a3s
Kitchen
A small and cosy house Kitchen. There is a table here. On the table you
can see a cake.
> firstInput = true
x table
buffer: 'x table'
buffer[0-2]: 0xef, 0xbb, 0xbf
Have BOM
buffer: 'x table'
converted: 'x table'
It's an old wooden table. On the table you can see a cake.
The first line with x table that also includes the BOM is echoed in the terminal from the piped file input. The rest clearly shows that arun does what it can to remove the BOM. It is also not present if you actually create a transcript.
So, with this setup it is the echoing from the terminal that is captured in the output file. This is highly environment dependent on the run-time and the terminal and I'm not sure what to do about it.
Theoretically it should be possible to inhibit all terminal echoing and handle all input echoing explicitly, but that is not how command line arun works right now. I'll investigate if that is possible in a cross-platform manner and how much work that would be.
In the meantime I suggest using arun's transcript capability combined with the -r switch (that we discussed earlier). So instead of
arun kitchen-ascii.a3c < $solF > $trnF
you would do
arun -r -l kitchen-ascii.a3c < $solF
cp kitchen-ascii.a3t $trnF
With this you also don't get the status line control codes that are now present in the output. You can also remove that with -n (no status line), but since creating a true transcript is the way to go, you don't need to handle that.
Got it @thoni56! I now understand what the problem is, and I personally think that ARun shouldn't be altered in this respect but keep abiding to the underlying echoing rules of the host OS.
In the meantime I suggest using arun's transcript capability combined with the -r switch (that we discussed earlier).
Unfortunately, that's not a viable solution in most of my toolchains and test suites, for the reasons explained earlier.
you would do
arun -r -l kitchen-ascii.a3c < $solF
cp kitchen-ascii.a3t $trnF
The problem with the above is that often there'll be a solution file with the same basename of the adventure (e.g. test.a3c, test.a3s, test-actors.a3s), and the "generate and copy/rename" strategy would overwrite the generated transcript with the same name.
The ideal solution would be to add to ARun a new switch that allows specifying a custom transcript filename, e.g. --transcript-name <filename[.a3t]> (where the extension is optional, and assumed to be .a3t if left unspecified).
That would solve the practical problems I'm facing now.
Failing that, the only solution I can come up with right now is to introduce a custom tool that strips the BOM from the solution file before feeding it to ARun. The downside of this approach is that, beside adding an extra dependency (unless I use SED), it's going to considerably slow down the toolchains' execution time.
Right now, the only thing preventing me from switching projects to use UTF-8 is this BOM injection issue.
Just to recap my overall needs and strategy:
- All manually edited ALAN files (*.alan, *.i, *.a3s solutions) need to be encoded in UTF-8-BOM, to ensure that they are not autodetected/treated as US-ASCII when they contain only ASCII chars, since later edits might introduce non-ASCII chars that could lead editors to switch to an unpredictable encoding (ISO-8859-?, Windows-1252, UTF-8, or something else) depending on the OS default locale, the editor and its default settings, and other factors at play.
- Auto-generated ALAN files (*.a3t transcripts) don't necessarily have to include a BOM, and it should be OK to have them as plain UTF-8. We should expect editors to correctly auto-detect their encoding when they're opened, and since they always come as "final" documents (i.e. no editing steps involved), there shouldn't really be any encoding problems here. Also, transcripts might be used in a variety of contexts (and with various types of tools) where UTF-8 files might not be expected to have a BOM.
- Being able to control the filename of ARun-generated transcripts is essential in the various test-suites and doc-building toolchains because any given folder might contain multiple ALAN adventures, and the solution files which are run against each adventure are filtered using the adventure basename, e.g.:
house.alan → house.a3c ← house.a3s ← house-npcs.a3s
cave.alan → cave.a3c ← cave_meta.a3s ← cave_light.a3s ← cave_walkthrogh.a3s
As you can see from the above, the generated transcripts should match the filename of their solution files, not the adventure filename. The only way to currently achieve this is via redirection, but a new ARun switch could solve the problem elegantly.
So, I think that it's not advisable to interfere with the environment echoing process. On the other hand, if adding a new ARun switch to enforce a custom filename on the generated transcript is not a problematic change, that would be the ideal solution. Failing that, I'll have to find a way to strip the BOM from solution files, on-the-fly, without slowing down execution too much.
Any thoughts and advice?
Got it @thoni56! I now understand what the problem is, and I personally think that ARun shouldn't be altered in this respect but keep abiding to the underlying echoing rules of the host OS.
In the meantime I suggest using arun's transcript capability combined with the -r switch (that we discussed earlier).
Unfortunately, that's not a viable solution in most of my toolchains and test suites, for the reasons explained earlier.
you would do
arun -r -l kitchen-ascii.a3c < $solF
cp kitchen-ascii.a3t $trnF
The problem with the above is that often there'll be a solution file with the same basename of the adventure (e.g. test.a3c, test.a3s, test-actors.a3s), and the "generate and copy/rename" strategy would overwrite the generated transcript with the same name.
I'm not pretending to know everything about all your tool-chains and tests, but isn't having a solution/transcript with the same name as the game in folders where you also have other solution files that should be run just a choice? I mean you could avoid solution files with the same name as the game file in those directories by using a convention for the "main" or "canonical" solution/transcript?
The ideal solution would be to add to ARun a new switch that allows specifying a custom transcript filename, e.g. --transcript-name <filename[.a3t]> (where the extension is optional, and assumed to be .a3t if left unspecified). That would solve the practical problems I'm facing now.
I hear you. The interpreter currently has a very simple and crude "one character" option mechanism. Although copying some other library/module/code for option handling would be a "quick fix", it would probably change the API for options handling, the help output etc., which would incur some work in implementation, tests and documentation.
Failing that, the only solution I can come up with right now is to introduce a custom tool that strips the BOM from the solution file before feeding it to ARun. The downside of this approach is that, beside adding an extra dependency (unless I use SED), it's going to considerably slow down the toolchains' execution time.
If you know there's a BOM, you could also use tail.
A few tests show that sed-ing away the BOM (as suggested here) from the solution file of 84 lines takes around half the time of running that game with the file as input, as does tail -4. So if you only do those two things the execution time will be about 50% longer.
OTOH, a simple mv seems to take approximately the same amount of time, so most of that is probably process startup.
But then you also have the scripting and other steps in your toolchain so I'm guessing the difference will be measurable, but probably not noticeable to the human eye (again, I don't know the details of your toolchains).
I know of no way to just chop off three bytes from the beginning of a file without reading it. Not even with low-level file I/O.
(But as the saying goes, "Don't optimize what you haven't measured to be a problem." So you should probably set up real measurements before trying to optimize the execution times.)
Right now, the only thing preventing me from switching projects to use UTF-8 is this BOM injection issue.
Just to recap my overall needs and strategy:
- All manually edited ALAN files (*.alan, *.i, *.a3s solutions) need to be encoded in UTF-8-BOM, to ensure that they are not autodetected/treated as US-ASCII when they contain only ASCII chars, since later edits might introduce non-ASCII chars that could lead editors to switch to an unpredictable encoding (ISO-8859-?, Windows-1252, UTF-8, or something else) depending on the OS default locale, the editor and its default settings, and other factors at play.
This BOM-for-the-future is an interesting detail that I hadn't thought about before.
- Auto-generated ALAN files (*.a3t transcripts) don't necessarily have to include a BOM, and it should be OK to have them as plain UTF-8. We should expect editors to correctly auto-detect their encoding when they're opened, and since they always come as "final" documents (i.e. no editing steps involved), there shouldn't really be any encoding problems here. Also, transcripts might be used in a variety of contexts (and with various types of tools) where UTF-8 files might not be expected to have a BOM.
- Being able to control the filename of ARun-generated transcripts is essential in the various test-suites and doc-building toolchains because any given folder might contain multiple ALAN adventures, and the solution files which are run against each adventure are filtered using the adventure basename, e.g.:
house.alan → house.a3c ← house.a3s ← house-npcs.a3s
cave.alan → cave.a3c ← cave_meta.a3s ← cave_light.a3s ← cave_walkthrogh.a3s
As you can see from the above, the generated transcripts should match the filename of their solution files, not the adventure filename.
Yes, I understand that. The first line is actually two different "cases", "house" and "house-npcs", and the second line represents three cases where none actually have the same name as the game, so for them my suggested solution would work:
$ arun cave -r -l < cave_meta.a3s
$ mv cave.a3t cave_meta.a3t
$ arun cave -r -l < cave_light.a3s
$ mv cave.a3t cave_light.a3t
$ arun cave -r -l < cave_walkthrough.a3s
$ mv cave.a3t cave_walkthrough.a3t
So would the first if you just change "house.a3s" to "house-canon.a3s" or something.
The only way to currently achieve this is via redirection, but a new ARun switch could solve the problem elegantly.
So, I think that it's not advisable to interfere with the environment echoing process. On the other hand, if adding a new ARun switch to enforce a custom filename on the generated transcript is not a problematic change, that would be the ideal solution. Failing that, I'll have to find a way to strip the BOM from solution files, on-the-fly, without slowing down execution too much.
Any thoughts and advice?
As mentioned, I would revisit the naming strategy for solution files/test cases and see if it was not possible to remove this obstacle to using the currently available mechanisms. But I don't know how many files and scripts need to be changed to do that, of course.
(I have the same need for the regression tests for alan but use Jregr to do that, so I have no scripts to maintain, only the test cases themselves, as Jregr picks them up automatically. And compares actual output to expected. Granted, I know some of your toolchains are not just tests.)
I'm not pretending to know everything about all your tool-chains and tests, but isn't having a solution/transcript with the same name as the game in folders where you also have other solution files that should be run just a choice? I mean you could avoid solution files with the same name as the game file in those directories by using a convention for the "main" or "canonical" solution/transcript?
It's partly a choice and partly a strategy, e.g. in some documentation projects (e.g. the StdLib Manual) where code examples and their resulting transcripts are auto-generated from real ALAN code, the naming conventions are a bit stricter because they carry meaning — e.g. in terms of which source examples should be sanitized and bundled into the final package, and which ought to be ignored, but also because the ADoc sources rely on ADoc attributes for the base filename to include::[] both the ALAN source snippets and the resulting transcripts, so I'll have to tweak the attributes in some cases, if a suffix is added to the base-name. But it shouldn't be a huge problem, and most likely documenting these changes is going to take more time than actually implementing them.
But yes, I could probably fix this by enforcing an additional suffix to all solution files (e.g. adventure.a3c ← adventure-sol_*.a3s).
I hear you. The interpreter currently has a very simple and crude "one character" option mechanism. Although copying some other library/module/code for option handling would be a "quick fix", it would probably change the API for options handling, the help output etc., which would incur some work in implementation, tests and documentation.
Got it. If it's too much work and risks entangling the code then it might not be worth it.
If you know there's a BOM, you could also use tail.
I confirm that tail also ships with the Bash of the Git for Windows package, so that's a viable solution. Bear in mind that I'm considering migrating all toolchain builds from Bash scripts to Ruby Rake — Rake is so much better than Make and shell scripting, and we have the Ruby lang as a dependency in almost all our repos, so it's reasonable to expect our users to have Ruby too. So far, all my local experiments with Rake have produced amazing results, and I'm really excited about using this great build tool.
So, in general, I'll be looking for solutions that are independent of Bash or Linux specific tools and commands, although Rake does support most of them via a Ruby gem that provides equivalent APIs to most of these commands.
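For instance, stripping the BOM on the fly wouldn't need any external tool at all in Ruby. A minimal sketch (the helper and the file names are hypothetical, just to illustrate the idea):

# Minimal sketch (hypothetical file names): strip a leading UTF-8 BOM from a
# solution file before feeding it to ARun, with no external tools involved.
def strip_bom(src, dest)
  data = File.binread(src)
  data = data.byteslice(3, data.bytesize - 3) if data.bytes.first(3) == [0xEF, 0xBB, 0xBF]
  File.binwrite(dest, data)
end

# strip_bom('cave_meta.a3s', 'cave_meta-nobom.a3s')
# ...then feed the stripped copy to ARun via redirection as usual.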
A few tests show that sed-ing away the BOM (as suggested here) from the solution file of 84 lines takes around half the time of running that game with the file as input, as does tail -4. So if you only do those two things the execution time will be about 50% longer. OTOH, a simple mv seems to take approximately the same amount of time, so most of that is probably process startup.
Interesting. I thought that you could instruct sed to only act on specific line ranges (in our case, the 1st line only), hoping that would drastically reduce overhead times, but I guess that most of the overhead goes into the actual file I/O initializations and process handling, as you mentioned.
But then you also have the scripting and other steps in your toolchain so I'm guessing the difference will be measurable, but probably not noticeable to the human eye (again, I don't know the details of your toolchains).
Once I switch to Rake all these overhead problems will probably become negligible, since Rake (like Make) only re-builds when dependencies have changed. What's keeping me stuck right now is that I need to work out a solution for detecting branch switches, so that a Beta SDK is used on master/main and an Alpha SDK is used on dev branches. I'm not sure how to make Rake detect when the branch has changed since the last build, to force clearing its cache when this occurs, because different branches mean different ALAN SDKs, so everything has to be rebuilt (tracked files or otherwise). But I'll eventually find a solution to this.
(But as the saying goes, "Don't optimize what you haven't measured to be a problem." So you should probably set up real measurements before trying to optimize the execution times.)
That would be hard on Windows when using Bash, since everything tends to be very slow in the Bash terminal emulator, not to mention that it often hangs and needs restarting after so many operations. Which is another reason why I was hoping to move all building tasks to Rake (which is very fast).
So would the first if you just change "house.a3s" to "house-canon.a3s" or something.
As mentioned, I would revisit the naming strategy for solution files/test cases and see if it was not possible to remove this obstacle to using the currently available mechanisms.
I'll do that, the "house-canon.a3s" idea is good. As mentioned earlier, I'm also convinced that adding a suffix to the solution files in those projects which have stricter naming conventions should be possible. It's worth the effort, for I'm quite excited to embrace UTF-8 in the StdLib repository.
But I don't know how many files and scripts need to be changed to do that, of course.
That shouldn't be a problem really, especially in the StdLib project, which relies on Bash script functions defined in shared module scripts, so I'll only have to tweak a couple of build and deploy functions in a single file and they'll be effective on every build script; and then, of course, rename all solutions which have the adventure's base name, and amend the include::[] directives in ADoc sources for renamed transcripts.
I might have to add a new function for those folders which use different naming conventions for solutions and transcripts, but that's not too much of a big deal (surely better than bloating build execution times just to strip a BOM).
In any case, these changes are better to do now, while still using ISO encoding, than during/after the migration to UTF-8, for it's easier to test that they work correctly (re-running the tests and builds should produce no diffs in Git). After the UTF-8 migration they should still work without changes, thanks to ARun's auto-detection of UTF-8 BOM.
Closing this, as there seem to be no more UTF problems that we can see now, so we can get on with Beta8 release work ;-)
@thoni56, I wanted to update you on my current experiments on how to work around the problem of creating transcripts with different names than the base adventure storyfile name.
When it comes to using scripts which build all transcripts at once, there are various solutions which are all fairly simple. E.g. before invoking ARun, the script should check whether the .a3s file has the same base name as the .a3c file and either:
- process last the .a3s file with the same name as the .a3c, since its transcript won't then be overwritten;
- give every generated .a3t some prefix (e.g. ___), and after all files are processed check if a file ___<basename>.a3t exists and rename it removing the prefix.
Neither solution even requires using a temporary variable, and they don't hit performance either.
The real problem is going to be able to handle this in Rakefiles, because Rake doesn't usually iterate through every solution file but only tries to re-generate transcripts when their dependencies were updated.
This means that I'll probably need to delegate transcripts generation to a custom Ruby function, which would then have to ensure that an existing <basename>.a3t is not overwritten when re-generating a transcript.
Right now, I think the best solution would be for the function to check if a <basename>.a3t file exists, in which case it would have to temporarily rename it (e.g. adding the ___ suffix or changing its extension to .a3t_bak), then convert and rename the new transcript, and finally rename the former transcript to its original name again. This is definitely going to add some performance overhead to the process, especially if there are lots of transcripts, since it involves two extra file operations for each transcript being generated.
I was wondering if you have a better solution for this; after all, this problem would be similar in Make too, which also acts on a per-file basis, according to dependency updates. So you might have come across similar problems before.
As you are asking me, I'll answer from the perspective I have of similar problems, which is mostly regression testing. "Building" or "compiling" seldom has this situation.
Regression is fundamentally different from "building" since we have an "extra" dependent, the "program", SUT, software under test. Usually we run the program over all test cases every time, since we probably changed that, or want to make sure that all tests passed with the version we have. So there is an implicit dependency that we "know" has changed, the program. This invalidates all dependencies in the test suite.
So I have never tried doing this from Make. Anything similar usually ends up in a fairly complex sequence of shell commands which later becomes a separate script.
The above reason, and of course the fact that I could not find anything to my liking ;-), was the reason I started hacking on Jregr. I'm not necessarily promoting this instead of your approach (I'm not even sure it would help), again I think the difference between testing and building makes a huge difference on how I would approach this.
But with Jregr I run all 550 testcases for Alan in 2.5 seconds, so the performance is good enough to not be prohibitive to re-generate everything, I think. Unless I had thousands of cases and at least one third of them could be avoided using a dependency check, it would make little difference to me.
So the above was just to revisit the reasons for adding the dependency complexity and if the performance is worth it.
I feel that the suggested approach to do that programmatically, in some manner, is the choice I would have made too.
Usually we run the program over all test cases every time, since we probably changed that, or want to make sure that all tests passed with the version we have. So there is an implicit dependency that we "know" has changed, the program. This invalidates all dependencies in the test suite.
That's basically the current situation in the StdLib and StdLib Italian projects, where both the test suite and the "source transcripts" (i.e. auto-generated transcripts to be included as examples in the AsciiDoc documentation) are always re-run in bulk via scripts iterating through each ALAN source, compiling it and then running it against every solution file associated with it — but that's so mostly because I never found the time to set up a dependency-based system, as I'm hoping to do now.
So I have never tried doing this from Make. Anything similar usually ends up in a fairly complex sequence of shell commands which later becomes a separate script.
Hopefully in Rake this should be easier to achieve, thanks to file pattern rules (source → output, like in Make) and the fact that Ruby code can be interspersed inside and outside the Rake tasks to bend both the overall context as well as the single tasks to one's needs. In the small scale Rake test in this repo, for the Cloak sample adventure, it turned out fairly easy to ensure that the transcripts in the sample folder are rebuilt whenever the corresponding <transcript-name>.a3s solution file changes; the rest is handled by the other general rules, i.e. the storyfile being recompiled if the source adventure or any library files were changed, etc.
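Roughly, the shape of it is something like the following — a simplified sketch, not the actual Rakefile in the repo, with hypothetical file names and paths:

# Simplified Rakefile sketch (hypothetical names/paths): a <name>.a3t
# transcript depends on its <name>.a3s solution file and on the compiled
# storyfile, which in turn depends on the adventure source and library files.
ADVENTURE = 'cloak'
LIB_FILES = FileList['lib/*.i']

file "#{ADVENTURE}.a3c" => ["#{ADVENTURE}.alan", *LIB_FILES] do
  sh "alan #{ADVENTURE}.alan"
end

rule '.a3t' => ['.a3s', "#{ADVENTURE}.a3c"] do |t|
  # Generation command simplified here; see the transcript-renaming notes above.
  sh "arun -r #{ADVENTURE}.a3c < #{t.source} > #{t.name}"
end

desc 'Rebuild only the transcripts whose dependencies have changed'
task transcripts: FileList['*.a3s'].ext('.a3t')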
Thanks to Rake features and its tracing and dry-run options it was fairly easy to track the whole dependencies chain to ensure it was working as expected (of course, it can soon grow huge and hard to mentally track it all).
But with Jregr I run all 550 testcases for Alan in 2.5 seconds, so the performance is good enough to not be prohibitive to re-generate everything, I think. Unless I had thousands of cases and at least one third of them could be avoided using a dependency check, it would make little difference to me.
I have no experience with Jregr, but with the StdLib projects currently re-running all tests every time I'm experiencing much longer build times, well over 20 seconds per run (and the test suite is still very slim).
The worst part is updating the transcripts for the documentation, e.g. after having tweaked the source adventure of some code example I end up having to recompile all example adventures and rebuild all transcripts (a step which also involves extra operations like transcript sanitation for Asciidoctor, via SED). Here each run ends up slowing the editing process considerably, especially when there are many small tweaks and afterthoughts triggering the build over and over, hence my desire to use Rake to allow rebuilding only what really needs to be.
I agree that ultimately these types of tests and transcripts should be re-run entirely, just to be 100% sure that we (and our theoretically super-infallible Rakefile) didn't miss out any determining factor. Force building a specific task is quite easy in Rake, you just need to rake -B that task.
Rake also makes it easy to tweak the dependencies' status programmatically, by exposing status attributes which can be changed on the fly, as well as allowing introspection into the whole build process from within the running Rake session itself (possibly a bit harder to handle, but definitely possible). After all, the advantage of Rake over Make is that the former is just Ruby code running in an interpreter session, which allows for ample margin of intervention, thanks to the build tool's DSL being a full language, and Ruby being an interpreted language.
So the above was just to revisit the reasons for adding the dependency complexity and if the performance is worth it.
Obviously, when the test suite is small I could just force it to rebuild each time, which is still simpler than setting up all the dependencies. But in any case, once I adopt Rake in a project I'd like to drop all the original scripts that were being used before, to avoid double standards which might introduce errors and are harder to maintain. Also, with Rake we no longer have to worry about CMD vs Bash scripts, or face problems like with the asciidoctor-fopub toolchain, which required using batch scripts for Windows and bash scripts for macOS and Linux — Rakefiles will work on any supported OS, since Ruby handles most of the cross-platform OS details automatically, and thanks to the FileUtils library emulating many shell commands as Ruby functions.
But I think that as libraries grow, running tests and updating transcripts based on dependencies tracking is going to become a realistic need, because the waiting times are already starting to become noticeable (at least recompiling the source adventures should not be done unless strictly necessary, since that part takes longer than the ARun tasks for the transcripts).
I feel that the suggested approach to do that programmatically, in some manner, is the choice I would have made too.
So far it seems the only viable option, I can't really think of an alternative approach which would reduce file operations here. Obviously, avoiding creating transcripts with the same name as the base adventure would remove the problem entirely (as mentioned earlier, by enforcing at least a single suffix char), but different contexts demand different transcript naming conventions, and I'm afraid that creating different rules to handle different contexts could entangle the Rakefiles (multiple rules for the same file patterns would introduce serious rule precedence issues, hard to track and trust).
Ultimately, I was hoping to create some reusable Rake modules especially designed for ALAN repositories. So, instead of creating ad hoc Ruby classes and functions tailored around the needs of specific repositories, I was hoping to create general purpose functions able to handle the needs of most repositories, where special behaviour could either be activated via custom parameters, or the function being smart enough to deduce the contextual needs by inspecting the source folder and its contents.
As a final note, it's worth mentioning that Rake has some limitations when invoking the sh() function: its behaviour depends on and varies with the current OS and shell being used (e.g. on Windows, different results are to be expected when using CMD, Bash or PowerShell). So, whenever possible, it's best to rely on Ruby functions for any complex command invocation.
Strange as it might sound, there doesn't seem to be any simple and bullet-proof way to determine under which OS Ruby is running — you might have noticed the hack I inserted at the beginning of the Rakefile to obtain OS info. There is indeed a dedicated gem for detecting the OS, but that would add a dependency which IMO seems absurd to need in the first place. In other words, determining the current OS in Ruby requires environment hacks like those found in shell scripts, which also need to take into account MSYS, MSYS2, CygWin, etc. (hard to believe that a simple OS constant isn't being defined in the Ruby interpreter at compile time).
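For the record, the typical gem-free version of that kind of hack boils down to a string match on RbConfig's host_os — a sketch, not necessarily the exact hack in the Rakefile:

# Still essentially environment sniffing, but it only relies on the standard
# library (rbconfig ships with Ruby) and covers the MSYS/Cygwin cases too.
require 'rbconfig'

def host_os
  case RbConfig::CONFIG['host_os']
  when /mswin|mingw|cygwin|msys/i then :windows
  when /darwin/i                  then :macos
  when /linux/i                   then :linux
  else                                 :unknown
  end
end

puts host_os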
ARun is not consuming the BOM when the -u switch is explicitly set, causing the BOM character to slip through as part of the first player input command emitted. See the output_utf8-bom_-u.a3t transcript, which contains the BOM.
The same problem is seen in output_utf8-bom.a3t, which is generated using the same UTF-8 BOM solution file, but without passing the -u switch to ARun; except that we don't see the encoding problem suggestion as above. Here the BOM is seen as corrupt chars instead of as a binary entity, but that's only because the file is seen as ISO-8859-1 by my editor, whereas the previous one is detected as UTF-8. Maybe this has to do with the byte order of these chars? This might explain the difference seen with -u and without it.
BOM sanitation should be applied by ARun when auto-detecting UTF-8 solution files, as well as when passing the -u switch.
I also noticed that when passing a UTF-8 + BOM file, or using the -u switch, the generated transcript (redirected to file) will be in UTF-8, but always without a BOM. Probably this makes sense, since who would need a BOM in the generated file anyhow? But I was wondering if for consistency's sake it might be good to have an extra option to enforce the BOM on generated transcripts — I'm just thinking aloud here, and have no particular edge- or use-case in mind.