Just to recap on the role played by the BOM in UTF-8 files, to ensure we're on the same wavelength here.
My argument in favour of always using a BOM for UTF-8 files is mostly due to the transition from ISO encoded ALAN files to the new UTF-8 support feature.
Because sources containing only ASCII chars are identical in ISO-8859-1 and UTF-8 (without BOM), the risk is that such files might get corrupted when non-ASCII chars are added to them — taking on either ISO or UTF-8 encoding, depending on the editor's settings and/or expectation for ALAN related files.
In the Sublime ALAN package, if I drop ISO encoding as the default encoding for all ALAN files (.alan, .i, .a3s, .a3t), I will expose myself to the risk of corrupting any files with only ASCII chars, for they would be considered as UTF-8 by default. Since I still need to work with both ISO and UTF-8 files, due to the fact that most projects are still in ISO, I'm relying on the presence of a BOM to enforce UTF-8 encoding, and will consider as ISO encoded any files that don't have a BOM or a UTF-8 encoded character sequence.
I recently had to deal with similar problems in a big project containing .txt documentation files in English, French and German, where the expected encoding was ISO-8859-1. Most files became corrupted when users started to contribute changes: some were being committed as UTF-8, others as Windows-1252, and others with some other ISO encoding, depending on the contributors' editor settings and native OS locale.
When we realized the presence and size of the problem, it was too late: the documentation would produce corrupted chars when converted (HTML, PDF, CHM), and fixing the problem with iconv was no longer possible, because the same files had been re-encoded multiple times with different encodings, until they no longer matched any known encoding.
Since the great bulk of ALAN's legacy is all in ISO encoding, and the transition to UTF-8 is going to happen mostly in our own repositories and projects (surely, all the ALAN sources in the wild are not going to change, e.g. in the IF Archive, IFDB, etc.), we're most likely going to face a very long transition where both encodings will have to be handled in our editors.
Hence my emphasis on using the BOM, and at least ensuring that it can be safely used with any ALAN source file and script. Transcripts might be the exception, since they are not sources but generated files, so technically speaking no one should be manually editing them. But IMO, having an option to enforce the BOM even on generated transcripts might be useful in some special contexts.
BOM detection on solution/script files has not been implemented in the interpreter. I think I actually forgot about that in my list of steps, so thanks for this reminder.
And, yes, there is no BOM generated in interpreter output files. Actually not for auxiliary output files from the compiler either, like the compilation listing file (which might actually be the only one).
I think you make a good point about the transition period and how easy it is to slip up when you are jumping from one setting to another. So I agree that more failsafes are something good, and even though the BOM is often not present in environments where it is not "needed", it would hopefully not hurt to have it.
I'm guessing that some old editors might have problems, but there are many that support it that we can point to, if someone has an issue with that.
I hadn't thought of that, as I don't recall using any Unicode aware editor that failed handling the BOM, but it's indeed possible.
Notepad++ is a nice and very popular FOSS editor that handles encodings and BOMs well; it could be a good candidate to suggest to non-programmer end users who need a general purpose text editor that's not too hard to learn to use, yet has enough features to provide a neat editing experience. Also, it should be fairly easy to create a simple ALAN plug-in for Npp, providing some basic highlighting, file extension associations and a simple build system. I'll look into it.
There is a fix for BOM handling in transcripts and solution files upcoming in the running CI build.
If -u is used, generated transcript and solution files will have a BOM. No need for a special BOM option.
For solution files that are read using the @ meta command, the interpreter will look for a BOM and if found (possibly) switch to UTF encoding for the extent of that file and of course skip over the BOM.
Note however, that this does not apply to solution files with a BOM that is piped directly as input to the interpreter. Looking for a BOM in this situation does not match up with how the code is structured and I haven't figured out a way to do that without some substantial restructuring of the code handling command input yet.
OK. This shouldn't be a problem, I can always use the dedicated ARun switch to feed the solution files; what really matters is being able to use redirection for the generated transcripts, since I need to bypass the time-stamp being added to the transcript file (since they are tracked by the test suites, as well as documentation repositories that need them for inclusion in the AsciiDoc sources).
since I need to bypass the time-stamp being added to the transcript file
You do know about the -r option? (https://alan-if.github.io/alan-docs/manual-beta/manual.html#_interpreter_switches)
Ok, so if a file is piped into it, the interpreter would not know about it. It can't differentiate between a human and a piped file.
But if it was a UTF BOM file then the first input line would contain the BOM. It is unlikely that a human would enter those characters, especially on the first command.
Also, as the input reaches the end of file the game quits. So there is no need to know when the file ended and to switch back to "human" mode.
I think looking for a BOM as the first three characters of input and switching to UTF-8 would be doable. I'll make an attempt at that.
You do know about the -r option?
I wasn't aware it also removes time stamps! I actually always use that option in my toolchains and test suites.
I think looking for a BOM as the first three characters of input and switching to UTF-8 would be doable. I'll make an attempt at that.
That would be a great addition. Although your reasoning that humans wouldn't type a BOM is sound, many end users might be feeding solutions with a BOM via pipes and redirections for various reasons, e.g. third party tools that emulate or automate game sessions (e.g. a tool like Inform 7's Skein), if they rely on files which have a BOM.
many end users might be feeding solutions with a BOM via pipes and redirections for various reasons, e.g. third party tools that emulate or automate game sessions (e.g. a tool like Inform 7's Skein), if they rely on files which have a BOM.
I'm not sure I understand what you mean here.
Do you mean that the sequence of tools would introduce a BOM in the flow? If so, they shouldn't do that unless it is a UTF-8 flow, and then it doesn't matter since the interpreter will still switch to UTF-8 for that input.
Do you mean that the sequence of tools would introduce a BOM in the flow?
Possibly, it depends on the language being used, and how it handles (or doesn't handle) BOMs found in external files (i.e. whether its native file I/O is able to autodetect and strip a BOM, or whether it just passes it along). The point is that there's always a possibility that solution files will be piped/redirected as they are (BOM included) by a toolchain.
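Ruby is a good example of this (just as an illustration, with a hypothetical file name): the BOM is passed along as part of the string unless you explicitly ask for it to be consumed while reading.

# Small illustration (hypothetical file name): by default the BOM ends up in
# the string; it is only consumed if explicitly requested via 'bom|utf-8'.
text = File.read('solution.a3s', mode: 'r:utf-8')
text.start_with?("\uFEFF")   # => true when the file begins with a BOM

text = File.read('solution.a3s', mode: 'r:bom|utf-8')
text.start_with?("\uFEFF")   # => false, the BOM was consumed while reading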
If so, they shouldn't do that unless it is a UTF-8 flow
Not necessarily, I've worked with languages that store strings internally as UCS-2, and they don't strip out a BOM from external UTF-8 files unless you explicitly invoke a native function to do so (i.e. if you don't, the BOM will show up as odd chars in the final string). Also, the BOM stripping function was added to the language at a later time, although reading UTF-8 files was already supported via additional parameters.
It's hard to tell how different languages deal with encodings. Surely, the big languages tend to handle encodings fairly well (but not so well when it comes to legacy encodings, for they usually treat ISO-8859-1 and Windows-1252 as being identical, although they are not — largely due to a recent encodings library from the Node JS world, which has gained traction and was ported to other languages as well).
Truth is, encodings and EOLs are a mess, and you simply can't expect that any tool or language handles them properly. Especially EOLs: there are so many cross-platform tools, developed on Linux and cross-compiled to Windows, which contain edge-case bugs for CRLF. Often legacy encoding and EOL features are added for the sake of completeness, but rarely used by their developers, which is why bugs often slip through and are not covered by test suites.
Right. Yes, other tools might do wrong things, but I have no intention of supporting every combination of random tools that have bugs in them ;-) If someone puts together such a tool-chain and it does not work, I'll try to make sure that the Alan tools do what they are supposed to do.
And in this particular case the interpreter is accepting human input, or assuming that the file read by @ or piped as standard input follows the text file conventions of the platform.
If a toolchain creates something that does not match that prerequisite/assumption then there is not much the interpreter can do.
But I suppose that indicates that we need to be very clear about those assumptions in the documentation. I'll make a note of that.
-r Switch Is Not Used in Test Suites
I've tested the -l and -r switches, and now I remembered why I don't/can't use them in most repositories' toolchains and test suites.
The problem is that I have multiple solution files for each adventure (one adventure, multiple individual tests), and the generated transcript filenames must match that of the solution file, not the compiled adventure. The -l option doesn't accept a filename to enforce the name of the generated transcript, so using this switch will make all test transcripts overwrite each other, since they'll all be named like the .a3c file.
OK, I've tested the new Build 2218 and the first command now works, but the problem is that the BOM still makes its way into the generated transcript (in the middle of the transcript, where the first command is injected):
without -u:
> x table
I don't know the word 'ï'.
using -u:
> <0xfeff>x table
I don't know the word 'ï'.
both with and without -u:
> <0xfeff>x table
It's an old wooden table. On the table you can see a cake.
As you can see, the BOM is found preceding the first command of the solution file — i.e. it's stripped (or ignored) by the parser but not from the transcript string.
Do you have example files that I can use to re-create this problem? Or more explicit re-creation instructions?
Here's the folder with the ALAN source, command scripts and all:
https://github.com/alan-if/Alan-Testbed/tree/master/utf8/run-ascii
That doesn't happen to me. It looks like CI missed a few pushes. May I ask you to retry with the latest build again (now 2220)?
I've downloaded build 2220, same results though.
Finally, I could reproduce this. I tried fresh builds on Linux and Cygwin, downloaded the 2220 arun and used that in Cygwin, Msys and Git Bash terminals, but didn't see this problem.
But after re-running the exact same command many, many times it appeared once, and after a while it appears more often.
So it's a random problem, probably an uninitialized variable. So now I know where to look. Also, it only happens with the arun cross-compiled on Linux in the CI environment, so that makes it a bit more cumbersome to work with.
I'll get back to you when I think I have solved it.
Sneaky! The tests.sh script does not create transcripts, it captures the terminal output using pipes. Here's the output of a session with some extra printouts:
$ arun-ci -r -l kitchen-ascii < input_utf8-bom.a3s
Kitchen
A small and cosy house Kitchen. There is a table here. On the table you
can see a cake.
> firstInput = true
x table
buffer: 'x table'
buffer[0-2]: 0xef, 0xbb, 0xbf
Have BOM
buffer: 'x table'
converted: 'x table'
It's an old wooden table. On the table you can see a cake.
The first line with x table that also includes the BOM is echoed in the terminal from the piped file input. The rest clearly shows that arun does what it can to remove the BOM. It is also not present if you actually create a transcript.
So, with this setup it is the echoing from the terminal that is captured in the output file. This is highly environment dependent on the run-time and the terminal and I'm not sure what to do about it.
Theoretically it should be possible to inhibit all terminal echoing and handle all input echoing explicitly, but that is not how command line arun works right now. I'll investigate if that is possible in a cross-platform manner and how much work that would be.
In the meantime I suggest using arun's transcript capability combined with the -r switch (that we discussed earlier). So instead of
arun kitchen-ascii.a3c < $solF > $trnF
you would do
arun -r -l kitchen-ascii.a3c < $solF
cp kitchen-ascii.a3t $trnF
With this you also don't get the status line control codes that are now present in the output. You can also remove that with -n (no status line), but since creating a true transcript is the way to go, you don't need to handle that.
Got it @thoni56! I now understand what the problem is, and I personally think that ARun shouldn't be altered in this respect but keep abiding to the underlying echoing rules of the host OS.
In the meantime I suggest using arun's transcript capability combined with the -r switch (that we discussed earlier).
Unfortunately, that's not a viable solution in most of my toolchains and test suites, for the reasons explained earlier.
you would do
arun -r -l kitchen-ascii.a3c < $solF
cp kitchen-ascii.a3t $trnF
The problem with the above is that often there'll be a solution file with the same basename of the adventure (e.g. test.a3c, test.a3s, test-actors.a3s), and the "generate and copy/rename" strategy would overwrite the generated transcript with the same name.
The ideal solution would be to add to ARun a new switch that allows specifying a custom transcript filename, e.g. --transcript-name <filename[.a3t]> (where the extension is optional, and assumed to be .a3t if left unspecified).
That would solve the practical problems I'm facing now.
Failing that, the only solution I can come up with right now is to introduce a custom tool that strips the BOM from the solution file before feeding it to ARun. The downside of this approach is that, beside adding an extra dependency (unless I use SED), it's going to considerably slow down the toolchains' execution time.
Right now, the only thing preventing me from switching projects to use UTF-8 is this BOM injection issue.
Just to recap my overall needs and strategy:
- All manually edited ALAN files (*.alan, *.i, *.a3s solutions) need to be encoded in UTF-8-BOM, to ensure that they are not autodetected/treated as US-ASCII when they contain only ASCII chars, since later edits might introduce non-ASCII chars that could lead editors to switch to an unpredictable encoding (ISO-8859-?, Windows-1252, UTF-8, or something else) depending on the OS default locale, the editor and its default settings, and other factors at play.
- Auto-generated ALAN files (*.a3t transcripts) don't necessarily have to include a BOM, and it should be OK to have them as plain UTF-8. We should expect editors to correctly auto-detect their encoding when they're opened, and since they always come as "final" documents (i.e. no editing steps involved), there shouldn't really be any encoding problems here. Also, transcripts might be used in a variety of contexts (and with various types of tools) where UTF-8 files might not be expected to have a BOM.
- Being able to control the filename of ARun-generated transcripts is essential in the various test-suites and doc-building toolchains because any given folder might contain multiple ALAN adventures, and the solution files which are run against each adventure are filtered using the adventure basename, e.g.:
house.alan → house.a3c ← house.a3s ← house-npcs.a3s
cave.alan → cave.a3c ← cave_meta.a3s ← cave_light.a3s ← cave_walkthrogh.a3s
As you can see from the above, the generated transcripts should match the filename of their solution files, not the adventure filename. The only way to currently achieve this is via redirection, but a new ARun switch could solve the problem elegantly.
So, I think that it's not advisable to interfere with the environment echoing process. On the other hand, if adding a new ARun switch to enforce a custom filename on the generated transcript is not a problematic change, that would be the ideal solution. Failing that, I'll have to find a way to strip the BOM from solution files, on-the-fly, without slowing down execution too much.
Any thoughts and advice?
Got it @thoni56! I now understand what the problem is, and I personally think that ARun shouldn't be altered in this respect but keep abiding to the underlying echoing rules of the host OS.
In the meantime I suggest using arun's transcript capability combined with the -r switch (that we discussed earlier).
Unfortunately, that's not a viable solution in most of my toolchains and test suites, for the reasons explained earlier.
you would do
arun -r -l kitchen-ascii.a3c < $solF
cp kitchen-ascii.a3t $trnF
The problem with the above is that often there'll be a solution file with the same basename of the adventure (e.g. test.a3c, test.a3s, test-actors.a3s), and the "generate and copy/rename" strategy would overwrite the generated transcript with the same name.
I'm not pretending to know everything about all your tool-chains and tests, but isn't having a solution/transcript with the same name as the game in folders where you also have other solution files that should be run just a choice? I mean you could avoid solution files with the same name as the game file in those directories by using a convention for the "main" or "canonical" solution/transcript?
The ideal solution would be to add to ARun a new switch that allows specifying a custom transcript filename, e.g. --transcript-name <filename[.a3t]> (where the extension is optional, and assumed to be .a3t if left unspecified). That would solve the practical problems I'm facing now.
I hear you. The interpreter currently has a very simple and crude "one character" option mechanism. Although copying some other library/module/code for option handling would be a "quick fix", it would probably change the API for options handling, the help output etc., which would incur some work in implementation, tests and documentation.
Failing that, the only solution I can come up with right now is to introduce a custom tool that strips the BOM from the solution file before feeding it to ARun. The downside of this approach is that, beside adding an extra dependency (unless I use SED), it's going to considerably slow down the toolchains' execution time.
If you know there's a BOM, you could also use tail.
A few tests show that sed-ing away the BOM (as suggested here) from the solution file of 84 lines takes around half the time of running that game with the file as input, as does tail -4. So if you only do those two things the execution time will be about 50% longer.
OTOH, a simple mv seems to take approximately the same amount of time, so most of that is probably process startup.
But then you also have the scripting and other steps in your toolchain so I'm guessing the difference will be measurable, but probably not noticeable to the human eye (again, I don't know the details of your toolchains).
I know of no way to just chop off three bytes from the beginning of a file without reading it. Not even with low-level file I/O.
(But as the saying goes, "Don't optimize what you haven't measured to be a problem." So you should probably set up real measurements before trying to optimize the execution times.)
Right now, the only thing preventing me from switching projects to use UTF-8 is this BOM injection issue.
Just to recap my overall needs and strategy:
- All manually edited ALAN files (*.alan, *.i, *.a3s solutions) need to be encoded in UTF-8-BOM, to ensure that they are not autodetected/treated as US-ASCII when they contain only ASCII chars, since later edits might introduce non-ASCII chars that could lead editors to switch to an unpredictable encoding (ISO-8859-?, Windows-1252, UTF-8, or something else) depending on the OS default locale, the editor and its default settings, and other factors at play.
This BOM-for-the-future is an interesting detail that I hadn't thought about before.
- Auto-generated ALAN files (*.a3t transcripts) don't necessarily have to include a BOM, and it should be OK to have them as plain UTF-8. We should expect editors to correctly auto-detect their encoding when they're opened, and since they always come as "final" documents (i.e. no editing steps involved), there shouldn't really be any encoding problems here. Also, transcripts might be used in a variety of contexts (and with various types of tools) where UTF-8 files might not be expected to have a BOM.
- Being able to control the filename of ARun-generated transcripts is essential in the various test-suites and doc-building toolchains because any given folder might contain multiple ALAN adventures, and the solution files which are run against each adventure are filtered using the adventure basename, e.g.:
house.alan → house.a3c ← house.a3s ← house-npcs.a3s
cave.alan → cave.a3c ← cave_meta.a3s ← cave_light.a3s ← cave_walkthrogh.a3s
As you can see from the above, the generated transcripts should match the filename of their solution files, not the adventure filename.
Yes, I understand that. The first line is actually two different "cases", "house" and "house-npcs", and the second line represents three cases where none actually have the same name as the game, so for them my suggested solution would work:
$ arun cave -r -l < cave_meta.a3s
$ mv cave.a3t cave_meta.a3t
$ arun cave -r -l < cave_light.a3s
$ mv cave.a3t cave_light.a3t
$ arun cave -r -l < cave_walkthrough.a3s
$ mv cave.a3t cave_walkthrough.a3t
So would the first if you just change "house.a3s" to "house-canon.a3s" or something.
The only way to currently achieve this is via redirection, but a new ARun switch could solve the problem elegantly.
So, I think that it's not advisable to interfere with the environment echoing process. On the other hand, if adding a new ARun switch to enforce a custom filename on the generated transcript is not a problematic change, that would be the ideal solution. Failing that, I'll have to find a way to strip the BOM from solution files, on-the-fly, without slowing down execution too much.
Any thoughts and advice?
As mentioned, I would revisit the naming strategy for solution files/test cases and see if it was not possible to remove this obstacle to using the currently available mechanisms. But I don't know how many files and scripts need to be changed to do that, of course.
(I have the same need for the regression tests for alan but use Jregr to do that, so I have no scripts to maintain, only the test cases themselves, as Jregr picks them up automatically. And compares actual output to expected. Granted, I know some of your toolchains are not just tests.)
I'm not pretending to know everything about all your tool-chains and tests, but isn't having a solution/transcript with the same name as the game in folders where you also have other solution files that should be run just a choice? I mean you could avoid solution files with the same name as the game file in those directories by using a convention for the "main" or "canonical" solution/transcript?
It's partly a choice and partly a strategy, e.g. in some documentation projects (e.g. the StdLib Manual) where code examples and their resulting transcripts are auto-generated from real ALAN code, the naming conventions are a bit stricter because they carry meaning — e.g. in terms of which source examples should be sanitized and bundled into the final package, and which ought to be ignored, but also because the ADoc sources rely on ADoc attributes for the base filename to include::[] both the ALAN source snippets and the resulting transcripts, so I'll have to tweak the attributes in some cases, if a suffix is added to the base-name. But it shouldn't be a huge problem, and most likely documenting these changes is going to take more time than actually implementing them.
But yes, I could probably fix this by enforcing an additional suffix to all solution files (e.g. adventure.a3c ← adventure-sol_*.a3s).
I hear you. The interpreter currently has a very simple and crude "one character" option mechanism. Although copying some other library/module/code for option handling would be a "quick fix", it would probably change the API for options handling, the help output etc., which would incur some work in implementation, tests and documentation.
Got it. If it's too much work and risks entangling the code then it might not be worth it.
If you know there's a BOM, you could also use tail.
I confirm that tail also ships with the Bash of the Git for Windows package, so that's a viable solution. Bear in mind that I'm considering migrating all toolchain builds from Bash scripts to Ruby Rake — Rake is so much better than Make and shell scripting, and we have the Ruby lang as a dependency in almost all our repos, so it's reasonable to expect our users to have Ruby too. So far, all my local experiments with Rake have produced amazing results, and I'm really excited about using this great build tool.
So, in general, I'll be looking for solutions that are independent of Bash or Linux specific tools and commands, although Rake does support most of them via a Ruby gem that provides equivalent APIs to most of these commands.
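For instance, stripping the BOM on the fly wouldn't need any external tool at all in Ruby. A minimal sketch (the helper and the file names are hypothetical, just to illustrate the idea):

# Minimal sketch (hypothetical file names): strip a leading UTF-8 BOM from a
# solution file before feeding it to ARun, with no external tools involved.
def strip_bom(src, dest)
  data = File.binread(src)
  data = data.byteslice(3, data.bytesize - 3) if data.bytes.first(3) == [0xEF, 0xBB, 0xBF]
  File.binwrite(dest, data)
end

# strip_bom('cave_meta.a3s', 'cave_meta-nobom.a3s')
# ...then feed the stripped copy to ARun via redirection as usual.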
A few tests show that sed-ing away the BOM (as suggested here) from the solution file of 84 lines takes around half the time of running that game with the file as input, as does tail -4. So if you only do those two things the execution time will be about 50% longer. OTOH, a simple mv seems to take approximately the same amount of time, so most of that is probably process startup.
Interesting. I thought that you could instruct sed to only act on specific line ranges (in our case, the 1st line only), hoping that would drastically reduce overhead times, but I guess that most of the overhead goes into the actual file I/O initializations and process handling, as you mentioned.
But then you also have the scripting and other steps in your toolchain so I'm guessing the difference will be measurable, but probably not noticeable to the human eye (again, I don't know the details of your toolchains).
Once I switch to Rake all these overhead problems will probably become negligible, since Rake (like Make) only re-builds when dependencies have changed. What's keeping me stuck right now is that I need to work out a solution for detecting branch switches, so that a Beta SDK is used on master/main and an Alpha SDK is used on dev branches. I'm not sure how to make Rake detect when the branch has changed since the last build, to force clearing its cache when this occurs, because different branches mean different ALAN SDKs, so everything has to be rebuilt (tracked files or otherwise). But I'll eventually find a solution to this.
(But as the saying goes, "Don't optimize what you haven't measured to be a problem." So you should probably set up real measurements before trying to optimize the execution times.)
That would be hard on Windows when using Bash, since everything tends to be very slow in the Bash terminal emulator, not to mention that it often hangs and needs restarting after so many operations. Which is another reason why I was hoping to move all building tasks to Rake (which is very fast).
So would the first if you just change "house.a3s" to "house-canon.a3s" or something.
As mentioned, I would revisit the naming strategy for solution files/test cases and see if it was not possible to remove this obstacle to using the currently available mechanisms.
I'll do that, the "house-canon.a3s" idea is good. As mentioned earlier, I'm also convinced that adding a suffix to the solution files in those projects which have stricter naming conventions should be possible. It's worth the effort, for I'm quite excited to embrace UTF-8 in the StdLib repository.
But I don't know how many files and scripts need to be changed to do that, of course.
That shouldn't be a problem really, especially in the StdLib project, which relies on Bash script functions defined in shared module scripts, so I'll only have to tweak a couple of build and deploy functions in a single file and they'll be effective on every build script; and then, of course, rename all solutions which have the adventure's base name, and amend the include::[] directives in ADoc sources for renamed transcripts.
I might have to add a new function for those folders which use different naming conventions for solutions and transcripts, but that's not too much of a big deal (surely better than bloating build execution times just to strip a BOM).
In any case, these changes are better to do now, while still using ISO encoding, than during/after the migration to UTF-8, for it's easier to test that they work correctly (re-running the tests and builds should produce no diffs in Git). After the UTF-8 migration they should still work without changes, thanks to ARun's auto-detection of UTF-8 BOM.
Closing this, as there seem to be no more UTF problems that we can see now, so we can get on with Beta8 release work ;-)
@thoni56, I wanted to update you on my current experiments on how to work around the problem of creating transcripts with different names than the base adventure storyfile name.
When it comes to using scripts which build all transcripts at once, there are various solutions which are all fairly simple. E.g. before invoking ARun, the script should check whether the .a3s file has the same base name as the .a3c file and either:
- process last the .a3s file with the same name as the .a3c, since its transcript won't then be overwritten;
- give every generated .a3t some prefix (e.g. ___), and after all files are processed check if a file ___<basename>.a3t exists and rename it removing the prefix.
Neither solution even requires using a temporary variable, and they don't hit performance either.
The real problem is going to be able to handle this in Rakefiles, because Rake doesn't usually iterate through every solution file but only tries to re-generate transcripts when their dependencies were updated.
This means that I'll probably need to delegate transcripts generation to a custom Ruby function, which would then have to ensure that an existing <basename>.a3t is not overwritten when re-generating a transcript.
Right now, I think the best solution would be for the function to check if a <basename>.a3t file exists, in which case it would have to temporarily rename it (e.g. adding the ___ suffix or changing its extension to .a3t_bak), then convert and rename the new transcript, and finally rename the former transcript to its original name again. This is definitely going to add some performance overhead to the process, especially if there are lots of transcripts, since it involves two extra file operations for each transcript being generated.
I was wondering if you have a better solution for this; after all, this problem would be similar in Make too, which also acts on a per-file basis, according to dependency updates. So you might have come across similar problems before.
As you are asking me, I'll answer from the perspective I have of similar problems, which is mostly regression testing. "Building" or "compiling" seldom has this situation.
Regression is fundamentally different from "building" since we have an "extra" dependent, the "program", SUT, software under test. Usually we run the program over all test cases every time, since we probably changed that, or want to make sure that all tests passed with the version we have. So there is an implicit dependency that we "know" has changed, the program. This invalidates all dependencies in the test suite.
So I have never tried doing this from Make. Anything similar usually ends up in a fairly complex sequence of shell commands which later becomes a separate script.
The above reason, and of course the fact that I could not find anything to my liking ;-), was the reason I started hacking on Jregr. I'm not necessarily promoting this instead of your approach (I'm not even sure it would help), again I think the difference between testing and building makes a huge difference on how I would approach this.
But with Jregr I run all 550 testcases for Alan in 2.5 seconds, so the performance is good enough to not be prohibitive to re-generate everything, I think. Unless I had thousands of cases and at least one third of them could be avoided using a dependency check, it would make little difference to me.
So the above was just to revisit the reasons for adding the dependency complexity and if the performance is worth it.
I feel that the suggested approach to do that programmatically, in some manner, is the choice I would have made too.
Usually we run the program over all test cases every time, since we probably changed that, or want to make sure that all tests passed with the version we have. So there is an implicit dependency that we "know" has changed, the program. This invalidates all dependencies in the test suite.
That's basically the current situation in the StdLib and StdLib Italian projects, where both the test suite and the "source transcripts" (i.e. auto-generated transcripts to be included as examples in the AsciiDoc documentation) are always re-run in bulk via scripts iterating through each ALAN source, compiling it and then running it against every solution file associated with it — but that's so mostly because I never found the time to set up a dependency-based system, as I'm hoping to do now.
So I have never tried doing this from Make. Anything similar usually ends up in a fairly complex sequence of shell commands which later becomes a separate script.
Hopefully in Rake this should be easier to achieve, thanks to file pattern rules (source → output, like in Make) and the fact that Ruby code can be interspersed inside and outside the Rake tasks to bend both the overall context as well as the single tasks to one's needs. In the small scale Rake test in this repo, for the Cloak sample adventure, it turned out fairly easy to ensure that the transcripts in the sample folder are rebuilt whenever the corresponding <transcript-name>.a3s solution file changes; the rest is handled by the other general rules, i.e. the storyfile being recompiled if the source adventure or any library files were changed, etc.
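Roughly, the shape of it is something like the following — a simplified sketch, not the actual Rakefile in the repo, with hypothetical file names and paths:

# Simplified Rakefile sketch (hypothetical names/paths): a <name>.a3t
# transcript depends on its <name>.a3s solution file and on the compiled
# storyfile, which in turn depends on the adventure source and library files.
ADVENTURE = 'cloak'
LIB_FILES = FileList['lib/*.i']

file "#{ADVENTURE}.a3c" => ["#{ADVENTURE}.alan", *LIB_FILES] do
  sh "alan #{ADVENTURE}.alan"
end

rule '.a3t' => ['.a3s', "#{ADVENTURE}.a3c"] do |t|
  # Generation command simplified here; see the transcript-renaming notes above.
  sh "arun -r #{ADVENTURE}.a3c < #{t.source} > #{t.name}"
end

desc 'Rebuild only the transcripts whose dependencies have changed'
task transcripts: FileList['*.a3s'].ext('.a3t')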
Thanks to Rake features and its tracing and dry-run options it was fairly easy to track the whole dependencies chain to ensure it was working as expected (of course, it can soon grow huge and hard to mentally track it all).
But with Jregr I run all 550 testcases for Alan in 2.5 seconds, so the performance is good enough to not be prohibitive to re-generate everything, I think. Unless I had thousands of cases and at least one third of them could be avoided using a dependency check, it would make little difference to me.
I have no experience with Jregr, but with the StdLib projects currently re-running all tests every time I'm experiencing much longer build times, well over 20 seconds per run (and the test suite is still very slim).
The worst part is updating the transcripts for the documentation, e.g. after having tweaked the source adventure of some code example I end up having to recompile all example adventures and rebuild all transcripts (a step which also involves extra operations like transcript sanitation for Asciidoctor, via SED). Here each run ends up slowing the editing process considerably, especially when there are many small tweaks and afterthoughts triggering the build over and over, hence my desire to use Rake to allow rebuilding only what really needs to be.
I agree that ultimately these types of tests and transcripts should be re-run entirely, just to be 100% sure that we (and our theoretically super-infallible Rakefile) didn't miss out any determining factor. Force building a specific task is quite easy in Rake, you just need to rake -B that task.
Rake also makes it easy to tweak the dependencies' status programmatically, by exposing status attributes which can be changed on the fly, as well as allowing introspection into the whole build process from within the running Rake session itself (possibly a bit harder to handle, but definitely possible). After all, the advantage of Rake over Make is that the former is just Ruby code running in an interpreter session, which allows for ample margin of intervention, thanks to the build tool's DSL being a full language, and Ruby being an interpreted language.
So the above was just to revisit the reasons for adding the dependency complexity and if the performance is worth it.
Obviously, when the test suite is small I could just force it to rebuild each time, which is still simpler than setting up all the dependencies. But in any case, once I adopt Rake in a project I'd like to drop all the original scripts that were being used before, to avoid double standards which might introduce errors and are harder to maintain. Also, with Rake we no longer have to worry about CMD vs Bash scripts, or face problems like with the asciidoctor-fopub toolchain, which required using batch scripts for Windows and bash scripts for macOS and Linux — Rakefiles will work on any supported OS, since Ruby handles most of the cross-platform OS details automatically, and thanks to the FileUtils library emulating many shell commands as Ruby functions.
But I think that as libraries grow, running tests and updating transcripts based on dependencies tracking is going to become a realistic need, because the waiting times are already starting to become noticeable (at least recompiling the source adventures should not be done unless strictly necessary, since that part takes longer than the ARun tasks for the transcripts).
I feel that the suggested approach to do that programmatically, in some manner, is the choice I would have made too.
So far it seems the only viable option, I can't really think of an alternative approach which would reduce file operations here. Obviously, avoiding creating transcripts with the same name as the base adventure would remove the problem entirely (as mentioned earlier, by enforcing at least a single suffix char), but different contexts demand different transcript naming conventions, and I'm afraid that creating different rules to handle different contexts could entangle the Rakefiles (multiple rules for the same file patterns would introduce serious rule precedence issues, hard to track and trust).
Ultimately, I was hoping to create some reusable Rake modules especially designed for ALAN repositories. So, instead of creating ad hoc Ruby classes and functions tailored around the needs of specific repositories, I was hoping to create general purpose functions able to handle the needs of most repositories, where special behaviour could either be activated via custom parameters, or the function being smart enough to deduce the contextual needs by inspecting the source folder and its contents.
As a final note, it's worth mentioning that Rake has some limitations when invoking the sh() function: its behaviour depends on and varies with the current OS and shell being used (e.g. on Windows, different results are to be expected when using CMD, Bash or PowerShell). So, whenever possible, it's best to rely on Ruby functions for any complex command invocation.
Strange as it might sound, there doesn't seem to be any simple and bullet-proof way to determine under which OS Ruby is running — you might have noticed the hack I inserted at the beginning of the Rakefile to obtain OS info. There is indeed a dedicated gem for detecting the OS, but that would add a dependency which IMO seems absurd to need in the first place. In other words, determining the current OS in Ruby requires environment hacks like those found in shell scripts, which also need to take into account MSYS, MSYS2, CygWin, etc. (hard to believe that a simple OS constant isn't being defined in the Ruby interpreter at compile time).
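For the record, the typical gem-free version of that kind of hack boils down to a string match on RbConfig's host_os — a sketch, not necessarily the exact hack in the Rakefile:

# Still essentially environment sniffing, but it only relies on the standard
# library (rbconfig ships with Ruby) and covers the MSYS/Cygwin cases too.
require 'rbconfig'

def host_os
  case RbConfig::CONFIG['host_os']
  when /mswin|mingw|cygwin|msys/i then :windows
  when /darwin/i                  then :macos
  when /linux/i                   then :linux
  else                                 :unknown
  end
end

puts host_os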
ARun is not consuming the BOM when the -u switch is explicitly set, causing the BOM character to slip through as part of the first player input command emitted. See the output_utf8-bom_-u.a3t transcript, which contains the BOM.
The same problem is seen in output_utf8-bom.a3t, which is generated using the same UTF-8 BOM solution file, but without passing the -u switch to ARun; except that we don't see the encoding problem suggestion as above. Here the BOM is seen as corrupt chars instead of as a binary entity, but that's only because the file is seen as ISO-8859-1 by my editor, whereas the previous one is detected as UTF-8. Maybe this has to do with the byte order of these chars? This might explain the difference seen with -u and without it.
BOM sanitation should be applied by ARun when auto-detecting UTF-8 solution files, as well as when passing the -u switch.
I also noticed that when passing a UTF-8 + BOM file, or using the -u switch, the generated transcript (redirected to file) will be in UTF-8, but always without a BOM. Probably this makes sense, since who would need a BOM in the generated file anyhow? But I was wondering if for consistency's sake it might be good to have an extra option to enforce the BOM on generated transcripts — I'm just thinking aloud here, and have no particular edge- or use-case in mind.