TestAnything / Specification

Working towards a new TAP specification
89 stars 8 forks source link

Standarize the notation of subtests in TAP #2

Open kinow opened 9 years ago

kinow commented 9 years ago

This issue was originally reported in https://github.com/TestAnything/testanything.github.io/issues/20, and the conversation has been moved here since it is related to the TAP specification.

It concerns the subtests in TAP and a formal standard to be followed by TAP consumer and producers. In the latest TAP specification (TAP 13) there is no formal definition of subtests. There were proposals and draft documents on this, some tools like Test::More and tap4j support with limitations.

Please feel free to chime in and suggest enhancements for this standard. A message is being written to the mailing list pointing to this issue too. The mailing list has several users that participated in the IETF draft and have lots of experience with TAP and its implementations.

/cc @Leont @xmojmr

jonathanKingston commented 9 years ago

One thing that did not seem to be mentioned on the TAP mailing list was that the potential for a sub test to look like this:

ok 1 - something
ok 1 -     something
ok 2 - something

or

ok 1 - something
ok 1 -- something
ok 2 - something

I believe most libraries would support a failure just fine for both of those, its simpler that requiring an update to the libraries and forces stricter reasoning of tests which is a bonus.

Leont commented 9 years ago

We really want to have this. I'm not fond of the current mechanism, but given it's out there and there's a lot of stuff outputting it already it may be a passed station.

Leont commented 9 years ago

I believe most libraries would support a failure just fine for both of those, its simpler that requiring an update to the libraries and forces stricter reasoning of tests which is a bonus.

Both parsers I've worked on would consider those outputs an error. Subtests really should be something legacy parsers consider comments, which fortunately still gives a lot of space.

kinow commented 9 years ago

@jonathanKingston I'm happy with the old proposal for subtests

TAP 13
1..1
ok 1 - Subtest 1
    ok 1 - subtest 2a
    ok 2 - subtest 2b

I'd be fine with other formats, but since there are tools in Perl, Java, Node, which output something similar to that, I'd be more keen to simply formalize it in a TAP 14 specification.

@Leont

We really want to have this. I'm not fond of the current mechanism, but given it's out there and there's a lot of stuff outputting it already it may be a passed station.

+1 I believe we have the formal TAP 13 specification, and a meta TAP 13.5, or TAP 13 + proposals/custom code. Considering the number of users in different programming languages outputting subtests, I think a proposal similar to what most tools produce (and that got consensus here and in the mailing list) would be a good approach.

As a side note, there is an open issue in tap4j where the TAP producer emits the subtests first. The current state machine in tap4j is not able to consume a subtest before. Quoting the example in that issue below:

    ok 1 - subtest 1a
    ok 2 - subtest 1b
    1..2
ok 1 - Subtest 1
    ok 1 - subtest 2a
    ok 2 - subtest 2b
    1..2
ok 2 - Subtest 2
1..2
ok

A perl test that will produce this output is as follows:

use Test::More;

subtest 'Subtest 1' => sub {
    ok(1,'subtest 1a');
    ok(1,'subtest 1b');
};
subtest 'Subtest 2' => sub {
    ok(1,'subtest 2a');
    ok(1,'subtest 2b');
};
done_testing();

It'd be nice to have the specification in cases like this to guide the developer :-)

jonathanKingston commented 9 years ago

We should not be afraid to break old implementations and if anything that should be the correct behaviour. Old implementations should not ever ignore sub tests. This is a testing setup, ignoring tests would be a mistake. Due to the non standard nature of all the current use of subtests it is possible subtests could fail and the parent succeed - this should be caught or be a parse error; either for me is fine. This brings me to another ticket I need to make about how TAP software should respond if it can't be parsed.

I'm a little against looking to satisfy older implementations which we might end up calling 13.5 erroneously. This is a similar situation to what Markdown has been through and where possible I would prefer to go for breaking behaviour and errors will be spotted sooner rather than skipped over.

This was my theory around the formats above which should be treated as safe in most parsers where extra white space would be ignored around the comment.

I would therefore aim for:

I'm actually a fan of the tap4j as it would aid streaming output of the tests before they have completed running.

The other issue with the old proposal was that there is no acknowledgement of sub test count which I think is an issue.

One last point, the number of main consumers and producers is still actually small, patching all of them to support a newer version shouldn't be a large piece of work.

Leont commented 9 years ago

We should not be afraid to break old implementations and if anything that should be the correct behaviour. Old implementations should not ever ignore sub tests. This is a testing setup, ignoring tests would be a mistake.

The way it was designed means that even if the subtests are ignored the correct conclusion can still be made. It leads to a data-loss, not not to incorrect operation.

Due to the non standard nature of all the current use of subtests it is possible subtests could fail and the parent succeed - this should be caught or be a parse error; either for me is fine.

In the parser I wrote a discrepancy generates a non-fatal parse error.

I'm a little against looking to satisfy older implementations which we might end up calling 13.5 erroneously. This is a similar situation to what Markdown has been through and where possible I would prefer to go for breaking behaviour and errors will be spotted sooner rather than skipped over.

Do we really have that problem?

This was my theory around the formats above which should be treated as safe in most parsers where extra white space would be ignored around the comment.

I don't follow you. They don't have extra whitespace at the start of the line, so they won't be ignored even on legacy parsers.

jonathanKingston commented 9 years ago

The point I am making is we are expecting the producer to be error free by always returning a not ok parent test. I would prefer if both producer and consumer were more fragile rather than the 'Just cope' attitude the philosophy suggests.

This was my theory around the formats above which should be treated as safe in most parsers where extra white space would be ignored around the comment.

I don't follow you. They don't have extra whitespace at the start of the line, so they won't be ignored even on legacy parsers.

That is my point, why would you want them ignored. I want a testing framework to do one of two things when it doesn't know what to do:

Older parsers may or may not be able to cope with invalid subtests. The following will not fail with some of the parsers I have seen:

ok 1 - test
  not ok 1 - test
ok 2 - test

However both these would:

ok 1 - test
not ok 1 -    test
ok 2 - test

and

ok 1 - test
not ok 1 -- test
ok 2 - test

The way it was designed means that even if the subtests are ignored the correct conclusion can still be made. It leads to a data-loss, not not to incorrect operation.

This is unlikely to be true for all producers but I take your point that the parent test should fail however I would prefer the correct behaviour for parsers is that they fail.

In the parser I wrote a discrepancy generates a non-fatal parse error.

This is unlikely to be true for all consumers, in fact the TAP philosophy (which likely needs addressing) suggests any such issue should be skipped.


The only way I would support this with the subtests being before like @kinow mentioned. Even then the format isn't as easy to read.

Leont commented 9 years ago

Older parsers may or may not be able to cope with invalid subtests. The following will not fail with some of the parsers I have seen:

Even as OKs they would fail because the numbering is incorrect.

This is unlikely to be true for all producers but I take your point that the parent test should fail however I would prefer the correct behaviour for parsers is that they fail.

Agreed. Parsing only the parent test should clearly be considered legacy behavior, but legacy has its uses.

This is unlikely to be true for all consumers, in fact the TAP philosophy (which likely needs addressing) suggests any such issue should be skipped.

My take on the philosophy would be: "report errors, but when possibly make them non-fatal", as that way the user can be presented with as much information as possible.

jonathanKingston commented 9 years ago

Very true however subtests could be part of the same numbering potentially.

Is there other useful meta data that subtests have?

Yeah I will raise an issue also with that as I certainly don't think it is clear enough.

jonathanKingston commented 9 years ago

@ovid I saw some interesting comments on the mailing list of yours. (http://www.ietf.org/mail-archive/web/tap/current/msg00530.html) Would you be able to mention your concerns here?

I believe a way of demarcating subtests is needed and it should satisfy these conditions:

Leont commented 9 years ago

Allow for sub test numbering 1.1.1 and 1

I'm not sure what you mean with that exactly.

jonathanKingston commented 9 years ago
ok 1 - something
ok 1.1 -- something
ok 1.1.1 --- something
ok 2 - something

and

ok 1 - something
ok 1 -- something
ok 1 --- something
ok 2 - something

Obviously numbering itself should be optional also.

Ovid commented 9 years ago

@jonathanKingston: The approach has been suggested before and always shot down. The problem with this approach is that it is not backwards-compatible. Maintaining this backwards compatibility is a key design goal of TAP.

jonathanKingston commented 9 years ago

@Ovid it depends how you define backwards compatibility, therein lies my issues with the TAP philosophy. Where the new format includes further information about the tests, TAP consumers should of course always succeed.

However in this case we are transferring further tests, we should be doing our best to make all previous consumers to assume the tests are top level tests.

Subtests should in older consumers either:

It would not be acceptable for the following to be treated as an overall pass in older consumers:

ok - test
    not ok - test
ok - test

What I think this means is that the choice between any subtest format is:

Ultimately if backwards compatibility were a priority then the TAP version would not have been put in place in the first instance. In my opinion truly backwards compatible structures would not put in version strings.

The problem around backwards compatibility here restricts subtest numbering completely (However looking at the specification numbering is very loosely defined itself, it doesn't even state test numbers are incremental), by adding in numbering when we have the tests any numbering policy would be problematic also.

I think the benefit here is that we can define subtests without the risk of breaking older consumers - false negatives are better than false positives when it comes to tests themselves.

Leont commented 9 years ago

Cause a parse error (Getting the consumer to upgrade their consumer)

This is not acceptable at all. Specially not in situations where the producer and the consumer are distributed separately. If the consequence of upgrading your producer is breakage, the most sensible decision is to not upgrade that producer. This sort of thing means people will refrain from upgrading to TAP 14.

Add in top level tests (Allowing test failures to be treated as failures if they are)

Both of your suggestions will cause a test harness to fail despite the tests all being succeeding. Again, if upgrading a producer causes the toolchain to fail when the tests pass, the sensible thing is to not upgrade the producer.

It would not be acceptable for the following to be treated as an overall pass in older consumers:

That is already the situation today. And it's not a major issues because it only happens on a faulty producer. In a green-fields implementation this would make sense, but given the reality we can't enforce this for legacy consumers.

jonathanKingston commented 9 years ago

I'm sorry but I really am not seeing the adverse reaction to a testing framework that requires you to upgrade your consumer when you decide to want subtests.

All ideas have their pitfalls so far. Its a testing framework not a GUI tool or phone system we should not be trying to remedy old frameworks. Testing frameworks need to report errors not be ignored accuracy should be the key.

It is not really greenfield at all as some have gone with the suggested wiki post / mailing list. Some have gone with the tap4j backwards format. Some have invented YAML subdocuments and etc. The best we can do is cause errors and explain why by incrementing the version number.

The assertions made are that the TAP output is read by one consumer for the one producer and that won't always be the case.

I'm not precious to any of the suggested structures I have given at all however I am precious to allow the safest failures for older frameworks. You mention that test harness will fail and yes ideally we should look for a structure that has the least issues in the most popular consumers however as I keep saying we should not be worried about causing false positives, false negatives would be far worse.

Ideally I would prefer an implementation that breaks all old implementations to force the consumer upgrade if subtests are used.

If you were upgrading a framework of any kind in a corporate setting, the best is to assume that everything interacting with the framework will break. If we make TAP 14 compelling enough then slowly people will upgrade. Upgrades in all software needs to be compelling enough as likely enough there will be issues.

What I suggest is we keep looking, I really don't think it is ok that the best worst of the indentation is taken up. This causes YAML parsing issues for simplistic parsers.

I'm back to lets throw as many examples at consumers until we find the most that either:

Can we at least keep looking/trying for a solution here, I think it is pretty obvious why no solution was ever chosen. Slumping for the first suggestion we know seems a little lacking.

Ovid commented 9 years ago

@jonathanKingston: I know this must be frustrating for you and I'm sorry about that. We're very, very cautious about this because TAP is the default test protocol used in the core Perl language. Perl is distributed by default with just about every major operating system except windows. This means that in just about every city on the planet with modern computers, Perl is there and thus TAP is there.

There is no way I can estimate the entire scope of breakage (many of those default installations will never be touched), but there have been serious issues in the past with accidental toolchain breakage causing an uproar. And then there's the ripple effect of this change spreading out to other language test harnesses which produce and consume TAP and there would be yet another uproar (and I'm pretty sure it would kill TAP or cause a fork).

I see no compelling reason for this change, but the downside is that it would be a disaster and devs in countries we've never thought of would be staring at their monitors, trying to figure out what happened. The first rule of the toolchain is that you do not break the toolchain (we Perl toolchain devs have been bitten enough times that we're learned our lessons).

As for how to handle YAML diagnostic information, I thought it was to be indented four spaces from the current line, but there's a potential problem there (the following is completely made up and just used as an example):

ok 1 - some test
ok 2 - some test
    1..3
    ok 1 - some subtest
    not ok 2 - some subtest
        ---
        have: 7
        want: 8
        line: 7
        file: t/foo.t
        extra:
               - |
                   This is a YAML here document and maybe it's
                   not ok to embed this in test diagnostics
    ok 3 - bummer
not ok - 3 - subtest summary line
ok 4
1..4

In the above example, I could easily see a poorly-written parser choking on the YAML heredoc. That will need to be specified carefully and I don't see how the suggested changes which break the parser would mitigate the above case.

To help us move forward on this, please show us concrete examples of:

Without seeing specific examples, we'll get this wrong.

As an aside, is anyone reading this who was also at the Oslo QA hackathon in 2008? We hashed out a lot of this there, but more importantly, there was discussion of creating "spec tests" in YAML or XML format that parsers written in any language should be able to parse and validate. That would allow TAP consumer authors to have a standard against which they could test. Does anyone recall what happened with that? I wasn't working on the spec tests at that event.

kinow commented 9 years ago

@Ovid

As for how to handle YAML diagnostic information, I thought it was to be indented four spaces from the current line

I also recall reading it somewhere, probably tap4j expects YAMLish in a indented block of text, starting with --- and ending with ...

kinow commented 9 years ago

Since the specification has been frozen for a while, a TAP 14 which breaks backward compatibility could be dangerous and reduce the number of adopters, IMO.

Though I agree with @jonathanKingston that we need to move further and not be afraid of changes, at this time maybe it would be more sensible to release a specification that could be easily adopted by implementations, and would let us measure its level of adoption by implementations and gather new contributors and users.

@jonathanKingston in TAP 13 old website, I think there was a section Proposals. Maybe we could create similar page in the website with sections for TAP 14, TAP 15 and Backlog (or others) and write what we would be aiming to release with each new specification. Kind like a roadmap. What do you think?

xmojmr commented 9 years ago

@kinow :+1: public product backlog spreadsheet prioritized for the TAP 14 draft sprint looks like useful deliverable

This need was already expressed using different words in Leont's comment, my comment, Ovid's comment

Where it should be hosted (GitHub Wiki, Google Docs, GitHub Issues, Trello, ...) and how would the review and voting processes look like is an unknown for me. I don't have corresponding positive open source experience and The IETF Process: An Informal Guide is not very digestible guideline

@jonathanKingston ?

kinow commented 9 years ago

Thanks @xmojmr! Just created issue #10 for it, so that we can move this discussion there.

jonathanKingston commented 9 years ago

@Ovid to be clear it is not me who actually wants this feature, however it would be a useful separation of the TAP output to see categorisation of the issues.

What I am trying to prevent is causing more issues for the future of TAP by adding in the simplest option we can pick from which is whitespace indenting. Yes YAMLish formats have the indicators but I would prefer parsers to be able to ignore all indented text unless it spots either bail out or not ok.

The example code you gave explains my distaste in standardising the old proposal of indenting the subtests, I want to keep parsing rules as simple as possible.

I would be interested in seeing your idea around "spec tests" even if it is from memory and not exactly how it was suggested. As I said I'm not really suggesting anything besides keeping the TAP format simple whilst allowing flexibility.

@jonathanKingston: I know this must be frustrating for you and I'm sorry about that. We're very, very cautious about this because TAP is the default test protocol used in the core Perl language. Perl is distributed by default with just about every major operating system except windows. This means that in just about every city on the planet with modern computers, Perl is there and thus TAP is there.

I really don't buy this statement sorry, software can be updated carefully. Updating specs is possible even in the most widely used languages. I get that breaking changes will happen but that is the risk of updating the producer. This is the same reason why old browsers are prevalent and PHP 4 in enterprise setups.

A fork has ultimately happened already happened with TAP-Y etc, without progressing the format to include the latest user requirements adoption will slowly die out. If TAP is so ingrained into the Perl toolchain that it can't be extended then ultimately it shouldn't have been opened up outside of the libraries that consumed it. If the original tools that created TAP want to remain using the older version, that is completely fine also. I think the only way to progress is as @xmojmr suggested creating guidelines for extending TAP and a framework where people can vote on these issues.

@kinow I agree TAP 14 should just be a clean up of the spec where possible breaking backwards compatibility should be 15 (unless we opt for point increases which isn't a bad idea either).

@xmojmr yep moved all talks of a roadmap to #10 thanks for all the previous comments on the previous thread also :+1:

Leont commented 9 years ago

The example code you gave explains my distaste in standardising the old proposal of indenting the subtests, I want to keep parsing rules as simple as possible.

I think that is a very reasonable POV, I just don't think your suggested resolutions are workable. I'm sure there are syntaxes that would be ignored by a TAP12 consumer but are easier to parse for a TAP14 consumer than the current mechanism. This is a solvable problem. For example.

case A subtest {
    ok 1 - A
    ok 2 - B
}
ok 1 - A subtest

or

ok 1 - A subtest {
    ok 1 - A
    ok 2 - B
}
jonathanKingston commented 9 years ago

@Leont again I am back to thinking it doesn't need to be ignored so long as there is a failure but lets not get into that.

Both those examples then need the parser to be a node based parser rather than a simple one test per line that starts with /^(not|ok)/. If I am building a parser that complex then I can filter out the YAMLish start and end.

Prefixing with a character might be easier (despite not meeting my desire to have subtests parsed by older consumers):

ok 1 - parent
> ok 1 - child
> ok 2 - child
> > ok 1 - child

I'm not a massive fan of that either but it's certainly simpler than context based parsing.

exodist commented 9 years ago

I know I am late to the party, but @Leont just looped me in (kind of). Before I knew this discussion was going on, in fact it was early into my adoption of Test-Simple, I decided to build something like this in as an option into Test::Builder. As it stands the dev versions have the following as a subtest syntax you can turn on (default is the classic)

ok 1 - not a subtest
ok 2 - a passing subtest {
    ok 1 - inside a subtest
    ok 2 - inside a subtest
}
not ok 3 - a failing subtest {
    not ok 1 - failure inside
    # diag stuff
}

This doesn't break any of the harnesses/parsers I have used it on.

Please do not take this available syntax as me going cowboy or trying to preempt anyone here. I did this before I knew there was an ongoing process or other people to discuss it with. The main takeaway should be that the dev releases now make it possible to support a new subtest syntax if one is decided. Also it is now possible to put the subtest line before the results from inside the subtest if that is necessary.

jonathanKingston commented 9 years ago

I'm still back to sub tests would be better not indented if possible and consumers would have simpler memory usage if a state wasn't required to know the nesting.

The YAML causes a simplistic consumer to need to check for the YAMLish lines rather than just ignore indented text.

The context based approaches are better avoided (crude but similar to):

ok 1 - thing (1) {
ok 1 - sub thing
}

@exodist thanks for the heads up, I would like to make sure the format is correct before it becomes close to being in TAP14. Like you mentioned in the other ticket it would be easy to miss the progress in this repo.

beatgammit commented 9 years ago

I'm still back to sub tests would be better not indented if possible and consumers would have simpler memory usage if a state wasn't required to know the nesting.

I agree.

I've seen some sentiment about wanting backwards compatibility, but I really don't understand that requirement. Obviously you'd upgrade your consumer before your producers, so I don't think the design should be limited by that.

That being said, I think I like this approach:

ok 1 - thing
ok 1.1 - sub thing
ok 1.1.1 - sub sub thing
ok 1.2 - sub thing 2

If you don't want to number your tests, perhaps this would work:

ok - thing
ok . - sub thing
ok .. - sub sub thing
ok . - sub thing 2

The context based approaches are better avoided (crude but similar to):

Exactly. This can get nasty with several levels of nesting (assuming no indentation):

ok 1 - thing (1) {
ok 1 - sub thing (2) {
ok 1 - sub sub thing {
}
ok 2 - sub sub thing {
}
}
}

If we are going with something that requires context, I would much prefer indentation as it's much easier to read.

Ovid commented 9 years ago

@exodist has offered pretty much the first syntax I've seen which does not break backwards-compatibility, but it has some issues (which I'll cover in the moment).

Many suggested alternatives break backwards-compatibility because:

In other words, millions of programs all over the planet which might parse TAP could fail, even though the tests pass and the software works. This is not a tiny project which a handful of fans have adopted. If there is a desire to break backwards-compatibility (and that would create a sh*tstorm), please:

Some of that has been attempted above, but the theoretical discussions I've seen are far from compelling. There's been some discussion about leading characters other than spaces and that also confuses me a bit: why is indentation OK so long as it's not spaces? (That's an honest question. I would love to understand the rationale and I probably misunderstood something here).

Regarding the syntax Chad has proposed:

ok 1 - not a subtest
ok 2 - a passing subtest {
    ok 1 - inside a subtest
    ok 2 - inside a subtest
}
not ok 3 - a failing subtest {
    not ok 1 - failure inside
    # diag stuff
}

I don't see any leading or trailing plans. How does parser know that only two tests were supposed to be run in the first subtest and not three tests? The trailing curly brace doesn't say, but an asserted plan does say.

The other issue I have with this is more subtle.

When I added subtests, I put the subtest summary test after the subtest. Aesthetically I prefer Chad's version, but the summary test came after for one simple reason: you block on test output. I worked on one system which used a hideous XML format to encode test output. Your test run might take a long time, but even if the first test failed you wouldn't know because the parser had to wait until the entire document loaded (often the entire test run) to see the failure. This destroyed the rapid test/hack/test/hack/test/hack cycle. You ran a test and went to get coffee. TAP is deliberately line-oriented, not documented oriented. You can easily stream lines. Documents? Not so much.

If the summary line is presented first, the entire subtest has to run but its output blocked because you have to finish the subtest, determine if it passed, print the summary TAP line and then flush the output. If subtests contain subtests (and I'm currently working on a system for which deeply nested subtests are the rule rather than the exception), you can have huge chunks of TAP output blocked, waiting for all of the subtests to pass. This destroys the streaming nature of TAP.

exodist commented 9 years ago

@Ovid, first thing, the plan, I hand-wrote the example in my last post, and I forgot to include a plan, but the syntax as it is implemented does handle plans.

I also must state that I was not proposing that syntax, I was just mentioning we already have it due to my ignorance of this debate.

Moving the final result to the top had very little to do with aesthetics. The reason I put it at the top was actually to solve the jumbled output problem that occurs when tests are run in a multi-thread/multi-proc way. If you have 3 or 4 child processes producing results, and they use subtests, the output becomes less than useful. For a singlet-process/thread test it makes sense to use the old subtest syntax, but it simply does not work when you add concurrency to the mix.

I have been working with unit tests that support concurrency for 5+ years now, first with Fennec, and now with the baked in fork/thread support in Test-Simple. Subtests as they are currently are simply insufficient when concurrency is involved. Delaying the output of results within a subtest, and displaying them with the final result is the most sane way I found.

If the TAP is intended for a machine then the delay doesn't matter, the machine has a lot more patience than a human. If the TAP is intended for a human then it needs to be fast, but it also needs to make sense, and subtests where the final result comes last have proven to be very unfriendly to humans who have frequently expressed confusion.

Finally, the way Test-Simple implements it you can actually have BOTH, it is possible to tell it to enable both forms:

ok 1 - not a subtest
    ok 1 - inside subtest
    ok 2 - also inside
ok 2 - final subtest result {
    ok 1 - inside a subtest # again
    ok 2 - also inside # again
}

This lets you see results inside subtests as they happen, but once completed you can see it in full context. A harness could be trained to ignore indented results outside of brackets, thus letting us use this dual form to provide instant feedback for humans, but useful and contained data for both machines and humans. If we wanted to make it less ambiguous we could prefix the indented ok's that occur outside the blocks with something:

ok 1 - not a subtest
    s'ok 1 - inside subtest
    s'ok 2 - also inside
ok 2 - final subtest result {
    ok 1 - inside a subtest # again
    ok 2 - also inside # again
}

*Note: Once again I did not put the plan in there anywhere, that was simple lazyness.

rsrchboy commented 9 years ago

I hesitate to chime in here, but... as @Leont said in one of the very first comments:

We really want to have this. I'm not fond of the current mechanism, but given it's out there and there's a lot of stuff outputting it already it may be a passed station.

Sure, the current subtest output is arguably not ideal, but it's there and shown to work. It includes a plan, doesn't conflict with the current TAP spec, and is wonderfully simple. That is: the subtest is just another TAP; strip the indent and feed it to a parser and it should work nicely.

ok 1 - hi!
    # Subtest: a subtest
    ok 1 - la al al
    ok 2 - fo fo fo
    1..2
ok 2 - a subtest
1..2

Is there some benefit that we would gain by rejecting this simple, logical -- and in use -- extension in favor of some other deeper change?

jonathanKingston commented 9 years ago

@Ovid the main reason for the 'change' is that lots of producers and consumers have support for this however it has never been documented or official in any way. The benefit of making it part of the format is that scripts are less likely to just randomly break changing consumers and producers.

Without any clear documentation of the current implementations, there is enough non standard support people actually think its a native format feature which is not good news for standardising TAP.

The main reason against whitespace prefix is to stop the confusion in parsers checking the YAML for subtests, for example:

ok 1 - test
   ---
   userinput:
     ok I'm not sure if this will be an issue here
   ...
ok 2 - test
    ok 1 - tests happened here

If this were instead as an example:

ok 1 - test
   ---
   userinput:
     ok I'm not sure if this will be an issue here
   ...
ok 2 - test
____ok 1 - tests happened here

To match test lines would simply be: /^[_]*(ok|not\s+ok)/ rather than something that was far more context aware.

I'm back to the, we are going to make a breaking change trying to standardise all this anyway so we are better getting it right. The lure to the new version will be all implementations will be more robust over the non standard behaviour of lots of different implementations.

@RsrchBoy as mentioned above, this is not standard some producers just happen to support that. As others have mentioned different producers decide to support different orders, numberings, plans etc.

jcelliott commented 9 years ago

In other words, millions of programs all over the planet which might parse TAP could fail, even though the tests pass and the software works.

I don't understand this. Continuing development of the TAP standard isn't going to break anything. Producers and consumers that use the current version of TAP will continue doing so. To be honest, I don't know much of the history of TAP and am certainly not an expert; but this stance seems to be equivalent to saying we shouldn't continue developing the ECMAScript standard because some browsers won't support it.

beatgammit commented 9 years ago

Is there some benefit that we would gain by rejecting this simple, logical -- and in use -- extension in favor of some other deeper change?

Apart from what @jonathanKingston said, simple indentation makes parsers more complex because they need to maintain state. If backwards compatibility is a strict requirement, then there are even more complications because TAP 13 doesn't specify how many indents a YAML document is, so a TAP 13 parser may incorrectly assign the YAML document to the outermost parent:

ok 1 outermost document
  ok 1 sub test
    ---
    yaml: document
    ...

In this case, it would be better to ignore the YAML document entirely. TAP 13 specifies that YAML must be the first line after a test, but it doesn't specify how unrecognized lines are to be handled. A weak parser (which I assume many are) may silenly ignore the sub test and accept the embedded YAML document as describing "outermost document".

I'm back to the, we are going to make a breaking change trying to standardise all this anyway so we are better getting it right. The lure to the new version will be all implementations will be more robust over the non standard behaviour of lots of different implementations.

I completely agree. It should be backwards compatible to the extent that TAP version XX works, but beyond that, I think we should leave as much open as possible.

I am partial to my proposed ok 1.2.3 <description> syntax for subtests since it state is inside the test line and it's easy to read. This is reasonably backwards compatible in that parsers should work if they match like this: /ok (\d)+/, and everything from the first . and after will be considered part of the description. This will only work if parsers are lenient with test numbers (e.g. two tests named ok 1 won't break it). This wasn't part of the design goal, but is a nice side-effect.

In any case, I don't think TAP 14+ should be limited by backwards compatibility. That's what the TAP version XX line is for.

However, I don't work for a large company with a large TAP infrastructure, so I don't know what practical limitations there are, but it seems that updating a consumer should be quite painless.

exodist commented 9 years ago

I like the 1.2.3 syntax for the most part. But my problem with it comes down to an issue I touched on, but probably did not convey the importance of clearly. Concurrency. If your test uses threads or child processes, and both can produce TAP, you get to a tricky situation with subtests. The number is typically assigned at the time of rendering, but if results inside subtests can be rendered before the final subtest then you need to assign it a number early, then any parallel results need toblock until they can render with their number, which comes later, and cannot be out of order according to TAP specs.

Also, would we define strict ordering of results? IE

ok 1
ok 2.1
ok 2.2
ok 2.3
ok 2

vs

ok 1
ok 2
ok 2.1
ok 2.2
ok 2.3

Would both be accepted?

Note, this problem would disappear if TAP no longer required OK's to be in order, but instead allowed them to come in any order so long as no numbers were missing.

I mean I can make Test::More output whatever we need in any order we need, and it is only slightly more complicated, but dropping the ordering requirement, at the very least for subtest innards vs final ok would simplify the job for produces by a good bit.

exodist commented 9 years ago

oh, I should also note that I do not personally care about the backwards compatibility debate. When it comes to my own personal things I am fine with breaking a couple things here and there to make valuable progress.. However Test-Simple is another story, I don't have the liberty of breaking things on a whim. That said if a new TAP standard comes out Test-Simple will of course have support for it, though if it breaks things it will need to be a support people ask for, not enabled by default.

beatgammit commented 9 years ago

From what @exodist brought up, I think we need to decide on the meaning of the test line for a test with sub-tests:

# how do we know this has failed if we're streaming the results?
fail 1 test case
  fail 1 sub test
  ok 2 sub test

I think we have three options here:

I think the first two are the only real options since the last prevents streaming. None of the options directly require breaking backwards compatibility, so I think we can decide on this before we decide whether the syntax will break backwards compat.

exodist commented 9 years ago

What I would like the biggest takeaway from my comment is this: Concurrency is coming, in many ways it is here. In order to support concurrency I think the next TAP specification should drop the requirement that tests appear in order. I think that relying on order is a mistake. Producers should be free to output results in any order they want. The only requirement should be that if you use numbers, no numbers should be missing between your first and last.

This also makes the question of putting results inside subtests before or after the final subtest result a non-issue. Producers should be free to output both in whatever order makes the most sense. This also makes concurrency a much more simple prospect for producers, they only need to ensure test numbers are unique, they do not also need to enforce an order.

exodist commented 9 years ago

Let me provide an example scenario (psuedocode/perl)

thread {
    subtest subtest_a {
        ok 1, "a - first"
        ok 1, "a - second"
        sleep 10; # ensure this test finishes last
    }
}
thread {
    sleep 2; # Ensure this test is the second one
    subtest subtest_b {
        ok 1, "b - first"
        ok 1, "b - second"
        sleep 2; # Ensure both tests produce results before either finishes
    }
}

So what we have here are 2 parallel subtests, the sleeps ensure that the second one to start will finish first. We also ensure both can output results before either complete.

In current dev releases of Test::More the output would look like this:

    ok 1 - a first
    ok 2 - a second
    ok 1 - b first
    ok 2 - b second
ok 1 - subtest_b
ok 2 - subtest_a

Obviously this is no good for a consumer, it has no way to know which indented oks go with which final oks. Because of this I absolutely advocate for a change from the current conventions and syntax.

I don't need to give examples for how this looks with the alternative Test::More is currently capable of, the nested results are in brackets and everything just works. The problem is that it "breaks streaming" by refusing to render some results until others complete.

So lets look at the 1.2.3 format, which I happen to really like now that I have thought about it:

ok 1.1 - a first
ok 1.2 - a second
ok 2.1 - b first
ok 2.2 - b second
ok 2 - subtest_b
ok 1 - subtest_a

Uh oh... test 1 and test 2 are out of order! Why? because in order to render 1.1 we had to assign the first subtest the number 1. However the first subtest actually finishes much later than the second concurrent subtest. We can fix this by refusing to render 'ok 2' until after the first subtest finishes... wait, this "breaks streaming" again!

We then also need to consider, can 'ok 2.1' be rendered before 'ok 1' is rendered? or is that ALSO an out of order error?

So if we are going to support concurrency at all in tap we have 2 choices: break streaming, or accept numbers out of order and simply enforce that none be missing.

I personally do not care what option we take to support concurrency, but I strongly urge that we do.

jonathanKingston commented 9 years ago

@exodist I'm in agreement with you on this regarding concurrency. The support should be there for once all the tests have complete that the tests are all there. However order should not matter at all.

Adding the point values for subtests minimises the changes needed for support and satisfies the YAML whitespace issue. The only problem with this is that sub tests can't have plan information easily, this also impacts the concurrency issue too.

ok 1.1.1 - thing
ok 1.1 - thing
1..1
ok 1.1.2 - thing
ok 1 - thing

There is no obvious place to then put subtest numbering and actually forcing numbering isn't helpful either as often developers are forced into numbering tests they have found which defeats the point of having numbering in the first place.

This is an important issue for concurrency as nothing is more frustrating than lost tests - so that is another spanner in the works that would need fixing.

exodist commented 9 years ago

It is too bad plans are 1..X and not 0..X which would allow us to put the subtest number for the plan

0..2
1..2
1.1..2
ok 1.1.1
ok 1.1.2
ok 1.1
ok 1
ok 2

but changing the overall plan from 1 to 0 is probably a no go eh?

Alternatively we can add a 'plan'' keyword, and if it is omitted we assume an older TAP version?

plan X 1..Y

X can be the subtests number, Y is the number of tests in the plan. If X is 0, or omitted it is the overall plan.

But the idea I really like is a generic way to associate a line with a test/subtest:

1..2
1: 1..2 # plan for subtest with number 1
1.1: 1..2 # Plan for subtest inside a subtest
ok 1.1.1
ok 1.1.2
ok 1.1
ok 1.2
not ok 2
2: # diag for test 2
beatgammit commented 9 years ago

I like @exodist's last example the best so far. The colon is so far unused syntax and it's pretty common as a label.

exodist commented 9 years ago

I would also point out that the colon form lines will be ignored by old (current) parsers, making the plans safe on older harnesses. The only question is how they will handle the 1.2.3 numbers in test, if they ignore them we are all set since they would be ignored just like previous subtest stuff. If they are not ignored then we break back-compat. In this case I think breaking back-compat is worth the gains. Specially since this format is by far the most suitable for concurrency.

beatgammit commented 9 years ago

The only question is how they will handle the 1.2.3 numbers in test, if they ignore them we are all set since they would be ignored just like previous subtest stuff. If they are not ignored then we break back-compat.

My concern is how older parsers will handle the plan since it's not going to add up:

1..2
ok 1.1
ok 1.2
ok 1
ok 2

If a parser strictly checks the plan, then they'll get 4 tests when expecting 2, and three tests named 1. honestly hope existing parsers throw an error on this. "Fixing" existing parsers means completely ignoring the plan and test numbers...

I do agree that breaking backwards-compat is worth it for the gain. Supporting this format requires very little addition to existing parsers, which should certainly help adoption.

Ovid commented 9 years ago

Just so we know:

$ cat test.pl
print <<"END";
1..2
ok 1.1
ok 1.2
ok 1
ok 2
END

And then running through Perl's Test::Harness via the prove command:

$ prove -v test.pl
test.pl .. 
1..2
ok 1.1
ok 1.2
ok 1
ok 2
All 2 subtests passed 

Test Summary Report
-------------------
test.pl (Wstat: 0 Tests: 4 Failed: 2)
  Failed tests:  1-2
  Parse errors: Tests out of sequence.  Found (1) but expected (2)
                Tests out of sequence.  Found (1) but expected (3)
                Tests out of sequence.  Found (2) but expected (4)
                Bad plan.  You planned 2 tests but ran 4.
Files=1, Tests=4,  0 wallclock secs ( 0.02 usr +  0.00 sys =  0.02 CPU)
Result: FAIL
exodist commented 9 years ago

Someone suggested off the thread that we have Test::Harness provide an env var to tell Test::More that it supports the newer TAP, this would give us a way to let Test::More remain backwords compatible while also letting it use the new syntax. I would have Test::More use legacy output if there is a harness, but ti does not provide the new env var. If there is no harness (we already have an env var for this) it will default to the new syntax.

Leont commented 9 years ago

Someone suggested off the thread that we have Test::Harness provide an env var to tell Test::More that it supports the newer TAP, this would give us a way to let Test::More remain backwords compatible while also letting it use the new syntax.

I was pondering along the same lines. This probably deserves a ticket of it's own, as it has rather wide implications.

jonathanKingston commented 9 years ago

Someone suggested off the thread that we have Test::Harness provide an env var to tell Test::More that it supports the newer TAP, this would give us a way to let Test::More remain backwords compatible while also letting it use the new syntax.

I was pondering along the same lines. This probably deserves a ticket of it's own, as it has rather wide implications.

Exactly that, this isn't the discussion of the upgrade path or supporting backwards compatibility, I was close to opening this before but I think it is worth it's own issue now: #13

Lets keep this to discussion of sub tests, so far I think the best candidate we have is the point based numbering for simplicity. The problem I see is that we don't have a notion for subtest without numbering, I would like to suggest the following:

ok x.x - Something happened
ok x.x - Something else happened
ok x    - Something happened
ok 1.x - Something happened

Where the consumer wouldn't be able to infer the relationship between parent and subtests as @exodist suggested it is most important that all consumers can handle results in any order.

I would also like to clarify in the spec any amount of whitespace is acceptable between result, number and description. I also think we could standardise the use of the dash which is not required in the output which should be helpful in alignment as shown above.

The alternative for subtests without numbering is using dashes which should also be backwards compatible:

ok -- Something happened
ok -- Something else happened
ok -  Something happened
ok -- Something happened
exodist commented 9 years ago

It occurs to me that the '#:' prefix system would not work when numbers are disabled. So instead of using the prefix for subtest plans I suggest instead we allows for a subtest plan to be prefixed with a sequence of s's.

1..2
ok - not a subtest
s1..1
ss1..1
ok .. - 2 subtests deep
ok . - result of the nested subtest
ok - result of outermost subtest

if a plan has no s's it is top level plan, if it has s's, the number of s's tells you how deep it is.

I added #15 to address the concurrency issue and a potential prefix system for grouping lines of output to their thread/process.

Leont commented 9 years ago

The more I think about it, the more a flattened structure (a bit like TAP-Y) with a single counter makes sense to me. E.g.

ok 1 - Setup
case 2..3 Foo
ok 2 - Some subtest
ok 3 - Other subtest
1..3

Too bad it screws up existent testcounts :-s

isaacs commented 9 years ago

Hi, I maintain a test library for Node.js that outputs TAP http://npm.im/tap

I apologize for re-iterating some of what's been said above, but I wanted to chime in here and let you know why I'm planning on just copying Test::More unless someone can provide a better solution. (And if that means it ends up "not valid" or something, then I don't care much, unless "valid" is "at least as good as just copying Test::More".)

When writing tests, whether with perl or node-tap or any other test framework, it's common to do stuff like this:

plan(2)
assert.equal(a, b, 'these should be equal')
subtest('the second thing is actually many things', function (child) {
  child.plan(2)
  assert.equal(a, b, 'should still be equal')
  assert.equal(1, 1, 'math might be broken')
})

In node-tap, I went through ill-advised-in-retrospect approach of representing this sort of thing as:

ok 1 - should be equal
# the second thing is actually many things
ok 2 - should still be equal
ok 3 - math might be broken
1..3

which as you can imagine requires a lot of juggling and counters and whatnot.

It'd be easiest to instead follow Test::More's pattern and do:

1..2
ok 1 - should be equal
    1..2
    ok 1 - should still be equal
    ok 2 - math might be broken
ok 2 - the second thing is actually many things
  1. Knowing the future is not required, so the output can be generated as the tests are run. Streaming is super important when you have a single slow test that makes the whole thing lag.
  2. Because of (1), fancier pretty runners can provide more details to human users in real time.
  3. The output makes it clear what plan was set when, and to what value. No math!
  4. If someone runs this through jtap or some other system I don't control, it won't flip out about passing tests.
  5. There's not a lot of magic chars. Humans see nesting and know immediately what it means.

Simplicity both in the implementation, human-parsing, and machine-parsing is all super important here. Test frameworks need to be trustworthy code for obvious reasons. And while it's sometimes good to smash all the old parsers, that's just not practical, and will mean that it is not adopted. No matter how good it may be, building a thing no one uses is more expensive and no more useful than not building it. So, for node-tap at least, I'm gonna not build that.

I propose the following criteria for any subtest approach:

  1. It should be strictly line-based.
  2. Each line should be able to be printed without knowing the lines that come after it.
  3. Breaking old parsers should be minimized. Since most followed the guidelines that any non-TAP lines are treated as comments and ignored, this isn't hard to do. (In my experience, only the yamlish bits, or stuff that looks like TAP but is invalid, is a problem.)
  4. Conforming new parsers MUST fail a parent test when any subtest fails.
  5. Numeric test ids should be scoped to the child test (ie, a ok 2 in a child test set doesn't collide with ok 2 at the top level, or in a sibling test set under another parent).
  6. Plans at each level should refer to the tests at that level. Plan-arithmetic must be avoided.
  7. A child test block should be passed through a trivial per-line transform to be back to plain old regular TAP. (Indentation satisfies this; dot-separated numeric IDs does not.)

Some suggestions above that do not satisfy these criteria:

1..2
ok 1 - i can see the future
ok 2 - can't know this works until we run the next whole child set {
  1..2
  ok 1 - should still be equal
  ok 2 - math might be broken
}
1..2
ok 1 - not a trivial transform
2.1..2.2
ok 2.1 - should still be equal
ok 2.2 - math might be broken
ok 2 - my child is not tap
# Also this may *badly* confuse naive parsers who see a
# plan of 2 tests, followed by 4 "ok" lines!

Another suggestion from earlier in this thread that also follows these guides:

1..2
ok 1 - should be equal
{
  # the indentation here is important
  1..2
  ok 1 - should still be equal
  ok 2 - math might be broken
}
ok 2 - the second thing is actually many things

The indentation is important, because it prevents older parsers from getting upset about the re-used numeric id's, or the re-appearance of a plan. But I'm sure that there are other approaches that satisfy the 7 criteria above.

The } should ideally be on a line by itself, at the same level as the parent. I really don't want to do brace-counting and handle stuff like } } ok 2 - double brace what does it mean.

This should be considered invalid TAP by 14 parsers, and failed on:

# parent passes, children fail
1..2
ok 1 - should be equal
  1..2
  ok 1 - should still be equal
  not ok 2 - math is indeed broken
ok 2 - but the subtests were not ok!

However, I'd like to be able to fail a parent test, even if the child run is ostensibly passing. This allows for exit code analysis in runners, when the child proc outputs passing TAP and then throws an error or dies terribly.

With a format like that, I can trivially transform the output of a tap-producing child-process, inspect its tap and exit status, and treat the child process as a nested test. For example, if I run foo.sh perhaps it outputs:

1..2
ok 1 - should still be equal
ok 2 - math might be broken

Then, in my test framework, I can do this:

plan(2)
assert(true, 'true is truthy')
spawn_test('bash', ['foo.sh'])

and have it output something like this, and I can just indent each line.

1..2
ok 1 - true is truthy
# starting foo.sh
  1..2
  ok 1 - should still be equal
  ok 2 - math might be broken
ok 2 - foo.sh

So, I'm going to go ahead and start implementing Test::More-style subtests in node-tap. It's extremely satisfactory, easy to parse, easy to read, easy to generate, and already widely in use. Any other syntax that meets my criteria will be pretty easy to switch over to once I have that working, and any syntax that doesn't meet my criteria isn't something I'm interested in anyway.