Test data should be meaningful. And the smaller we set this limit, the harder it is to create them. We have several tools that can hardly stick to the 1MB rule, and the more complex the tools are, the harder it seems to be to create small test data.
What are the concerns with big test data? Can we list them and then see if we can address those concerns?
Good test data can be so much more than just test data. If we expose meaningful and good test data to the user, they can understand and learn tools more efficiently. We have this already in place with tool-describing tours. So I would rather not restrict tool devs here any further, if we can avoid that.
Good point Bjorn. If the test data can double as realistic sample data even better.
Should the same size recommendation apply both to bundled files and to those fetched on demand via a URL, if/when that becomes possible?
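If on-demand fetching does become possible, one way it could work is sketched below; the helper, cache layout, and checksum handling are purely hypothetical and illustrative, not an existing Galaxy or planemo API.

```python
# Hypothetical sketch: fetch a remote test-data file into a local cache and
# verify its checksum before a test uses it. Illustrative only; not an
# existing Galaxy or planemo API.
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "tool-test-data"

def fetch_test_data(url: str, sha256: str) -> Path:
    """Download `url` once, cache it content-addressed, and verify the checksum."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / sha256  # filename is the expected checksum
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != sha256:
        target.unlink()
        raise ValueError(f"Checksum mismatch for {url}: got {digest}")
    return target
```

With content-addressed caching a file is only downloaded once per test runner, which would limit the runtime impact; hosting, versioning, and long-term availability remain the open questions discussed below.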
Test data should be meaningful.
To me the purpose of test data is to make sure we generate a valid command line for the tool. If you look at the test data in Galaxy you won't find a single realistic set of test data ... because that's not the purpose of these tests.
Good test data can be so much more than just test data. If we expose meaningful and good test data to the user, they can understand and learn tools more efficiently.
I doubt that just giving users some test data is enough to understand the input parameters of a tool. You could list some example data in the tool, which seems like a good idea, but I don't think we want to use these as tests.
To be clear, I don't mind keeping it at 1MB, I just think tests are for tests and that we'd need to do something else if we want to "demo" a tool.
To me the purpose of test data is to make sure we generate a valid command line for the tool. If you look at the test data in Galaxy you won't find a single realistic set of test data ... because that's not the purpose of these tests.
In IUC and in other repos there is a lot of good, valid test data. And we test more than just the command line. We test dependencies, and on updates we also check whether a tool update changes the results in an expected way. At least I do this, when possible.
I doubt that just giving users some test data is enough to understand the input parameters of a tool. You could list some example data in the tool, which seems like a good idea, but I don't think we want to use these as tests.
It's not about the input parameters. It's about the general function of the tool. Many users would like to understand what an input needs to look like and what the output looks like, for example. I end up recommending the test data to users to help them understand how inputs need to be formatted, or simply to confirm the tool is working. We can link other test data, or we can reuse the test data that needs to be crafted by the tool author either way. Why repeat the work here? The TDT is actually going one step further and explains parameters etc., but that is then the next level.
I guess before we change the limit I would like to understand the reason for a limit in the first place. The old reason I remember is the size of the GitHub repo, which I think is a technical limitation that we should fix in other ways.
We test dependencies, and on updates we also check whether a tool update changes the results in an expected way.
We ensure that we're still correctly building command lines that produce the expected result. Not more, not less.
1. It's not our job to make sure the underlying tool is correct. Yes, we can report issues we find, but we shouldn't waste our time on this or be expected to be some sort of external CI.
2. All the bugs we've found in bedtools, for instance, were present in tiny test data.
How big is big enough to be realistic anyway? There's no realistic limit on that, is there?
I guess before we change the limit I would like to understand the reason for a limit in the first place. The old reason I remember is the size of the GitHub repo, which I think is a technical limitation that we should fix in other ways.
The number one thing is test runtime.
Many users would like to understand what an input needs to look like and what the output looks like, for example.
I disagree with the premise, but I would like to point out that this is much clearer with synthetic test data than with large "realistic" data.
The TDT is actually going one step further and explains parameters etc., but that is then the next level.
The TDT has zero information to go on about explaining the parameters, beyond what is in the tool interface anyway. Why does it matter if the test data is large for this purpose? How do you want the TDT to explain to users what is important about the test data, where it came from, etc.? Also, which of the many tests that a good wrapper has is the one the TDT should explain? Again, that's another discussion we can have about a demo mode. Using test data for this seems wrong to me.
I agree that users should be able to run the test from the interface if they want to know that the tool actually works, but what's the point of that test case being large? I feel like you're concluding that seeing test data makes users understand the test data and why it's been chosen. I think that is a huge jump, where the TDT approach has no way of taking the user by the hand and little way of taking the intermediate steps. The TDT idea of re-using test cases we already have sounds good in theory, but I don't think it is helpful in practice.
If that's the reason for large test data I think we're hampering our testing efforts for very little benefit.
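To make the "tiny synthetic data" point concrete, here is a minimal sketch of generating a BED input of a few hundred bytes that still covers the boundary cases an interval tool cares about; the coordinates and filenames are invented for illustration.

```python
# Sketch: a tiny synthetic BED file (well under 1 KB) that still exercises
# interesting interval cases: touching intervals, full containment, and a
# pair that can never overlap. All values are made up for the example.
from pathlib import Path

intervals = [
    ("chr1", 100, 200, "a"),  # baseline interval
    ("chr1", 200, 300, "b"),  # touches "a" at a single boundary
    ("chr1", 120, 180, "c"),  # fully contained within "a"
    ("chr2", 0, 50, "d"),     # different chromosome, never overlaps
]

out_dir = Path("test-data")
out_dir.mkdir(exist_ok=True)
lines = [f"{chrom}\t{start}\t{end}\t{name}" for chrom, start, end, name in intervals]
(out_dir / "tiny.bed").write_text("\n".join(lines) + "\n")
```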
We test dependencies, and on updates we also check whether a tool update changes the results in an expected way.
We ensure that we're still correctly building command lines that produce the expected result. Not more, not less.
1. It's not our job to make sure the underlying tool is correct. Yes, we can report issues we find, but we shouldn't waste our time on this or be expected to be some sort of external CI.
2. All the bugs we've found in bedtools, for instance, were present in tiny test data.
Agreed. Where do we disagree here? ;) The question is what the expected results are, how you can define them, and what size of inputs we need.
There are just plenty of tools that do need test data of > 500kb, and a few even bigger than 1MB. The trick of moving them to the bgruening repo because they do not comply with the IUC guidelines is not good and is an artificial barrier for tool devs.
And we waste a lot of time reducing test data to under 1MB; it's the number one hurdle for tool devs imho.
I do agree that we should reduce test data as much as possible, and I think we should have a recommendation, but enforcing it and giving no way out is problematic imho. Reducing 1MB to 500kb just increases the pain.
How big is big enough to be realistic anyway? There's no realistic limit on that, is there?
As small as possible is the current and good recommendation. No one is talking about big or large.
I guess before we change the limit I would like to understand the reason for a limit in the first place. The old reason I remember is the size of the GitHub repo, which I think is a technical limitation that we should fix in other ways.
The number one thing is test runtime.
When did we have problems with runtime? We had several problems with file size and always need to remind people to shrink their data, but runtime? Maybe with mothur and its 120 tools?
Many users would like to understand what an input needs to look like and what the output looks like, for example.
I disagree with the premise, but I would like to point out that this is much clearer with synthetic test data than with large "realistic" data.
I'm not talking about realistic data, as in real-world data, but data that at least creates a meaningful output (aka not empty ;)). This very often ends up being synthetic data, after many, many hours of fiddling. I have done this too many times, e.g. for GalaxyP; this is not fun.
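A sketch of the scripted alternative to hand-trimming: generate the synthetic input programmatically, so it is reproducible and easy to adjust until the tool's output is non-empty. The sequences below are placeholders, not real GalaxyP data.

```python
# Sketch: script the creation of a small synthetic protein FASTA instead of
# trimming a real dataset by hand. The records are placeholders; the point is
# that the generator is reproducible and easy to tweak.
import random
from pathlib import Path

random.seed(42)  # fixed seed so the test data is reproducible

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_protein(length: int) -> str:
    return "".join(random.choice(AMINO_ACIDS) for _ in range(length))

records = [f">synthetic_{i}\n{random_protein(80)}" for i in range(20)]
Path("synthetic.fasta").write_text("\n".join(records) + "\n")
```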
The TDT is actually going one step further and explains parameters etc., but that is then the next level.
The TDT has zero information to go on about explaining the parameters, beyond what is in the tool interface anyway. Why does it matter if the test data is large for this purpose? How do you want the TDT to explain to users what is important about the test data, where it came from, etc.? Also, which of the many tests that a good wrapper has is the one the TDT should explain? Again, that's another discussion we can have about a demo mode. Using test data for this seems wrong to me.
And we have and had a plan for that.
I agree that users should be able to run the test from the interface if they want to know that the tool actually works, but what's the point of that test case being large? I feel like you're concluding that seeing test data makes users understand the test data and why it's been chosen. I think that is a huge jump, where the TDT approach has no way of taking the user by the hand and little way of taking the intermediate steps. The TDT idea of re-using test cases we already have sounds good in theory, but I don't think it is helpful in practice.
If that's the reason for large test data I think we're hampering our testing efforts for very little benefit.
Sorry, I'm not talking about large test data. As small as possible. We see that we can create good tests within 1MB for 90% (?) of the tools, and maybe 99% with 1-10MB? We could look over the tools we have if we want proper numbers on that.
My bottom line is to not change the test-data recommendation, but to get the URI idea that we have been talking about for 5 years implemented instead.
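To check the 90%/99% guesses against reality, a survey like the following could be run over a local tools-iuc checkout; it assumes the conventional tools/<name>/test-data layout and is only a sketch.

```python
# Sketch: survey test-data file sizes in a local tools-iuc checkout and report
# how many exceed 1 MB. Assumes the conventional tools/<name>/test-data layout.
from pathlib import Path

REPO = Path("tools-iuc")  # path to a local clone; adjust as needed
LIMIT = 1_000_000         # 1 MB, roughly the current recommendation

files = [p for p in REPO.glob("tools/*/test-data/**/*") if p.is_file()]
oversized = [p for p in files if p.stat().st_size > LIMIT]

print(f"{len(files)} test-data files, {len(oversized)} over 1 MB")
for p in sorted(oversized, key=lambda p: p.stat().st_size, reverse=True)[:20]:
    print(f"{p.stat().st_size / 1e6:6.1f} MB  {p}")
```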
My bottom line is to not change the test-data recommendation, but to get the URI idea that we have been talking about for 5 years implemented instead.
Do you want to spec out how this should work, where we store the data, what the versioning should look like, and a back-of-the-envelope calculation of how much that will cost given the storage limits you think we'll need? Something in the Galaxy ecosystem should probably keep a hand on the test data so it doesn't disappear, and it should be open to the community, so those are some of the immediate challenges we'd have to work out.
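For the back-of-the-envelope part, a rough calculation with deliberately made-up numbers suggests raw storage would be cheap; the harder parts are the curation, versioning, and availability questions raised above.

```python
# Back-of-the-envelope sketch with made-up, deliberately generous numbers;
# the real cost drivers would be curation, versioning, and egress, not storage.
n_tools = 2000              # assumed number of wrappers using the registry
avg_mb_per_tool = 10        # assumed average test-data size per tool, in MB
price_per_gb_month = 0.025  # assumed object-storage price, USD per GB-month

total_gb = n_tools * avg_mb_per_tool / 1024
monthly_cost = total_gb * price_per_gb_month
print(f"~{total_gb:.0f} GB of test data, roughly ${monthly_cost:.2f}/month to store")
```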
I should :( I will make this a priority for me next week. I have this here as an initial idea: https://github.com/bgruening/test-data-registry
(Nextflow and Snakemake would be interested to join this effort)
I will write my ideas down.
Ok, I added some initial thoughts here: https://github.com/bgruening/test-data-registry/blob/main/README.md
@nsoranzo I kind of prefer this over https://github.com/galaxyproject/tools-iuc/pull/4327