Closed crantila closed 7 years ago
I think that the expected values should always be in the same script as the test that calls them. While this does bloat the code of each test file a little bit, in my opinion it is much easier to deal expected's that are in the same script and this make the process of accessing them simpler and less error-prone.
I'm not sure of the best resolution to this. Note that this issue isn't about all "expected" values, but all the files with "expected" values. In any case, the point at which it makes sense to break an "expected" or "input" value into its own file is not always obvious, and I'm not convinced I got it right.
Consider this file (and others in that directory). This is a hand-written, hand-verified file. Before such verification, the data were held in "test_intervals.py," but when I went to count by hand, I found it was much easier to read and verify in a separate file. Plus, it's 113 lines long; if we include all four of the intervals "expecteds," that adds 480 lines to a file that's already 750 lines long—it's quite a lot. Of course we could make longer lines to save in the line count, but that compromises the ease of verification.
Another big benefit of this approach is that the same "expected" values are more easily shared between modules, and can also be used as "input" values for other tests. Those files full of intervals, for example, can be fed into the NGramIndexer
as part of a unit test. Otherwise we would have to copy-and-paste the intervals into "test_ngram.py". (Looks like copy-paste is how it currently is... oops). Anyway, if we do this properly, it means that changing the expected output from one Indexer has the side-effect of forcing us to make sure that it either already works with other Indexers, or that we modify the other Indexers so it works again.
As it is now, there are a bunch of files in the "tests" directory that seem to hold something that belongs in a Series or DataFrame, and it's not at all clear what they do. Plus, like you said, accessing these data is much more error-prone than if they were simply in the test files where they're used. Worse, the files that are pickled are totally unreadable unless we open them with pandas. So there are definitely problems with the existing approach.
Do you (does anyone) have ideas for compromising between all these issues at play?
I'm closing this because nobody cares.
They don't belong in "corpus" and they also don't really belong in "tests," but a dedicated directory makes sense. This will involve both moving the files and updating the tests that use them.