bazingagin / npc_gzip

Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
MIT License
1.77k stars 156 forks

Test occasionally fails using top_k of 2 with just 1 sample #45

Closed EliahKagan closed 1 year ago

EliahKagan commented 1 year ago

I've noticed that test_predict in test_knn_classifier.py occasionally fails. I've seen this a few times, though most runs pass. One example is this test run (a CI run on #43, but the failure is in no way specific to #43, which only changes metadata in pyproject.toml). Re-running the tests passes.

It looks like the problem has to do with how the test uses random numbers. Here's the most relevant code from the test:

https://github.com/bazingagin/npc_gzip/blob/b05a7bb80f07b7c32edf80e34bfc6eedf637eacd/tests/test_knn_classifier.py#L119-L126

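Note that test_set_size is chosen randomly and can be as small as 1, but the test uses a top_k of 2. Whenever the size comes out as 1, top_k exceeds the number of samples and the assertion in predict fails. This seems to be the only problem, and I've proposed a fix in #46.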
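As a rough illustration of one way to avoid this, here is a minimal sketch that uses top_k as the lower bound of the random draw. The helper name draw_test_set_size and the max_size default of 50 are hypothetical, chosen only for illustration; they are not taken from the repository or from #46, and the actual fix may differ.

```python
import random


def draw_test_set_size(top_k: int, max_size: int = 50) -> int:
    """Draw a random test set size that is never smaller than top_k.

    Hypothetical helper: the name and the max_size default are
    illustrative assumptions, not code from npc_gzip.
    """
    if top_k > max_size:
        raise ValueError("top_k must not exceed max_size")
    # random.randint is inclusive on both ends, so the smallest
    # possible draw is exactly top_k.
    return random.randint(top_k, max_size)
```

With a draw like this, a top_k of 2 can never be paired with a single sample, so the test keeps its randomness without the occasional failure.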

For convenience, when the test fails, it shows:

>       assert (
            top_k <= x.shape[0]
        ), f"""
        top_k ({top_k}) must be less or equal to than the number of
        samples provided to be predicted on ({x.shape[0]})

        """
E       AssertionError: 
E               top_k (2) must be less or equal to than the number of
E               samples provided to be predicted on (1)

npc_gzip/knn_classifier.py:309: AssertionError
----------------------------- Captured stderr call -----------------------------

Compressing input...:   0%|          | 0/1 [00:00<?, ?it/s]
Compressing input...: 100%|██████████| 1/1 [00:00<00:00, 121.54it/s]
- generated xml file: /Users/runner/work/npc_gzip/npc_gzip/junit/test-results-macos-3.9.xml -
=========================== short test summary info ============================
FAILED tests/test_knn_classifier.py::TestKnnClassifier::test_predict - AssertionError: 
        top_k (2) must be less or equal to than the number of
        samples provided to be predicted on (1)
========================= 1 failed, 45 passed in 9.95s =========================

I have only shown the end of the output. Full output can be seen in the failing test run. The code that pytest includes in the output is from the npc_gzip.knn_classifier.KnnClassifier.predict method.
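To reproduce the condition deterministically without waiting for an unlucky random draw, the precondition can be imitated with a small stand-in. This is only a sketch: check_top_k is a hypothetical function that mirrors the assertion shown in the output above, not the library's own code, and it takes the sample count directly instead of reading x.shape[0].

```python
def check_top_k(top_k: int, n_samples: int) -> None:
    """Stand-in for the precondition seen in the traceback.

    Hypothetical name and signature; the real check lives inside
    KnnClassifier.predict and reads the count from x.shape[0].
    """
    assert top_k <= n_samples, (
        f"top_k ({top_k}) must be less than or equal to the number of "
        f"samples provided to be predicted on ({n_samples})"
    )


# Two samples with a top_k of 2 satisfies the precondition.
check_top_k(top_k=2, n_samples=2)

# One sample with a top_k of 2 is the case the flaky test hits.
try:
    check_top_k(top_k=2, n_samples=1)
except AssertionError as exc:
    print(f"AssertionError: {exc}")
```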