TheR1D / shell_gpt

A command-line productivity tool powered by AI large language models like GPT-4 that will help you accomplish your tasks faster and more efficiently.
MIT License

Integration tests could be flakey if they depend on LLM output #446

Closed: jeanlucthumm closed this issue 10 months ago

jeanlucthumm commented 10 months ago

I ran bash scripts/test.sh at d908d1d and got a failure:

$ bash scripts/test.sh
+ pytest tests
================================================================================================================================= test session starts =================================================================================================================================
platform linux -- Python 3.11.5, pytest-7.4.2, pluggy-1.3.0
rootdir: /home/jeanluc/Code/shell_gpt
plugins: asyncio-0.21.1, anyio-3.7.1
asyncio: mode=Mode.STRICT
collected 25 items                                                                                                                                                                                                                                                                    

tests/test_integration.py ..............F.........                                                                                                                                                                                                                              [ 96%]
tests/test_unit.py .                                                                                                                                                                                                                                                            [100%]

====================================================================================================================================== FAILURES =======================================================================================================================================
____________________________________________________________________________________________________________________________ TestShellGpt.test_repl_shell _____________________________________________________________________________________________________________________________

self = <tests.test_integration.TestShellGpt testMethod=test_repl_shell>

    def test_repl_shell(self):
        # Temp chat session from previous test should be overwritten.
        dict_arguments = {
            "prompt": "",
            "--repl": "temp",
            "--shell": True,
        }
        inputs = ["What is in current folder?", "Simple sort by name", "exit()"]
        result = runner.invoke(
            app, self.get_arguments(**dict_arguments), input="\n".join(inputs)
        )
        assert result.exit_code == 0
        assert "type [e] to execute commands" in result.stdout
        assert ">>> What is in current folder?" in result.stdout
        assert ">>> Simple sort by name" in result.stdout
>       assert "ls -la" in result.stdout
E       AssertionError: assert 'ls -la' in 'Entering shell REPL mode, type [e] to execute commands or [d] to describe the commands, press Ctrl+C to exit.\n>>> What is in current folder?\nls\n>>> Simple sort by name\nls | sort\n>>> exit()\n'
E        +  where 'Entering shell REPL mode, type [e] to execute commands or [d] to describe the commands, press Ctrl+C to exit.\n>>> What is in current folder?\nls\n>>> Simple sort by name\nls | sort\n>>> exit()\n' = <Result okay>.stdout

tests/test_integration.py:287: AssertionError
=============================================================================================================================== short test summary info ===============================================================================================================================
FAILED tests/test_integration.py::TestShellGpt::test_repl_shell - AssertionError: assert 'ls -la' in 'Entering shell REPL mode, type [e] to execute commands or [d] to describe the commands, press Ctrl+C to exit.\n>>> What is in current folder?\nls\n>>> Simple sort by name\nls | sort\n>>> exit()\n'
======================================================================================================================= 1 failed, 24 passed in 97.68s (0:01:37) =======================================================================================================================

This is because my sgpt returned ls (and then ls | sort) instead of the ls -la that the test asserts on.

Since LLM output is non-deterministic, is it a good idea to have integration tests that depend on it?
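The brittleness can be reproduced without calling the model, using the stdout captured in the log above. This is only a sketch: the helper `contains_ls_variant` is hypothetical and not part of shell_gpt, and it merely illustrates how an exact-substring assertion differs from a looser one.

```python
import re

# stdout copied from the failing run in the log above.
stdout = (
    "Entering shell REPL mode, type [e] to execute commands or [d] to "
    "describe the commands, press Ctrl+C to exit.\n"
    ">>> What is in current folder?\nls\n"
    ">>> Simple sort by name\nls | sort\n>>> exit()\n"
)

# The check used by the test: it requires one specific spelling of the command.
exact_match = "ls -la" in stdout


def contains_ls_variant(text: str) -> bool:
    """Hypothetical looser check: any output line that begins with the ls command."""
    return any(re.match(r"ls(\s|$)", line) for line in text.splitlines())


loose_match = contains_ls_variant(stdout)
print(exact_match, loose_match)
```

A looser check like this would have passed on this run, but it still depends on what the model returns, so it reduces the flakiness rather than removing it.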

TheR1D commented 10 months ago

I agree, this is not the best way to test sgpt. The integration tests will be replaced by "mocked" tests in https://github.com/TheR1D/shell_gpt/pull/442
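For readers unfamiliar with the idea, a mocked test replaces the model call with a canned reply so the assertion no longer depends on live LLM output. The following is a minimal sketch and not the code from the linked pull request: `run_shell_prompt` and the client's `complete` method are hypothetical names standing in for whatever sgpt calls internally.

```python
from unittest.mock import MagicMock


def run_shell_prompt(client, prompt: str) -> str:
    """Hypothetical stand-in for the code under test: ask the client for a command."""
    return client.complete(prompt).strip()


def test_shell_prompt_with_mocked_client():
    # Swap the real LLM client for a mock that always returns the same reply.
    client = MagicMock()
    client.complete.return_value = "ls -la\n"

    result = run_shell_prompt(client, "What is in current folder?")

    # The output is now deterministic, so an exact assertion is safe.
    assert result == "ls -la"
    # Also verify that the prompt was passed through unchanged.
    client.complete.assert_called_once_with("What is in current folder?")


test_shell_prompt_with_mocked_client()
```

The trade-off is that a test like this exercises only the code around the model call, not the quality of the model's answers.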