Improve & Address Bugs in `test_retrieval`, the Batch Test Question DAG

Bug

Describe the bug

Traceback

[2024-02-13, 17:31:11 EST] {taskinstance.py:2699} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/decorators/base.py", line 242, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 199, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/operators/python.py", line 216, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/airflow/dags/monitor/test_retrieval.py", line 210, in generate_test_answers
    questions_df[["askastro_answer", "askastro_references", "langsmith_link"]] = questions_df.question.apply(
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 4079, in __setitem__
    self._setitem_array(key, value)
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 4138, in _setitem_array
    self._iset_not_inplace(key, value)
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 4157, in _iset_not_inplace
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

question_number_subset
- Debug logging shows that questions_df is empty at the point of failure. This only occurs when a subset of question IDs is supplied.
- The question_number_subset param isn't parsed correctly. The json.loads() call that is meant to turn the string into a list of ints does not do so correctly, so no questions are selected and questions_df ends up empty.
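One possible way to make the parsing more tolerant is sketched below. This is only a sketch: `parse_question_subset` is a hypothetical helper, not code from the DAG, and it assumes the param may arrive either as a JSON string such as "[1,2,3]" or as an already-parsed list.

```python
import json


def parse_question_subset(raw):
    """Hypothetical helper: normalize the subset param to a list of ints.

    Returns None when no subset was supplied (assumed here to mean
    "run every question").
    """
    if raw is None:
        return None
    if isinstance(raw, str):
        raw = raw.strip()
        if not raw:
            return None
        # "[1,2,3]" -> [1, 2, 3]; a bare "3" -> 3
        raw = json.loads(raw)
    if isinstance(raw, (int, str)):
        # A single ID was given instead of a list.
        raw = [raw]
    # Coerce every entry so that "2" and 2 are treated the same.
    return [int(item) for item in raw]


subset = parse_question_subset("[1,2,3]")
```

Checking that the resulting selection is non-empty before the apply() call would also turn the opaque pandas ValueError into a clearer error message.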

To Reproduce

Steps to reproduce the behavior:

1. Configure the environment variables required by the test_retrieval DAG.
2. Trigger the DAG.
3. Enter a list of subset question IDs in the parameter prompt, such as [1,2,3].
4. The DAG run errors out.

Expected behavior

No errors.

Improvements

- The references saved in the CSV are in a random, incorrect order. This is probably because they are put into a set using {} somewhere.
- The multi-query references and the Weaviate search references are not relevant. They don't provide useful info, but they delay the pipeline and incur cost.
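If the ordering problem does come from a set, duplicates can be removed while keeping the original order. The snippet below is a minimal sketch with placeholder reference strings; `dedupe_preserving_order` is a hypothetical name, not a function in the DAG.

```python
def dedupe_preserving_order(references):
    """Drop duplicate references while keeping first-seen order.

    dict keys preserve insertion order (Python 3.7+), whereas a set
    gives no ordering guarantee.
    """
    return list(dict.fromkeys(references))


# Placeholder values standing in for reference URLs.
refs = ["doc-b", "doc-a", "doc-b", "doc-c", "doc-a"]
ordered = dedupe_preserving_order(refs)
```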

astronomer/ask-astro issue #298