Technion-Kishony-lab / data-to-paper

data-to-paper: Backward-traceable AI-driven scientific research
MIT License

Scattered notes after some attempted runs #12

Closed. Denubis closed this issue 2 days ago.

Denubis commented 2 months ago

Just evaluating this tool.

Some notes:

  1. It's important that the user be able to edit and override anything, including past LLM outputs. (Obviously, this should be logged, but nothing should be "locked").
  2. There should probably be a package deployment step. (I.e., after exploration, if it's geodata, we need to install, load, and evaluate geopandas; if it's qual data, we need other packages; etc.)
  3. When it gets "stuck", it's important to be able to revert to earlier in the chain. (I've had two failed runs so far)
  4. When running, your README says: "python data_to_paper/data_to_paper/run/run.py". That script no longer exists at that path; it is available in the venv's bin/ directory, though (and in scripts/).
  5. Many of the tests no longer run.
  6. Setting up a new run from old data should be easier. (I.e., I'm just getting started with my third attempt at a given dataset and need to find all the source descriptors again.)
  7. Don't hardcode file paths; it reduces portability.
  8. I'd quite like to discuss your prompts and system prompts, but I'll submit the ones which work for me once I've been able to finish a run.
  9. In the feedback loop cycle, having to toggle between output and feedback in the bottom right corner is somewhat tedious, especially with the giant prompt taking up most of the screen. Perhaps each session should be a series of accordions showing the history of that step's interactions (and providing better editing capability?).
  10. When dealing with qual data, there needs to be more emphasis on preparing the data (making sure that units are incorporated so they can be passed to subsequent contexts, that categorical data is correctly imported, etc.).
  11. (Just failed run 3 because it ran into an error I cannot fix -- my API keys weren't defined. Having a "test APIs" step before starting seems like a good idea? Also, you should be able to fetch pricing information.)
  12. Consider using JSON mode output (https://platform.openai.com/docs/guides/text-generation/json-mode). See the sketch after this list for what I mean.
  13. A "sensemaking" check is needed. In your twitter example, the central claim is: "The negative values of the coefficients indicate that interactions are less likely between members from states that are numerically distant in the dataset’s coding system," and that's ... hilariously bad? (Also, a case where implementing geopandas is probably indicated.)
  14. Being able to load in a bibliography seems like a good idea.
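
For point 12, here is a minimal sketch of what JSON mode looks like with the openai Python client, just to illustrate the idea. The model name, prompts, and key setup are placeholders of mine, not anything taken from data-to-paper:

```python
# Minimal illustration of OpenAI's JSON mode; model name and prompts are placeholders.
# Assumes openai>=1.0 and an OPENAI_API_KEY set in the environment.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},  # ask the API to emit valid JSON
    messages=[
        # JSON mode requires the prompt itself to mention JSON.
        {"role": "system", "content": "Reply in JSON with keys 'claim' and 'evidence'."},
        {"role": "user", "content": "Summarize the main finding of the analysis."},
    ],
)

# The returned content can be parsed directly instead of scraping free text.
result = json.loads(response.choices[0].message.content)
print(result)
```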
TalIfargan commented 2 months ago

Hi @Denubis,

Thanks for testing data-to-paper and providing this list of comments and ideas! I will try to address most of them and would be happy to keep this discussion going:

  1. It is a design choice. We were considering the best way to perform the interventions without making the process too overwhelming. We chose to put the user in the "reviewer" chair rather than in both the "performer" and the "reviewer" chairs.
  2. One can choose which packages to allow in each coding step by setting the supported_packages attribute in the designated CodeProductsGPT class (see the rough sketch after this list). It is not currently exposed to the user in the app, but perhaps you can suggest how you would implement such a step?
  3. Agree. Please see issue #6, which suggests implementing exactly this feature.
  4. Will be fixed!
  5. Can you test again on the latest commit on main? On my machine, all tests pass. If there are still failing tests, can you please open a new issue with a description of your system configuration (OS, how you installed data-to-paper, are you using a conda environment or other env manager) and which tests are failing (and, of course, how they are failing)? We can then look into them and fix them.
  6. If I understood you correctly, it should be possible to save the configurations you defined for the specific project/dataset to a JSON file (using the Save button) and then load them (using the Open button).
  7. Can you point to the specific hardcoded paths you have found? I will try to change them to relative paths.
  8. Sure! It would be great to see your thoughts and ideas.
  9. That's a nice suggestion! It would be very nice if you could put up a PR to implement (or even partially implement) that idea, and we can start a discussion there on how to improve UX regarding that part of the process.
  10. I think there are two issues here: (1) The question of generalizability of the system to different types of data. If you implement a very specific suite of algorithmic checks and guardrails, you are effectively blocking the system from accepting other types of input data; (2) The practicality of implementing those algorithmic tests. How would you know that all the "correct" steps for each specific dataset were done? We tried to implement a "robust" series of comments and reviews to ensure this, but I think we are hitting the limits of current generation LLMs.
  11. We are currently recording exceptions as part of the process. It is not obvious that this is the correct approach. We might decide to change this at some point, allowing you to continue from where you "got stuck" the last time. A workaround for now is to delete the latest line from the openai_responses.txt file that included the exception, fix the issue, and rerun the same project/dataset. Testing the API keys before the run is something we might implement (see the sketch after this list) - I will add a separate issue about this.
  12. JSON mode is specific to OpenAI, so relying on it would prevent using other LLMs as the underlying engine for data-to-paper. Of course, we could implement it for the specific case of OpenAI models, but it seems like a significant overhaul that might not be worth it.
  13. How would you conduct that "sensemaking" check other than manually vetting the results? I think, once again, we are hitting the limits of current LLMs. No system is 100% bulletproof against hallucinations or nonsensical outputs, even with multiple review iterations. Human intervention and oversight are necessary.
  14. What do you mean by loading a bibliography, and in which step of the process?
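
To make point 2 concrete, here is a simplified sketch of the supported_packages mechanism. The real CodeProductsGPT class in data-to-paper carries more configuration, and the exact base classes and attribute handling may differ:

```python
# Simplified sketch only; the real CodeProductsGPT class has more configuration,
# and the exact class hierarchy in data-to-paper may differ.

class CodeProductsGPT:
    # Packages the LLM-written code is allowed to import in this coding step.
    supported_packages = ('numpy', 'pandas', 'scipy', 'statsmodels')


class GeoDataAnalysisCodeProductsGPT(CodeProductsGPT):
    # A geodata project could extend the allowed packages for its analysis step.
    supported_packages = CodeProductsGPT.supported_packages + ('geopandas', 'shapely')
```

Exposing this in the app would essentially mean letting the user edit that tuple per coding step.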
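
And regarding testing the API keys before a run (point 11), the check could be as simple as one cheap authenticated request before anything else starts. A minimal sketch, assuming the openai>=1.0 Python client and a key in OPENAI_API_KEY (a real implementation would also cover any other configured APIs):

```python
# Minimal pre-run key check sketch; not data-to-paper's actual code.
import sys
from openai import OpenAI, OpenAIError


def check_openai_key() -> bool:
    """Return True if an OpenAI key is configured and authenticates successfully."""
    try:
        OpenAI().models.list()  # cheap request; raises on missing or invalid keys
        return True
    except OpenAIError:
        return False


if __name__ == '__main__':
    if not check_openai_key():
        sys.exit("OpenAI API key is missing or invalid; aborting before the run starts.")
    print("OpenAI API key OK.")
```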