Open ablack3 opened 1 year ago
I would suggest PandasAI and Synthea as starting points for this. A vector store with instructions may be all that is required to get similar or improved results over Synthea, with the added benefit of being less expensive than tuning a model and allowing the use of larger base models.
Another option is to take real data and generate synthetic data from it. Synthetic Data Vault is a good example of this. For example, from 100 real records you can generate more than 100 synthetic records with either GaussianCopula or CTGAN. It would be interesting to use this framework to add an LLM as a third method and evaluate the three against each other.
I’m interested in using this approach to see if we can create accurate synthetic data from a pretrained LLM. The first step would be to have an evaluation framework. Opening this issue to discuss this use case.
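To make the evaluation idea concrete, here is a rough sketch of what a minimal harness could look like. It is only an illustration under assumptions: the names `ks_statistic` and `evaluate` are hypothetical, each generator is assumed to be a callable that returns numeric columns as plain lists, and the two-sample Kolmogorov-Smirnov statistic is just one possible fidelity metric, not something this thread has settled on.

```python
from bisect import bisect_right
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a generator takes a row count and returns
# {column name: list of numeric values}.
Generator = Callable[[int], Dict[str, List[float]]]


def ks_statistic(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest gap between the two empirical CDFs:
    0.0 means identical distributions, 1.0 means no overlap.
    """
    if len(real) == 0 or len(synthetic) == 0:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to check the gap at every observed point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def evaluate(
    real_columns: Dict[str, List[float]],
    generators: Dict[str, Generator],
    n_rows: int,
) -> Dict[str, float]:
    """Mean per-column KS statistic for each generator (lower is closer)."""
    scores = {}
    for name, generate in generators.items():
        synthetic = generate(n_rows)
        per_column = [
            ks_statistic(values, synthetic[column])
            for column, values in real_columns.items()
        ]
        scores[name] = sum(per_column) / len(per_column)
    return scores
```

Each method under comparison (a Synthea export, an SDV synthesizer, an LLM prompt loop) would be wrapped as one entry in `generators`. This sketch only covers numeric columns and marginal distributions; categorical columns, cross-column correlations, and privacy checks would need separate metrics.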