OHDSI / Apollo

[Under development] Assessment of Pre-trained Observational Large Language-models in OHDSI (APOLLO)

Synthetic data generation from LLMs #7

Open ablack3 opened 1 year ago

ablack3 commented 1 year ago

I’m interested in using this approach to see if we can create accurate synthetic data from a pretrained LLM. First step would be to have an evaluation framework. Opening this issue for discussion of this use case.
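
As a starting point for the evaluation framework mentioned above, here is a minimal sketch of one possible fidelity check: the total variation distance (TVD) between the real and synthetic distributions of each categorical column. The function names and the list-of-dicts row format are assumptions made for illustration, not part of Apollo, and a fuller framework would also need numeric-column, joint-distribution, and privacy metrics.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """TVD between two categorical samples: 0 means identical, 1 means disjoint."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    p = Counter(real)
    q = Counter(synthetic)
    categories = set(p) | set(q)
    # Counter returns 0 for missing keys, so unseen categories count in full.
    return 0.5 * sum(
        abs(p[c] / len(real) - q[c] / len(synthetic)) for c in categories
    )


def column_fidelity(real_rows, synthetic_rows, columns):
    """Return {column: TVD} for rows given as dicts (hypothetical row format)."""
    return {
        col: total_variation_distance(
            [r[col] for r in real_rows], [s[col] for s in synthetic_rows]
        )
        for col in columns
    }
```

Lower scores indicate that the synthetic column's marginal distribution is closer to the real one; this says nothing about correlations between columns.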

haydenbspence commented 1 year ago

I would suggest PandasAI and Synthea as starting points for this. A vector storage with instructions may be all that is required to get similar or improved results over Synthea -- with the added benefit of being less expensive than tuning a model and allowing the use of larger base models.

Another option is to take real data and generate synthetic data from it. Synthetic Data Vault is a good example of this. From, for example, 100 real records you can expand to more than 100 with GaussianCopula and CTGAN. It would be interesting to use this framework to add an LLM as a third method and evaluate across the three.