Synthetic data generation from LLMs

OHDSI / Apollo

[Under development] Assessment of Pre-trained Observational Large Language-models in OHDSI (APOLLO)

I would suggest PandasAI and Synthea as starting points for this. A vector store with instructions may be all that is required to get results similar to or better than Synthea, with the added benefits of being less expensive than tuning a model and allowing the use of larger base models.
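To make the "vector store with instructions" idea concrete, here is a minimal sketch of the retrieval step only. Everything in it is an assumption for illustration: the instruction snippets are hypothetical, the bag-of-words `embed` function stands in for a real embedding model, and no LLM is actually called. It is not how PandasAI or Synthea work internally.

```python
import math
import re
from collections import Counter

# Hypothetical instruction snippets; a real store would hold curated generation rules.
INSTRUCTIONS = [
    "Condition records use SNOMED concept codes and need a start date.",
    "Drug exposure records need a drug concept, a start date and a days supply value.",
    "Measurement records need a numeric value and a unit.",
]


def embed(text):
    """Toy embedding: bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, k=1):
    """Return the k stored instructions most similar to the query."""
    q = embed(query)
    ranked = sorted(INSTRUCTIONS, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(request, k=1):
    """Assemble the text that would be sent to a base model (no model call here)."""
    context = "\n".join(f"- {s}" for s in retrieve(request, k))
    return f"Follow these instructions:\n{context}\n\nTask: {request}"


print(build_prompt("Generate a synthetic drug exposure record"))
```

The point of the design is that the generation rules live in the store rather than in model weights, so they can be edited without any tuning and the same prompt can be sent to whichever base model is available.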

Another option is to start from real data and generate synthetic data from it. Synthetic Data Vault (SDV) is a good example: from, say, 100 real records you can sample more than 100 synthetic ones with GaussianCopula or CTGAN. It would be interesting to use this framework to add an LLM as a third method and evaluate the three against each other.
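As a rough illustration of what expanding 100 records with a Gaussian copula means, here is a standard-library sketch for two numeric columns. It is not SDV's implementation or API: the column names, the toy "real" data, and the helper functions are all made up for the example, and the marginals are purely empirical, so it only recombines values that were actually observed.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Map each value to a standard-normal score via its empirical rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    scale = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / scale


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: pick the observed value at quantile u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def expand(xs, ys, n_out, seed=0):
    """Sample n_out synthetic (x, y) pairs that keep the rank dependence of the input."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(xs), normal_scores(ys))
    sx, sy = sorted(xs), sorted(ys)
    out = []
    for _ in range(n_out):
        # Draw a correlated pair of standard normals, then map back to data space.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        out.append(
            (
                empirical_quantile(sx, STD_NORMAL.cdf(z1)),
                empirical_quantile(sy, STD_NORMAL.cdf(z2)),
            )
        )
    return out


# Toy stand-in for 100 real records: age and a loosely related blood pressure.
data_rng = random.Random(42)
ages = [data_rng.uniform(20, 80) for _ in range(100)]
pressures = [90 + 0.6 * a + data_rng.gauss(0, 5) for a in ages]

synthetic = expand(ages, pressures, n_out=500)
print(len(synthetic), "synthetic records from", len(ages), "real ones")
```

A comparison between the three methods could start from the same kind of check used here: whether the synthetic sample preserves the marginal ranges and the correlation of the real data.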

Synthetic data generation from LLMs #7