In the end I opted for generating the entire movie dataset, largely because I'm not a lawyer and can't really guarantee that the "free for non-commercial use" files are OK for us. For benchmarks we don't care about real data; we care that the data has realistic structure and that we can create an arbitrary amount of it. The current data generator does exactly that: it produces approximately the correct proportions of movies, directors, actors, etc.
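To give a feel for what "approximately the correct proportions" means, here's a minimal sketch of that idea in Python. The ratios and the names (`RATIOS`, `generate`) are hypothetical illustrations, not the actual generator's API:

```python
import random

# Illustrative ratios only -- the real generator's proportions differ.
RATIOS = {"people": 4}

def generate(num_movies: int) -> dict:
    """Generate a dataset whose entity counts scale with num_movies."""
    people = [f"person_{i}" for i in range(num_movies * RATIOS["people"])]
    movies = []
    for i in range(num_movies):
        movies.append({
            "title": f"Movie {i}",
            # One director plus a small random cast, drawn from a shared
            # pool so the same person can appear in several movies.
            "director": random.choice(people),
            "cast": random.sample(people, k=min(8, len(people))),
        })
    return {"people": people, "movies": movies}
```

The point is that everything hangs off a single size parameter, so scaling the dataset up or down keeps the entity ratios roughly constant.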
There's a simple converter that takes this data and outputs a whole bunch of EQL. However, for a non-trivial dataset we're talking megabytes of EQL, and if we ever want a dataset with 500,000 reviews that'll easily be close to 500 MB of EQL. So feeding it into the server may require tweaking the output a bit, maybe producing multiple chunks. Incidentally, really large datasets also take a few minutes to generate (people=10_000, users=10_000, reviews=50_000 -> ~3 minutes).
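To make the chunking idea concrete, here's a hedged sketch of splitting the output into fixed-size `.eql` files instead of one huge one. The `Review` insert shape and the 10_000-statement chunk size are assumptions for illustration, not what the converter actually emits:

```python
import json
from pathlib import Path

def review_to_eql(review: dict) -> str:
    # Illustrative INSERT shape; real statements would also link the
    # review to its Movie and User objects.
    body = json.dumps(review["body"])  # double-quoted, escaped literal
    return f"INSERT Review {{ body := {body}, rating := {review['rating']} }};\n"

def write_chunks(reviews, out_dir: Path, chunk_size: int = 10_000) -> None:
    """Split the generated EQL into files of chunk_size statements each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for chunk_no, start in enumerate(range(0, len(reviews), chunk_size)):
        chunk = reviews[start:start + chunk_size]
        path = out_dir / f"reviews_{chunk_no:04d}.eql"
        path.write_text("".join(review_to_eql(r) for r in chunk))
```

Chunks like these could then be fed to the server one file at a time, which keeps any single statement batch at a manageable size.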