digitalfabrik / integreat-cms

Simplified content management back end for the Integreat App - a multilingual information platform for newcomers
https://digitalfabrik.github.io/integreat-cms/
Apache License 2.0

Meta: :abcd: Large synthetic dataset for performance evaluation #2516

Open PeterNerlich opened 9 months ago

PeterNerlich commented 9 months ago

This issue is about research into improving the development workflow when investigating performance bottlenecks. While we could just create a copy of the live system for local experimentation (the same way we do with the test system every so often), it might contain personal information which we would rather not even be able to access as developers.

This should be separate from the existing test_data.json fixture, as the small dataset is preferable for quick iterations on a feature, except when performance with large data is the focus. The developer should be able to switch between the two with relative ease.

timobrembeck commented 9 months ago

Just for completeness, I want to mention

```
./tools/integreat-cms-cli duplicate_pages augsburg
```

which can be used to generate a lot of pages; however, it does not cover specific edge cases that are not reflected in the original test data. So one solution could be to create a more diverse baseline test dataset, which would hopefully result in a more realistic dataset once the duplication algorithm has been executed a few times (~1k pages is realistic for large regions).
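
If we ever want to script that, a minimal sketch (assuming duplicate_pages keeps accepting the region slug as shown above and roughly multiplies the existing pages on each run):

```python
# Sketch: drive the existing management command from a Django shell until the
# page count is in a realistic range. A few passes over the default test data
# should get close to the ~1k pages mentioned above.
from django.core.management import call_command

for _ in range(4):
    call_command("duplicate_pages", "augsburg")
```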

david-venhoff commented 9 months ago

> however, it does not cover specific edge cases that are not reflected in the original test data

One example of such an edge case would be #2530, where performance testing requires lots of different links, which cannot be created using the duplicate_pages tool.
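
To illustrate, a dynamic generator would need to produce content whose links are genuinely distinct per page, e.g. something like this hypothetical helper (function name and URL scheme made up; wiring the content into actual pages is left to the existing fixtures/tooling):

```python
# Hypothetical helper: build page content containing many unique links, which
# duplicate_pages cannot produce because it only copies existing content.
def content_with_unique_links(page_index: int, links_per_page: int = 50) -> str:
    paragraphs = [
        f'<p><a href="https://example.com/page-{page_index}/link-{i}">Link {i}</a></p>'
        for i in range(links_per_page)
    ]
    return "\n".join(paragraphs)
```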

david-venhoff commented 2 months ago

@timobrembeck when running dumpdata on the production database, the resulting file is about 8 GB. I assume this is too much to include in this repo? :smile: But maybe it is fine to just use database dumps if we need to test for performance?

Edit: When compressed, it is about 740 MB.
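
For reference, a sketch of how such a dump could be produced with compression in one step (the exclude list and file name are just examples; selecting the compression from the --output extension needs Django >= 3.2):

```python
# Sketch: dump the database to a compressed fixture in a single step.
from django.core.management import call_command

call_command(
    "dumpdata",
    exclude=["contenttypes", "auth.permission"],  # derivable tables, not needed in the fixture
    natural_foreign=True,
    output="large_dataset.json.xz",  # the extension selects the compression format
)
```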

PeterNerlich commented 2 months ago

Regarding eliminating sensitive data from the dump: is there anything that is not detectable by a fixed pattern? Because otherwise we could just detect names, telephone numbers, email addresses etc. and replace them with generic ones. If we feel fancy, we could even build a dictionary of previous replacements and re-use the generated value whenever we encounter the same detail again.
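
A rough sketch of that idea (the regexes and placeholder formats are made up, and a real 8 GB dump would need a streaming parser rather than json.load):

```python
# Sketch: pseudonymise e-mail addresses and phone numbers in a dumpdata JSON
# file, keeping a dictionary of replacements so that the same detail is always
# mapped to the same generic value. Names would need a separate, harder approach.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d /()\-]{6,}\d")

seen: dict[str, str] = {}

def pseudonymise(text: str) -> str:
    def email(match: re.Match) -> str:
        return seen.setdefault(match.group(0), f"user{len(seen)}@example.com")

    def phone(match: re.Match) -> str:
        return seen.setdefault(match.group(0), f"+49 000 {len(seen):07d}")

    return PHONE_RE.sub(phone, EMAIL_RE.sub(email, text))

def scrub(value):
    """Recursively apply the replacements to every string in the dump."""
    if isinstance(value, str):
        return pseudonymise(value)
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    return value

with open("dump.json") as infile, open("dump.anonymised.json", "w") as outfile:
    json.dump(scrub(json.load(infile)), outfile, ensure_ascii=False)
```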

timobrembeck commented 2 months ago

> @timobrembeck when running dumpdata on the production database, the resulting file is about 8 GB. I assume this is too much to include in this repo? 😄 But maybe it is fine to just use database dumps if we need to test for performance?
>
> Edit: When compressed, it is about 740 MB.

@david-venhoff I think 100 MB is GitHub's limit for individual files. But ok, we could store it in the Nextcloud to make it a bit easier to distribute this dataset among us developers.

> Regarding eliminating sensitive data from the dump: is there anything that is not detectable by a fixed pattern? Because otherwise we could just detect names, telephone numbers, email addresses etc. and replace them with generic ones. If we feel fancy, we could even build a dictionary of previous replacements and re-use the generated value whenever we encounter the same detail again.

@PeterNerlich not quite sure, I think names could be hard to replace automatically. Maybe we're better off keeping the large test dataset semi-private in our Nextcloud and sharing it only amongst trustworthy developers :grin:

I think the only clean open-source way of doing this would be to return to the initial idea of creating all the content dynamically.

david-venhoff commented 2 months ago

> But ok, we could store it in the Nextcloud to make it a bit easier to distribute this dataset among us developers.

I have uploaded the (unmodified) test data to https://nextcloud.tuerantuer.org/index.php/f/6466548
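
For anyone pulling that dump into a local setup, loading it should roughly amount to the following (a sketch; the file name is assumed, and loaddata only understands compressed fixtures such as .xz on Django >= 3.2):

```python
# Sketch: load the downloaded dump into a freshly migrated local database.
from django.core.management import call_command

call_command("migrate")
call_command("loaddata", "large_dataset.json.xz")
```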