PeterNerlich opened this issue 1 year ago
Just for completeness, I want to mention
`./tools/integreat-cms-cli duplicate_pages augsburg`
which can be used to generate a lot of pages. However, it does not cover specific edge cases which are not reflected in the original test data. So one solution could be to create more diverse baseline test data, which would hopefully result in a more realistic dataset once the duplication algorithm has been executed a few times (~1k pages for large regions is realistic).
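For reference, the duplication can also be scripted instead of calling the CLI by hand. This is only a rough sketch and assumes that `duplicate_pages` is an ordinary Django management command that takes the region slug as a positional argument (the number of rounds is arbitrary):

```python
# Rough sketch: grow the "augsburg" test region by running the existing
# duplicate_pages management command several times, e.g. from a Django shell.
from django.core.management import call_command

REGION_SLUG = "augsburg"  # region from the default test data
ROUNDS = 3                # each round duplicates the existing pages again

for _ in range(ROUNDS):
    call_command("duplicate_pages", REGION_SLUG)
```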
> however it does not cover specific edge cases which are not reflected in the original test data

One example of such an edge case would be #2530, where performance testing requires lots of different links, which cannot be created using the `duplicate_pages` tool.
@timobrembeck when using `dumpdata` on the production database, the resulting file is about 8 GB. I assume this is too much to include in this repo? :smile: But maybe it is fine to just use database dumps if we need to test for performance?
Edit: When compressed, it is about 740 MB
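In case anyone wants to reproduce the numbers: the dump can be written and compressed in one go. The following is just a sketch using Django's generic `call_command`/`dumpdata` API; the file name and the excluded apps are examples, not what was actually used:

```python
# Sketch: serialize the whole database into a gzipped JSON fixture.
import gzip

from django.core.management import call_command

with gzip.open("prod_dump.json.gz", "wt", encoding="utf-8") as fixture:
    call_command(
        "dumpdata",
        exclude=["contenttypes", "auth.permission"],  # commonly excluded to keep loaddata happy
        stdout=fixture,
    )
```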
For eliminating sensitive data from the dump – are we looking at anything not detectable in a fixed format? Because otherwise we could just detect names, telephone numbers, email addresses etc. and replace them with generic ones. Maybe even build a dictionary of which ones we previously replaced and re-use the generated one whenever we encounter the same detail again, if we feel fancy.
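To illustrate that idea (this is only a sketch, with made-up file names and deliberately naive patterns; names are left out because they are much harder to detect reliably):

```python
# Sketch of a consistent pseudonymisation pass over a JSON fixture dump.
# For a real 8 GB dump this would need to work on a stream instead of
# loading everything into memory at once.
import json
import re

# Very rough patterns; real data will need more careful ones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d ()/-]{6,}\d")

# original value -> generated stand-in, so the same detail is always
# replaced by the same generic value everywhere in the dump
seen = {}

def replace_email(match):
    original = match.group(0)
    if original not in seen:
        seen[original] = f"user{len(seen)}@example.com"
    return seen[original]

def replace_phone(match):
    original = match.group(0)
    if original not in seen:
        seen[original] = f"+49 000 {len(seen):07d}"
    return seen[original]

def anonymise(text):
    return PHONE_RE.sub(replace_phone, EMAIL_RE.sub(replace_email, text))

with open("prod_dump.json") as infile:
    objects = json.load(infile)

# Django fixtures are a list of {"model": ..., "pk": ..., "fields": {...}}
for obj in objects:
    for field, value in obj.get("fields", {}).items():
        if isinstance(value, str):
            obj["fields"][field] = anonymise(value)

with open("prod_dump.anonymised.json", "w") as outfile:
    json.dump(objects, outfile, ensure_ascii=False)
```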
> @timobrembeck when using dumpdata on the production database, the resulting file is about 8GB big. I assume this is too much to include in this repo? 😄 But maybe it is fine to just use database dumps if we need to test for performance?
> Edit: When compressed it is about 740MB
@david-venhoff I think 100 MB is the limit for individual files on GitHub. But ok, we could store it in the nextcloud to make it a bit easier to distribute this data set among us developers.
> For eliminating sensitive data from the dump – are we looking at anything not detectable in a fixed format? Because otherwise we could just detect names, telephone numbers, email addresses etc. and replace them with generic ones. Maybe even build a dictionary of which ones we previously replaced and re-use the generated one whenever we encounter the same detail again, if we feel fancy.
@PeterNerlich not quite sure, I think names could be hard to replace automatically. Maybe we're better off keeping the large test data set semi-private in our nextcloud and sharing it only amongst trustworthy developers :grin:
I think the only clean open-source way of doing it would be to return to the initial idea of creating all the content dynamically.
> But ok, we could store it in the nextcloud to make it a bit easier to distribute this data set among us developers.
I have uploaded the (unmodified) test data to https://nextcloud.tuerantuer.org/index.php/f/6466548
This issue is about research into improving the development workflow when investigating performance bottlenecks. While we could just create a copy of the live system for local experimentation (as we do with the test system every so often), it might contain personal information which we would rather not even be able to obtain as developers.
This should be separate from the existing `test_data.json` fixture, as the small dataset is highly preferable during quick iterations on a feature, except when performance with large data is the focus. The developer should be able to switch between them with relative ease.
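One low-tech way to make that switch easy would be something like the following. This is purely illustrative: the environment variable, the helper function and the path of the large dump are hypothetical; only `loaddata` and the `test_data` fixture come from the project itself.

```python
# Sketch: load either the small default fixture or a large local dump,
# depending on a (hypothetical) environment variable.
import os

from django.core.management import call_command

def load_test_data():
    # e.g. INTEGREAT_CMS_LARGE_FIXTURE=~/prod_dump.json.gz
    # (loaddata also accepts gzipped fixtures)
    large_fixture = os.environ.get("INTEGREAT_CMS_LARGE_FIXTURE")
    if large_fixture:
        call_command("loaddata", os.path.expanduser(large_fixture))
    else:
        call_command("loaddata", "test_data")
```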