3dgeo-heidelberg / pytreedb

Python package providing a file and object-based database to store tree objects.

Performance tests with large dataset [REPLACEMENT ISSUE] #29

Closed lwiniwar closed 2 years ago

lwiniwar commented 3 years ago

The original issue

Id: 8
Title: Performance tests with large dataset

could not be created. This is a dummy issue replacing the original one. In case the GitLab repository still exists, visit the following link to see the original issue:

https://gitlab.gistools.geog.uni-heidelberg.de/giscience/3DGeo/pytreedb/-/issues/8

lwiniwar commented 2 years ago

In GitLab by @bhoefle-3dgeo on Dec 7, 2021, 18:16

Any progress here?

lwiniwar commented 2 years ago

In GitLab by @annachiu7 on Dec 7, 2021, 22:01

Sorry, not yet. Still focusing on fixing some frontend issues.

lwiniwar commented 2 years ago

In GitLab by @annachiu7 on Dec 8, 2021, 12:29

I just tested the runtimes with 10,000 trees/files (about 10 times the existing data). The results are not looking good.

Here is a comparison:

```
Importing current data needs 1.5427641868591309 seconds
Get stats 0.04687619209289551 seconds
Query species 0.047281742095947266 seconds
More complicated query (returning 9 trees) 0.10981130599975586 seconds
===============
Importing 10 times of data needs 15.761247873306274 seconds
Get stats 0.15451645851135254 seconds
Query species 0.5060396194458008 seconds
More complicated query (returning 10000 trees) 23.933866500854492 seconds
```
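For context, timings like the ones above can be collected with a small wall-clock harness. This is only a sketch: `load_trees` and `query_species` below are hypothetical stand-ins for pytreedb's actual import and query helpers, not its real API.

```python
import time

def timed(label, fn, *args):
    """Run fn(*args), print the elapsed wall-clock time, and return the result."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.4f} seconds")
    return result

# Hypothetical stand-ins for pytreedb operations
def load_trees(n):
    """Simulate importing n tree records."""
    return [{"id": i, "species": "Fagus" if i % 2 else "Abies"} for i in range(n)]

def query_species(trees, species):
    """Linear scan over all records, as a naive query helper would do."""
    return [t for t in trees if t["species"] == species]

trees = timed("Importing data", load_trees, 10_000)
hits = timed("Query species", query_species, trees, "Fagus")
```

Because the naive query is a linear scan, its runtime grows proportionally with the number of records, which matches the roughly tenfold slowdown of the simple query above.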

It's understandable that importing takes more time as the data size increases. But because of how the querying helper functions are written, retrieval takes far too long on big datasets: a simple query takes about 10 times longer, and the complicated one over 100 times longer.

We should indeed consider using pandas.
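The pandas idea above could look like the following minimal sketch. The field names (`species`, `height`) and sample records are illustrative, not pytreedb's actual schema; the point is that vectorized boolean indexing replaces a per-record Python loop.

```python
import pandas as pd

# Hypothetical tree records, as they might be loaded from the JSON files
records = [
    {"id": 1, "species": "Fagus sylvatica", "height": 21.5},
    {"id": 2, "species": "Abies alba", "height": 30.1},
    {"id": 3, "species": "Fagus sylvatica", "height": 18.2},
]

df = pd.DataFrame(records)

# Simple query: all trees of one species (vectorized, no Python-level loop)
beech = df[df["species"] == "Fagus sylvatica"]

# More complicated query: combine conditions with element-wise boolean operators
tall_beech = df[(df["species"] == "Fagus sylvatica") & (df["height"] > 20)]
```

Loading everything into one DataFrame up front would also amortize the file parsing cost across queries instead of paying it per lookup.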

lwiniwar commented 2 years ago

In GitLab by @bhoefle-3dgeo on Feb 18, 2022, 11:11

Solved by switching to a MongoDB backend. Performance now depends mainly on the chosen MongoDB deployment / cloud performance.
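For reference, with a MongoDB backend the two query types above become filter documents that the server evaluates (and can index) itself. The field names here are illustrative placeholders, not pytreedb's actual schema; the pymongo calls are shown only in comments since they need a running server.

```python
# Hypothetical MongoDB-style query documents (field names are illustrative)
simple_query = {"properties.species": "Fagus sylvatica"}

complex_query = {
    "properties.species": "Fagus sylvatica",
    "properties.height": {"$gt": 20.0},  # $gt: MongoDB's greater-than operator
}

# With pymongo these documents would be passed to a collection, e.g.:
#   client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder URI
#   trees = client["pytreedb"]["trees"]                         # placeholder names
#   trees.create_index("properties.species")  # keeps lookups fast as data grows
#   results = list(trees.find(complex_query))
```

Filtering server-side (ideally against an index) is what decouples query time from the Python process, so performance is bounded by the MongoDB deployment rather than a client-side scan.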