MrPigss / BetterJSONStorage

Better JSONStorage for tinyDB
https://pypi.org/project/BetterJSONStorage/
MIT License

large CPU usage #9

Open mnealer opened 4 months ago

mnealer commented 4 months ago

I have just noticed that using BetterJSONStorage with access_mode="r+" causes very high CPU usage, which I think is happening in the threading. I have a database being created or opened (it's blank) in the init of a class. The class initializes, but there continues to be massive CPU usage after that. This does not happen with the default storage, and it doesn't happen if it's read-only.

MrPigss commented 4 months ago

The current implementation has a separate thread for writing data. This thread checks over and over again (polling) whether there is anything to write, resulting in a lot of CPU usage without any actual work being done by that thread.

I could:

  1. add a time.sleep(0) after every check. This forces Python to switch threads without setting an actual timeout. It might lower CPU usage, but I have no idea by how much.

  2. add an actual small sleep, time.sleep(0.01). This would probably make a bigger difference, but then your data will be written to disk at most once every 0.01 s.
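The two options differ only in the sleep at the bottom of the loop; a minimal sketch of the polling writer (class and attribute names are hypothetical, not BetterJSONStorage's actual code):

```python
import threading
import time

class PollingWriter:
    """Hypothetical sketch of a polling writer thread."""

    def __init__(self, path):
        self.path = path
        self._data = None              # bytes waiting to be written, if any
        self._lock = threading.Lock()
        self._running = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def write(self, data: bytes):
        with self._lock:
            self._data = data          # latest payload wins

    def _loop(self):
        while self._running:
            with self._lock:
                data, self._data = self._data, None
            if data is not None:
                with open(self.path, "wb") as f:
                    f.write(data)
            # Option 1: time.sleep(0)    -> yields to other threads, still busy-spins
            # Option 2: time.sleep(0.01) -> idles cheaply, caps writes at ~100/s
            time.sleep(0.01)

    def close(self):
        self._running = False
        self._thread.join()
```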

These would be the easiest/quickest solutions, but I think using something like threading.Event might be cleaner and more efficient.
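A minimal sketch of the threading.Event idea (hypothetical names, not the actual BetterJSONStorage implementation): the writer thread blocks in wait() at effectively zero CPU until a write arrives, instead of spinning.

```python
import threading

class EventWriter:
    """Hypothetical sketch: wake the writer thread only when there is work."""

    def __init__(self, path):
        self.path = path
        self._data = None
        self._lock = threading.Lock()
        self._has_data = threading.Event()
        self._running = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def write(self, data: bytes):
        with self._lock:
            self._data = data
        self._has_data.set()           # wake the writer; no CPU burned while idle

    def _loop(self):
        while self._running:
            self._has_data.wait()      # blocks until set(), unlike polling
            self._has_data.clear()
            with self._lock:
                data, self._data = self._data, None
            if data is not None:
                with open(self.path, "wb") as f:
                    f.write(data)

    def close(self):
        self._running = False
        self._has_data.set()           # unblock wait() so the thread can exit
        self._thread.join()
```

Pending data is still flushed on close(), because the thread processes the last payload before it sees the stop flag.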

I might take a look at how libs like aiofiles do it, since they also just use a threaded wrapper around standard python functions.

Feel free to open a PR.

mnealer commented 4 months ago

I was wondering if swapping from threading to async coroutines would be better, and then sending requests in and out via a Queue. I like the idea of shifting the file access to a separate process, as I have an idea running around where there could be multiple TinyDB files acting as a single database.

Reopened #9.

MrPigss commented 4 months ago

I was wondering if swapping from threading to Async coroutines would be better

Once you start using async functions in one place, everything becomes async. It's definitely interesting and I might look into it in the near future, but it would become a separate Storage option. Sometimes you have an entirely synchronous script, and then threads (or processes, more on that later) are the only option.

then send requests in and out via a Queue.

This is definitely something I will play with later. In theory, using threading queues would be more efficient.
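In the same spirit as the Event approach, a blocking queue.Queue also removes the polling and preserves write ordering; a hedged sketch with illustrative names:

```python
import queue
import threading

class QueueWriter:
    """Hypothetical sketch: Queue.get() blocks without polling, and writes
    are processed strictly in the order they were submitted."""

    _STOP = object()                   # sentinel used to shut the thread down

    def __init__(self, path):
        self.path = path
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def write(self, data: bytes):
        self._queue.put(data)

    def _loop(self):
        while True:
            item = self._queue.get()   # sleeps until something arrives
            if item is self._STOP:
                break
            with open(self.path, "wb") as f:
                f.write(item)

    def close(self):
        self._queue.put(self._STOP)    # processed after any pending writes
        self._thread.join()
```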

I like the idea of shifting the file access to a separate process.

This is actually interesting and would need some testing. Since processes have separate GILs, they run completely independently. The downside is that they share no memory (threads do share memory), so you need to marshal every piece of data that has to cross the process boundary. But since the data we want to write is already serialised into bytes, there would only be a small overhead of copying the data into the other process's memory.

Which task happens in which process should be tested as well (serialisation and compression). I assume serialisation in the main process, since we would need to marshal anyway, and compression in the other process, since it's quite CPU-intensive, but we'll see about that when the time comes.
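A rough sketch of that split, using stdlib json and zlib as stand-ins for whatever serialiser and compressor the library actually uses (all names here are illustrative):

```python
import json
import multiprocessing as mp
import zlib

_STOP = b"__stop__"                    # hypothetical shutdown sentinel

def _writer(path, q):
    """Runs in the child process: compression (CPU-heavy) and disk I/O
    happen here, under a separate GIL."""
    while True:
        data = q.get()
        if data == _STOP:
            break
        with open(path, "wb") as f:
            f.write(zlib.compress(data))   # stand-in for the real compressor

def write_db(path, obj):
    """Serialise in the main process (the bytes must cross the process
    boundary anyway), then hand the payload to the writer process."""
    q = mp.Queue()
    p = mp.Process(target=_writer, args=(path, q))
    p.start()
    q.put(json.dumps(obj).encode())        # stand-in serialiser
    q.put(_STOP)
    p.join()                               # file is fully written after this
```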

I have an idea running around where there could be multiple TinyDB files acting as a single database.

The current implementation of TinyDB only requires read, write, and close functions for custom Storages. TinyDB runs its queries in memory; Storage classes have no context about what is requested or written. So there is no partitioning of data over multiple files, since you'd have no idea which data is needed: you always have to read and write everything you have. Depending on the use case you have in mind, there might not be any benefit.
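To illustrate how little context a Storage gets, here is a minimal stand-in mirroring the tinydb.storages.Storage interface (written dependency-free here; in a real project you would subclass tinydb.storages.Storage and pass it as TinyDB's storage argument):

```python
class MemoryOnlyStorage:
    """Sketch of the three-method Storage contract TinyDB expects.
    Note that read() and write() always move the *entire* database dict,
    so the storage never knows which table or document a query touched."""

    def __init__(self):
        self.memory = None

    def read(self):
        return self.memory      # whole database as a dict, or None if empty

    def write(self, data):
        self.memory = data      # whole database again; no partial writes

    def close(self):
        pass                    # nothing to release in this sketch
```

Because write() only ever sees the full dict, any multi-file partitioning scheme would have to decide on its own how to split that dict, with no hints from TinyDB about access patterns.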

That said, it might be possible if you write custom middlewares, use hooks and overrides, or subclass TinyDB and Table, but it's been a while since I went over the source code, so I'm not really sure how to handle this case.