HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Make hsds an application #34

Closed · t20100 closed this issue 4 years ago

t20100 commented 4 years ago

It would be convenient if hsds could "scale down" to a simple application, so it could be run locally without Docker.

jreadey commented 4 years ago

Are you thinking of something like this: https://github.com/HDFGroup/hsds/blob/master/docs/design/direct_access/direct_access.md?

t20100 commented 4 years ago

Yes, pretty much, except that it could be run as a standalone server rather than being integrated into h5pyd. But I guess both can be achieved at once.

jreadey commented 4 years ago

This is implemented in PR #35, merged into master.

jreadey commented 4 years ago

Hey @t20100 - I'm not that familiar with aiohttp.web.AppRunner, but it looks like it runs the server within the same process. I think it would be more efficient to run the head, SN, and DN nodes in their own processes so that the server could utilize multiple cores.
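
For reference, a minimal aiohttp.web.AppRunner setup looks something like this sketch (the port is illustrative); note that the site is served from the current event loop, i.e. everything stays inside the one calling process:

```python
import asyncio

from aiohttp import web

# Minimal AppRunner sketch (port and app are illustrative): the site is
# served from the current event loop, so however many apps are started
# this way, they all share one process.
async def main() -> None:
    app = web.Application()
    runner = web.AppRunner(app)
    await runner.setup()
    site = web.TCPSite(runner, "localhost", 5101)
    await site.start()
    await asyncio.Event().wait()  # keep serving until cancelled

asyncio.run(main())
```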

What would you think about kicking off a subprocess for each SN and DN node? (The head node could live in the parent process.)

t20100 commented 4 years ago

Hi,

Yes, as it is, everything runs in the same process. I expect it should be simple to spawn subprocesses instead (the most complicated part is probably passing the config to the subprocesses). But I usually try to avoid subprocesses when I can; what do you think subprocesses would improve?
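
Something like this hypothetical sketch, passing the config down as arguments (node_main and the config values are illustrative, not the actual hsds entry points):

```python
import multiprocessing as mp
import os

# Hypothetical sketch: one process per SN/DN node, with the head node
# staying in the parent. node_main and the config dict are illustrative.
def node_main(node_type: str, config: dict) -> None:
    # Each node would start its own event loop here; CPU-bound work in
    # one node can then no longer block the others, and each process can
    # run on its own core.
    print(f"{node_type} running in pid {os.getpid()} on port {config['port']}")

if __name__ == "__main__":
    nodes = [("sn", {"port": 5101})] + [("dn", {"port": 6101 + i}) for i in range(4)]
    procs = [mp.Process(target=node_main, args=node) for node in nodes]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```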

jreadey commented 4 years ago

Thanks for confirming...

Here's an example where running in a subprocess would speed things up:

Say you have an SN worker and 4 DN workers. The SN gets a request to read a dataset selection crossing 10 chunks. The chunk read requests get spread out over the 4 workers. Each DN worker needs to fetch its chunks from S3, which is async, so not a big deal to do in one process (I think). But any decompression will hog the CPU and block the other workers.
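
Here's a made-up, self-contained demo of the effect (not hsds code; the timings are invented): the async fetches overlap fine, but the CPU-bound "decompression" serializes on the single event loop.

```python
import asyncio
import time

# Made-up demo, not hsds code: async fetches overlap, but CPU-bound work
# serializes because it blocks the single event loop.
def fake_decompress() -> None:
    t0 = time.monotonic()
    while time.monotonic() - t0 < 0.5:  # simulate 500 ms of CPU work
        pass

async def handle_chunk(i: int) -> None:
    await asyncio.sleep(0.1)  # stand-in for the async S3 fetch
    fake_decompress()         # blocks the loop; the other chunks wait here

async def main() -> None:
    t0 = time.monotonic()
    await asyncio.gather(*(handle_chunk(i) for i in range(4)))
    print(f"4 chunks took {time.monotonic() - t0:.1f}s (decompression ran serially)")

asyncio.run(main())
```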

With Docker I observe that we can get close to 100% CPU for each of 4 containers. For HSDS apps, being able to run each node's process on its own core should be faster (YMMV).

t20100 commented 4 years ago

OK, I thought no lengthy operation was done in either the DN or SN loop, but if that is the case, then yes, multiprocessing or maybe multithreading would be better.

BTW, when I did it, I set the DN and SN count to one when started as an application: https://github.com/HDFGroup/hsds/blob/master/hsds/app.py#L67 . That would need to be changed as well.

t20100 commented 4 years ago

I gave it a try with threads and more or less managed to make it pass the tests for 1 SN and 1 DN, but not with more... Where I stopped, I guessed that each DN should have a different config/env var DN_PORT.
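
That is, something along these lines, giving each DN child its own DN_PORT through its environment (the "hsds.datanode" module path is a guess for illustration, not necessarily the real entry point):

```python
import os
import subprocess
import sys

# Hypothetical sketch: each DN child gets a distinct DN_PORT through its
# environment. The "hsds.datanode" module path is a guess, not confirmed.
for i in range(4):
    env = dict(os.environ, DN_PORT=str(6101 + i))
    subprocess.Popen([sys.executable, "-m", "hsds.datanode"], env=env)
```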

The way the config is handled has changed (it is now in a YAML file), and the application would need to be updated accordingly.

I can't spend much time on this now, but I can come back on it later.

Also, related to this but something that can be done independently: wouldn't it make sense to run the compression/decompression in a ThreadPoolExecutor (see https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor) so as to leverage multiple cores even from a single DN node?
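
For example, something like this sketch (decompress_chunk and the executor size are illustrative, and zlib stands in for whatever codec the chunk actually uses):

```python
import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: offload CPU-bound decompression to a thread pool so
# the DN event loop stays responsive. zlib.decompress releases the GIL
# while it runs, so worker threads can use other cores within one process.
_executor = ThreadPoolExecutor(max_workers=4)

async def decompress_chunk(data: bytes) -> bytes:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, zlib.decompress, data)
```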

jreadey commented 4 years ago

Maybe it would make sense to make running with subprocesses or not an additional command-line argument. I can look at this myself later today or tomorrow.

Running compression/decompression in a thread pool is an idea, but my assumption for most deployments is that the number of DN nodes would be set to the number of cores. Therefore having multiple threads per node may not help (and could actually hurt performance).

t20100 commented 4 years ago

Yes, you don't want to have more threads and processes running at the same time than there are cores.