CCI-MOC / hil

Hardware Isolation Layer, formerly Hardware as a Service
Apache License 2.0

API to list all nodes in all projects #1054

Open naved001 opened 5 years ago

naved001 commented 5 years ago

I think we should have an admin-only API to list all nodes in every project. This is something that we need to know.

Currently we can achieve this by running list_project_nodes on every project, but it's quite slow (depending on the number of projects we have).

@zenhack Wanted to get your thoughts before I implement this.

zenhack commented 5 years ago

So would this just be a variant on list all nodes where we also get a field telling us what project they're in?

Have you tried parallelizing the individual list_project_nodes calls? That should help some, maybe enough for a stopgap while we think about the longer term picture.

I worry about cases like this, where we have an obvious way to do something, but we're talking about adding a shortcut because composing existing operations is too slow. Minimalism was a big goal, but if we're not able to express the things we need in terms of the minimal operations, I'd rather step back for a moment and ask whether there's some structural change we can make to avoid this sort of scenario. I'd rather not get into a situation where we have calls for x, y, and z, and then also shortcuts for x-then-y, x-then-z, y-then-z, x-then-y-then-z... We should instead see if we can't find a way to provide primitives that actually compose nicely.

naved001 commented 5 years ago

So would this just be a variant on list all nodes where we also get a field telling us what project they're in?

Pretty much.
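Something like this for the response, say (shape entirely hypothetical, just to make the idea concrete):

[
  {"node": "node-01", "project": "proj-a"},
  {"node": "node-02", "project": "proj-b"}
]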

Have you tried parallelizing the individual list_project_nodes calls? That should help some, maybe enough for a stopgap while we think about the longer term picture.

Good point. I tried to parallelize my requests using Python's multiprocessing module but the results were the same (it took around 11-12 seconds to return nodes in 8 projects). Maybe I am doing it wrong. Relevant code snippet:

import requests
import os
import ast
from pprint import pprint
from multiprocessing import Pool

user = os.environ.get('HIL_USERNAME')
password = os.environ.get('HIL_PASSWORD')
url = os.environ.get('HIL_ENDPOINT') + '/v0/'

# Fetch the list of all project names.
r = requests.get(url + 'projects', auth=(user, password))
projects = ast.literal_eval(r.content)

def get_nodes(project):
    """Fetch the nodes belonging to a single project."""
    api = 'project/' + project + '/nodes'
    nodes = requests.get(url + api, auth=(user, password))
    return {project: nodes.content}

p = Pool(processes=5)
pprint(p.map(get_nodes, projects))

I'll get back to it later I guess, and for now just use the slower sequential version.
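For what it's worth, since these calls just block on HTTP, a thread-backed pool should behave the same on the client side; multiprocessing.dummy exposes an identical Pool API backed by threads. A minimal sketch reusing get_nodes from above:

# Drop-in, thread-based variant of multiprocessing.Pool -- same interface.
from multiprocessing.dummy import Pool

p = Pool(processes=5)
pprint(p.map(get_nodes, projects))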

I worry about cases like this, where we have an obvious way to do something, but we're talking about adding a shortcut because composing existing operations is too slow. Minimalism was a big goal, but if we're not able to express the things we need in terms of the minimal operations, I'd rather step back for a moment and ask whether there's some structural change we can make to avoid this sort of scenario. I'd rather not get into a situation where we have calls for x, y, and z, and then also shortcuts for x-then-y, x-then-z, y-then-z, x-then-y-then-z... We should instead see if we can't find a way to provide primitives that actually compose nicely.

Good point. Let's see if I can do it with parallel calls; otherwise we'll think about the bigger picture before adding yet another (similar) API.

zenhack commented 5 years ago

How many threads is the server using? The example Apache config in INSTALL.rst specifies two, which is going to limit concurrency.

naved001 commented 5 years ago

I specified four threads there:

WSGIDaemonProcess hil user=hil group=hil threads=4
zenhack commented 5 years ago

Another thought: the performance problem we hit with CI way back could be rearing its head when we're legitimately making a ton of API calls; password auth is going to be slow. If the parallelization doesn't help enough it might be interesting to see if that's a bottleneck.

zenhack commented 5 years ago

Ok -- see if cranking that number up makes it any better.

Quoting Naved Ansari (2018-11-16 15:29:30)

I specified four threads there WSGIDaemonProcess hil user=hil group=hil threads=4

naved001 commented 5 years ago

Ok -- see if cranking that number up makes it any better.

I increased the number there to 16, and also upped the number of vCPUs allocated to the HIL server to 16, and there was no change.

zenhack commented 5 years ago

A couple more ideas:

I noticed you have process=5 in the client -- if you up that to 16 as well does it make a difference?

naved001 commented 5 years ago

I noticed you have process=5 in the client -- if you up that to 16 as well does it make a difference?

Tried it and that didn't make a difference.

@apoorvemohan suggested that we try increasing the number of processes rather than threads, and that sped things up. Concurrent requests were served in 5 seconds as opposed to 25 seconds earlier. Why did this work and would this be a problem?

zenhack commented 5 years ago

The Python interpreter doesn't actually support multiple threads executing Python code at once; there's a single mutex that covers the entire interpreter state (the "Global Interpreter Lock", or GIL). Threads can still help if the workload is IO-bound, since blocking IO calls will typically release the lock, allowing other threads to run while blocked on IO. But if it's a CPU-bound workload, extra threads will do nothing -- if you want CPU parallelism in Python you need processes.
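A toy benchmark (nothing HIL-specific, numbers made up) shows the effect -- on a CPU-bound task, the thread pool takes about as long as a single thread, while the process pool actually parallelizes:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # same API, backed by threads

def burn(n):
    # Pure CPU work: never blocks on IO, so the GIL is never released
    # long enough for other threads to run in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    jobs = [5000000] * 8
    for label, pool in (('threads', ThreadPool(4)), ('processes', Pool(4))):
        start = time.time()
        pool.map(burn, jobs)
        pool.close()
        pool.join()
        print('%s: %.2fs' % (label, time.time() - start))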

Just having the server configured with more processes sounds perfectly reasonable.
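For reference, that's just the processes option on the same mod_wsgi directive (exact numbers to taste):

WSGIDaemonProcess hil user=hil group=hil processes=8 threads=4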

This throws a little more evidence towards the idea that the bottleneck is the auth backend -- that's all CPU-bound, whereas the queries themselves should just be hitting the database.

I'd still like to know if swapping in the null auth backend fixes it, though obviously you need to reproduce the problem somewhere other than prod first.

If so, maybe we should be looking into more performant alternatives.
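If I'm remembering the config layout right, swapping it in is just a matter of which auth extension hil.cfg loads -- something along these lines, but double-check the extension name against the repo:

[extensions]
hil.ext.auth.null =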

naved001 commented 5 years ago

Thanks for the explanation, Ian!!

I'd still like to know if swapping in the null auth backend fixes it, though obviously you need to reproduce the problem somewhere other than prod first.

Yep, I'll try that.

naved001 commented 5 years ago

So now that I know how to do multiple requests, I tried adding this as a CLI feature to get all nodes in all projects.

import json
from multiprocessing import Pool
from pprint import pprint

import click

# `client` and the `project_node` command group come from the
# surrounding CLI module.

def _get_nodes(project):
    """Helper function to use with p.map"""
    return {project: client.project.nodes_in(project)}

@project_node.command(name='listall')
@click.option('--json', 'jsonout', is_flag=True)
def project_node_list(jsonout):
    """List all nodes in all projects, grouped by project"""
    projects = client.project.list()
    p = Pool(processes=len(projects))
    results = p.map(_get_nodes, projects)
    # Honor the --json flag rather than leaving it unused.
    if jsonout:
        print(json.dumps(results))
    else:
        pprint(results)

but it returns this:

None: Max retries exceeded with url: /v0/project/<one-of-the-project-from-the-list>/nodes (Caused by None)

If I sleep for like 5 seconds after I get the list of projects, it works fine. Any insights on why this is happening? The other code in this thread, which uses requests directly, doesn't have this problem.

zenhack commented 5 years ago

Unrelated to your immediate question, but it got me looking more closely at the code so I'll bring it up:

Don't use ast.literal_eval to parse JSON; use json.loads. Python literal syntax isn't a strict superset of JSON, so certain inputs will fail that shouldn't, chiefly those involving booleans or null, which don't share the same representation. There are also inputs that will parse when they should fail.
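A short demonstration of the mismatch (payload made up for illustration):

import ast, json

payload = '{"node": "node-01", "free": true, "project": null}'  # valid JSON
print(json.loads(payload))  # {'node': 'node-01', 'free': True, 'project': None}
ast.literal_eval(payload)   # raises ValueError: true/null aren't Python literals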

Also every time I see ast.literal_eval I almost have a heart attack until I realize that literal_eval isn't actually unsafe like eval is -- but you really have to know the API to be able to look at that and be sure it's not introducing an RCE; json.loads is more obviously tame and we use it everywhere so folks know what it does.

Also, a more minor point, instead of pprint I'd suggest using something like json.dumps(..., indent=' ', sort_keys=True); it's a little gross that we're showing python literal syntax to the end user.
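For example (data made up; indent=2 rather than a string indent keeps it Python 2 friendly):

import json

nodes_by_project = {'proj-b': [], 'proj-a': ['node-01', 'node-02']}
print(json.dumps(nodes_by_project, indent=2, sort_keys=True))

which prints valid, sorted JSON instead of a Python repr.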

Sorry for the lecture. To address your actual question, I'm not sure why there would be a difference in behavior between the client lib and the manual http code above. Is there anything interesting in the server logs about why it's refusing connections? Do we have a max concurrent connections limit set somewhere that you're exceeding?

naved001 commented 5 years ago

Sorry for the lecture

I always appreciate your insights! I have learned so much from you.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/CCI-MOC/hil/issues/1054#issuecomment-455709935, or mute the thread https://github.com/notifications/unsubscribe-auth/ATLmqOt_bc2sANVf0XOUW3sl_g_GZk51ks5vEkvngaJpZM4YXYqo .