AI-Planning / planning-as-a-service

The newly improved planner (and more) in the cloud.
Apache License 2.0

Replace the existing solver API end point. #32

Open YIDING-CODER opened 2 years ago

YIDING-CODER commented 2 years ago

As for the solver, I couldn't find the stuff Cam did to make it backwards compatible...maybe that got removed at some point? We can leave the existing one up, but I'd rather not make something look official if we're just going to eventually shift to hosting it at solver.PD (which I think makes sense). This isn't meant to be a "sibling project", but rather a complete rewrite of the existing one. So, for now, just go with hard-coded IPs to keep it looking unofficial, and create an issue assigned to me to have it folded into the current setup (requires a fair bit of server setup wrangling, so best to just let Nir & I take care of it).

haz commented 2 years ago

I'm actually second-guessing the tack here...90% of the solver usage, I reckon, will come from the online editor and/or the VSCode plugin. How about this (CC @nirlipo @jan-dolejsi)...

  1. We make sure the default solver on the editor points to PaaS, and not the existing end-point. Mirror the editor's functionality.
  2. @jan-dolejsi optionally do the same with the VSCode plugin.
  3. I add an endpoint for PaaS that returns an error message saying the usage is deprecated, pointing them to the new API
  4. We swap out the solver.PD endpoints for the new PaaS

That would likely minimize the impact, no?

haz commented 2 years ago

Looks like we have the go-ahead.

@YIDING-CODER : What's the status of the new plugin?

@jan-dolejsi : Do you want any help in re-working the plugin call to the planner?

@nirlipo : Can you check to make sure the right ports are open for public access? E.g., I'd expect http://45.113.232.43/docs/lama or http://45.113.232.43/package to work

nirlipo commented 2 years ago

Yes, we are ready for deployment!

Port 80 should be open but currently we are using port 5001. The links http://45.113.232.43:5001/docs/lama and http://45.113.232.43:5001/package are working correctly.

See https://github.com/AI-Planning/planning-as-a-service#api regarding the ports we are using.

We would need to update all these files https://github.com/AI-Planning/planning-as-a-service/search?q=5001 to switch to port 80.

haz commented 2 years ago

@nirlipo Perfect, thanks! I missed that port 5001 bit of info. I was able to take it across the finish line with https://github.com/AI-Planning/planutils/pull/72 on the current server (will swap out the base URL once we re-map to solver.PD).

jan-dolejsi commented 2 years ago

@haz do you have a description somewhere of what the client interaction with the service(s) will be? Something like the description at http://solver.planning.domains/. I also see that the new interface is asynchronous. I forgot to mention that I prepared an OpenAPI YAML describing a unified sync/async planning service interface. Going forward, I do want to add the possibility to upgrade the HTTP connection to a WebSocket, so the service can feed back live updates about plans, states, etc... Ultimate diagnosability!

haz commented 2 years ago

Would be a wonderful future for it! Open socket, that is.

For now, I've just looked through this to get a sense of the API, and this is an example of how I turned it synchronous. The initial call returns a URL extension with a big code to check on the status. That code just pings it every 0.5s until the result is ready.
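The polling approach just described can be sketched like this (a hypothetical illustration: `fetch` stands in for the HTTP call against the returned /check/... URL, and the "PENDING" status value is an assumption):

```python
import time

# Minimal sketch of the synchronous wrapper described above: the initial
# call returns a URL to check on the task's status, and the client pings
# it every 0.5 s until the result is ready. `fetch` is a stand-in for the
# real HTTP request (e.g. requests.post(check_url).json()).
def poll_until_done(fetch, poll_interval=0.5, max_polls=240):
    for _ in range(max_polls):
        result = fetch()
        if result.get("status") != "PENDING":
            return result
        time.sleep(poll_interval)
    raise TimeoutError("planner did not finish within the polling budget")
```

Injecting `fetch` keeps the loop testable without touching the network; the real client would close over the check URL.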

You can see another example of the call structure here. The reason it isn't rigid and written somewhere is because packages in planutils can write their own manifest configuration to customize the endpoint however they want. Powerful stuff! If you would rather avoid opening up all the services, then this endpoint is a wrapper to it that just uses the default (fast) planner. I think it would just require domain and problem to be in the JSON with the appropriate PDDL.

jan-dolejsi commented 2 years ago

Ok, pretty cool. I managed to pull the manifest from the URL above, constructed a URL to call a selected solver, got back the URL to poll for results. I can implement the package selection in VS Code. That should not be a problem. And whether the solver endpoint returns the plan, or the result URL, I can follow the conversation and bring the plan(s?) back to the VS Code UI.

Do you have a schema or something for what the manifest may contain when it comes to other arguments and switches passable to the solver? Like, for example, this list of options (and ideally valid values for the options) that I am hardcoding for the planners that I got my hands on: https://github.com/jan-dolejsi/vscode-pddl/blob/09e2199147718b430f0de6edd8a7014f7b63d2a0/src/configuration/plannerConfigurations.ts#L361

Btw, any chance we could implement a long-polling style endpoint for fetching the plan(s)? That would mean that the client passes a header with the number of seconds it is willing to wait for the plan. If a plan is found within that timeout, the solver returns the plan. If the timeout is about to expire, the solver returns the call with a "call again" status. That way we would not need to poll the poor service, and we would get the plan(s) to the client as soon as they are found. I have an OpenAPI spec for that, along with a RESTful API for fetching and pruning the requests and associated plans.
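The client side of that long-polling scheme could look roughly like this (hypothetical sketch: the "call again" status string comes from the comment above, and `fetch(max_wait_s)` stands in for an HTTP request carrying the willing-to-wait timeout in a header):

```python
# Sketch of a long-polling client: each round asks the server to hold the
# connection up to max_wait_s seconds. The server either returns the plan
# within that window or replies with a "call again" status (assumed name).
def long_poll_plan(fetch, max_wait_s=30, max_rounds=10):
    for _ in range(max_rounds):
        body = fetch(max_wait_s)
        if body.get("status") != "call again":
            return body  # plan (or an error report) arrived within the window
    raise TimeoutError("no plan after repeated long-poll rounds")
```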

YIDING-CODER commented 2 years ago

> Looks like we have the go-ahead.
>
> @YIDING-CODER : What's the status of the new plugin?

Since we have deployed the latest PaaS to the server, I have to test the plugin. I will try to do the testing tomorrow.

I feel like the VS Code plugin could use the same API endpoints I used for the Planning Editor plugin. I will provide the API details as well.

jan-dolejsi commented 2 years ago

> I feel like the VS Code plugin could use the same API endpoints I used for the Planning Editor plugin. I will provide the API details as well.

@YIDING-CODER , yes please. Is it going to be exposed at an address like http://solver.planning.domains/package?

YIDING-CODER commented 2 years ago

> I feel like the VS Code plugin could use the same API endpoints I used for the Planning Editor plugin. I will provide the API details as well.
>
> @jan-dolejsi , yes please. Is it going to be exposed at an address like http://solver.planning.domains/package?

@jan-dolejsi Yes, you are correct. Currently, it is deployed at 45.113.232.43 and later we will point the solver.planning.domains to this IP address.

Below are the APIs that may be relevant to you:

  1. You can get all the available planners by sending a GET request to the /package endpoint.
  2. Then you can send a POST request with the relevant parameters to /package/{solver_name}/solve to add a task to PaaS. It will return a task ID.
  3. Later, you can send a POST request to /check/{task_id} with { "adaptor": "planning_editor_adaptor" } in the body. It will return the parsed result like the existing solver (solver.planning.domains).

You can find the Python example in this notebook.

You can also find the JS example here. Please have a look at lines 52, 154, and 217.
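The three steps above can be sketched end-to-end (a hypothetical illustration: `http_get`/`http_post` stand in for real HTTP calls via urllib or requests, the endpoint paths come from the comment, and the "task_id" response field name is an assumption about the response shape):

```python
# Sketch of the three-step PaaS flow: list planners, queue a task, then
# fetch the parsed result via the planning-editor adaptor.
def solve_with_paas(http_get, http_post, base, solver, domain, problem):
    # 1. GET /package lists the available planners.
    planners = http_get(f"{base}/package")
    if solver not in planners:
        raise ValueError(f"unknown solver: {solver}")

    # 2. POST /package/{solver_name}/solve queues the task and returns an
    #    ID (field name "task_id" is assumed here).
    task_id = http_post(f"{base}/package/{solver}/solve",
                        {"domain": domain, "problem": problem})["task_id"]

    # 3. POST /check/{task_id} with the adaptor returns the parsed result
    #    in the same shape as the existing solver.planning.domains service.
    return http_post(f"{base}/check/{task_id}",
                     {"adaptor": "planning_editor_adaptor"})
```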

haz commented 2 years ago

Thanks @YIDING-CODER ! Just a few more concrete notes (for now, port 5001 is required):

> Do you have a schema or something for what the manifest may contain when it comes to other arguments and switches passable to the solver?

The beautiful thing about this setup is that it's service/package defined. So it may be custom to any newly added planutils package. Though, we do have templates for common planner usage (domain/problem parameters, etc).
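Purely as a made-up illustration of that idea (these field names are NOT the real planutils schema; the manifests in the repo are authoritative), a package-defined template for common planner usage might declare its arguments like this:

```python
# Hypothetical manifest fragment: field names are illustrative only. It
# shows the idea of a package declaring its own endpoint arguments, here
# the common domain/problem parameter pattern.
example_manifest = {
    "endpoint": {
        "args": [
            {"name": "domain", "type": "file", "description": "PDDL domain"},
            {"name": "problem", "type": "file", "description": "PDDL problem"},
        ],
    },
}
```

A client like the VS Code plugin could walk such a structure to build its invocation (or a dialog) dynamically instead of hardcoding per-planner options.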

> bring the plan(s?)

Yep, plans are now possible. You can see it in the two more complicated examples above.

> Btw, any chance we could implement a long-polling style endpoint for fetching the plan(s)?

This sounds pretty awesome! Any chance you could elaborate in a new issue? Definitely feels like we should move to a model like this, and potentially also allow for partial data to be returned (plans as they're generated, search updates, etc). I can imagine it mirroring the command-line output as the problem is solved.

jan-dolejsi commented 2 years ago

I occasionally get this error: Error: This Planutils package is not configured correctly.

But when it works, it starts looking cool: (screenshot)

I wish the manifest listed the PDDL :requirements that the given planner supports, so I could filter it by the set of requirements declared in the domain/problem that is being sent to the planner.

nirlipo commented 2 years ago

Great to see it working in VS Code @jan-dolejsi ! In terms of requirements, in the long term, we may want to auto-populate the requirements in the manifests by testing planners with PDDL files designed specifically to test support of different features.

haz commented 2 years ago

This looks awesome, @jan-dolejsi!

> we may want to auto-populate the requirements in the manifests by testing planners with PDDL files designed specifically to test support of different features.

I think that's precisely the plan, yep! This could do the trick: https://github.com/nergmada/eviscerator

> Error: This Planutils package is not configured correctly.

Not sure where this one is coming from. When does it crop up?

jan-dolejsi commented 2 years ago

> Error: This Planutils package is not configured correctly.

This is happening right now. It comes back as a plain ASCII (not JSON) HTTP response body, when I send POST /package/optic/solve with a domain and a problem.

jan-dolejsi commented 2 years ago

During the last couple of days I did not manage to get all the way to a plan. Instead I get 'PENDING', and then in the next poll just seconds later I get this:

{
    "result": {
        "call": "optic domain problem >> plan",
        "output": {},
        "output_type": "log",
        "stderr": "",
        "stdout": "Request Time Out"
    },
    "status": "ok"
}

It is a simple blocksworld problem that I am sending. And I am getting the same Request Time Out from lama-first too.

haz commented 2 years ago

@nirlipo @YIDING-CODER : Is there an ongoing issue with the solver?

@jan-dolejsi : Can you share the token for one of the failed calls so they can dig into the db on it?

jan-dolejsi commented 2 years ago

> Can you share the token for one of the failed calls so they can dig into the db on it?

/check/4c0f6ce7-6f7d-41f2-b4c5-bb948d14e556?external=True

haz commented 2 years ago

Hrmz...looks like a timeout on it:

(screenshot)

Have one for optic?

YIDING-CODER commented 2 years ago

> Error: This Planutils package is not configured correctly.
>
> This is happening right now. It comes back as a plain ASCII (not JSON) HTTP response body, when I send POST /package/optic/solve with a domain and a problem.

@jan-dolejsi The above issue is related to this code. This message is triggered when an unexpected error happens. We definitely need to change it to JSON format.

Is it easy for you to create a Postman request that triggers this error message? Can you please share the Postman file, and I will investigate why it happens.

YIDING-CODER commented 2 years ago

> Hrmz...looks like a timeout on it:
>
> (screenshot)
>
> Have one for optic?

@jan-dolejsi @haz The default timeout is 20 seconds. If the solver is not finished in 20 s, it returns the timeout. PaaS uses Celery to manage the tasks. If the server is busy, the task will sit in the waiting list. I don't think it's caused by a server issue. I will run a test soon.

haz commented 2 years ago

The server (Flower interface) is extremely slow, so there may be something up...

nirlipo commented 2 years ago

I tried to ssh and couldn't, so I started digging through the logs. This is the first thing I get while trying to connect via a virtual console in the management system:

(screenshot)

It seems we are not enforcing the mem-outs correctly, or we are not killing the process. See mem-out Optic-clp messages.

nirlipo commented 2 years ago

@YIDING-CODER, I managed to ssh, and htop shows that the optic runs were not killed properly: (screenshot)

haz commented 2 years ago

Ouch! If we're going to put this in front of hundreds of students, the failsafe will need to be rock solid :P. Individual planner resource limits obviously need a check, but I reckon two further things should be in play:

  1. The ability to do a hard reset on the full thing (dock down and up) -- this should probably be at the provider level. Equivalent to heading into heroku and restarting the dyno.
  2. Having the above happen when things are becoming a hot mess. The existing solver seems so sturdy because if things get away from it, it just grinds to a halt and triggers a re-deploy. I reckon we'll need that behaviour too (otherwise it's too much hand-holding :P)

YIDING-CODER commented 2 years ago

@haz @nirlipo I think I misunderstood the meaning of --max-memory-per-child for Celery. Currently, it is set to 12 MB.

Below is the documentation for --max-memory-per-child:

> --max-memory-per-child Maximum amount of resident memory, in KiB, that may be consumed by a child process before it will be replaced by a new one. If a single task causes a child process to exceed this limit, the task will be completed and the child process will be replaced afterwards.

I think the child here refers to a worker. Currently we have one Celery worker running on the server (Nir has asked me to add more, but I haven't made the changes). If a single task (like optic domain problem) exceeds the 12 MB memory usage, the task will be completed and the worker will be replaced afterwards. Somehow, it ends up with the out-of-memory issue.

Also, I found an example confirming this behavior: "celery - will --max-memory-per-child restart workers instantly or allow them to finish the task?".
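For reference, a config-only sketch of the same limit expressed as a Celery setting rather than the CLI flag (note the unit is KiB, so the current 12 MB corresponds to 12000; and, as the linked answer explains, this mechanism only replaces the worker after the task completes, so it cannot on its own kill a runaway planner):

```python
from celery import Celery

# worker_max_memory_per_child is the config equivalent of the
# --max-memory-per-child CLI flag; its unit is KiB. A planner-sized
# budget (value here is illustrative) looks more like:
app = Celery("paas")
app.conf.worker_max_memory_per_child = 4_000_000  # KiB, roughly 4 GB
```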


haz commented 2 years ago

I'm unclear on the Celery queue stuff. But from the SO post you link to, it sounds like we need something more aggressive at stopping processes that exceed the memory. They shouldn't be allowed to complete (some planners are good at staying within their memory limits, or stopping if they can't, but not all of them).

I think the memory we allow for a single thread/worker comes down to the resources we have available. @nirlipo , what are the limits looking like?

For the record, 12 MB is way too low. Probably 1 GB or 2 GB will be the sweet spot.

nirlipo commented 2 years ago

It seems that it is freeing the worker to process the next job, but not killing the process.

The processes left running not only exceed the memory limit, but they also exceeded the time limit: 20s. It seems that if the max-memory limit is reached before the 20s, then the process is not being killed. I say this because I tested bfws over a hard sokoban instance that takes >> 20s, without reaching mem limit, and it was killed correctly.

In terms of specs, we currently have

RAM: 64 GB
VCPUs: 16
Disk: 30 GB

I'll manually do a hard reboot so we can investigate further how to avoid this issue. It is currently too slow to work with ssh.

The system is up again (see the Flower service), but now we have 12 workers and a 4 GB mem-limit.

The pending action item is:

haz commented 2 years ago

Can we wrap things in extra command-line limits?

nirlipo commented 2 years ago

Certainly! As an example of simple code that can be reused, we can use the benchmark.py file to wrap all the system calls. We just need to use the run function, and it will impose memory and time limits.

It's the code we use in LAPKT: https://github.com/LAPKT-dev/LAPKT-public/tree/master/examples/agnostic-examples, and I believe it was inspired by how FD used to limit their processes. See how the run function is used by looking at the run_planner.py file.
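In that spirit, here is a minimal sketch of such a wrapper using only the stdlib (assumptions: Unix-only `resource` limits, illustrative limit values; the `run` function in benchmark.py is the real reference). The kernel then enforces the limits on the planner process itself, even if the queue layer loses track of it:

```python
import resource
import subprocess

# Wrap a planner invocation with kernel-enforced CPU-time and memory
# limits, applied in the child process just before exec.
def run_limited(cmd, time_limit_s=20, mem_limit_mb=4096):
    def set_limits():
        # Hard CPU-time cap: the kernel kills the process past this.
        resource.setrlimit(resource.RLIMIT_CPU, (time_limit_s, time_limit_s))
        # Address-space cap, in bytes.
        mem = mem_limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (mem, mem))

    return subprocess.run(cmd, preexec_fn=set_limits,
                          capture_output=True, text=True)
```

Note that RLIMIT_AS constrains virtual address space, not resident memory, which is one plausible reason very low values break runtimes that reserve large address ranges up front.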

jan-dolejsi commented 2 years ago

Planning-as-a-service client - preview

For now the packaged planning-as-a-service URL needs to be added as an additional planner. There is only one planner that supports additional arguments. Note that Kstar does not come down in the /package list. But for such planners that support additional switches, the arguments may be sent from VS Code using a temp file that is displayed while launching the planner.

Planning-as-a-service

This is now getting released in the VS Code extension for PDDL version 2.23.2.

There are a number of issues that I can see in the planner responses and in the interface. Do you want me to submit them as separate issues in this repo, so we can decide about each of them?

nirlipo commented 2 years ago

That's great @jan-dolejsi! I'll aim to test the new version of the VS Code plugin by embedding it in an assignment around September; then we'll truly see if the server manages a high load :)

It would be great if you could add all the issues so we can include them in future development plans.

haz commented 2 years ago

Very awesome indeed! The idea with the new online editor plugin was to use the arguments in the payload to dynamically build the input form. Maybe eventually vscode could do the same? I.e., create the right dialog elements depending on the arguments and their types.

Would echo @nirlipo 's advice -- individual issues would be great to keep track of things.

jan-dolejsi commented 2 years ago

> Very awesome indeed! The idea with the new online editor plugin was to use the arguments in the payload to dynamically build the input form. Maybe eventually vscode could do the same? I.e., create the right dialog elements depending on the arguments and their types.

Yes. Eventually. A panel with the auto-generated UI is also possible.

I will submit the issues I can see. Mainly it is about the semantics of the planner input and output, plus some weird errors.

YIDING-CODER commented 2 years ago

> Certainly! As an example of simple code that can be reused, we can use the benchmark.py file to wrap all the system calls. We just need to use the run function, and it will impose memory and time limits.
>
> It's the code we use in LAPKT: https://github.com/LAPKT-dev/LAPKT-public/tree/master/examples/agnostic-examples, and I believe it was inspired by how FD used to limit their processes. See how the run function is used by looking at the run_planner.py file.

@haz @nirlipo I have tried to apply the resource limit on the subprocess call; here is the code (I found this example online), but it never succeeded. If I set the memory too low, the subprocess won't return anything. If I set it to 30 MB, I get a fatal error: failed to reserve page summary memory message. Not sure what's going on here, and I can't find any useful information on Google.

(Screenshot: Screen Shot 2022-07-28 at 1 40 00 am)

I feel like this is a little bit complex now, as it involves docker, cluster, API decorator, and subprocess.

haz commented 2 years ago

Setting it too low likely breaks in weird ways because it can't even get off the ground. What happens when you throw large problems at a 500 MB limit?

It's a problem we need to solve, since anyone could bring down the entire system by just trying to solve a few complicated problems. This is not only a theoretical risk; it happens all the time with the deployed solver.