HDI-Project / ATM

Auto Tune Models - A multi-tenant, multi-data system for automated machine learning (model selection and tuning).
https://hdi-project.github.io/ATM/
MIT License

REST API #82

Closed: jobliz closed this issue 5 years ago

jobliz commented 6 years ago

A REST API would be useful to access ATM from any other language and device through the web, and also to illustrate how the code is structured and might be extended.

From the project's readme I get that the internal APIs are going to change, and that it consequently might be a bit early to develop a REST API, but I wanted to see what was possible with the current code. The API currently serves the endpoints shown in the examples below:

No modifications were made outside of the rest_api_server.py file except for requirements.txt, which adds flask and simplejson to the project dependencies.

TODOs / caveats:

Example rest_api_server.py usage:

After following the readme's installation instructions and running python scripts/rest_api_server.py in a separate shell under the virtualenv:

curl localhost:5000/enter_data -F file=@/path/file.csv

It should return:

{"success": true}

To see the created dataset:

curl localhost:5000/datasets/1 | jq

{
  "class_column": "class",
  "description": null,
  "size_kb": 6,
  "test_path": null,
  "k_classes": 3,
  "majority": 0.333333333,
  "d_features": 6,
  "train_path": "FULLPATH/file.csv",
  "id": 1,
  "n_examples": 150,
  "name": "file"
}

To see the created datarun:

curl localhost:5000/dataruns/1 | jq

{
  "status": "pending",
  "start_time": null,
  "description": "uniform__uniform",
  "r_minimum": 2,
  "metric": "f1",
  "budget": 100,
  "selector": "uniform",
  "priority": 1,
  "score_target": "cv_judgment_metric",
  "deadline": null,
  "budget_type": "classifier",
  "id": 1,
  "tuner": "uniform",
  "dataset_id": 1,
  "gridding": 0,
  "k_window": 3,
  "end_time": null
}

To run the worker.py script once:

curl localhost:5000/simple_worker | jq

After a while, it returns:

{
  "stderr": "",
  "stdout": "huge stdout string with worker.py's output"
}
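One plausible shape for that endpoint, sketched with Flask and subprocess; the exact worker.py invocation and arguments are assumptions:

import subprocess
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/simple_worker')
def simple_worker():
    # Run one pass of the worker script and mirror the
    # {"stderr": ..., "stdout": ...} response shown above.
    proc = subprocess.run(['python', 'scripts/worker.py'],
                          capture_output=True, text=True)
    return jsonify({'stdout': proc.stdout, 'stderr': proc.stderr})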
bcyphers commented 6 years ago

First of all, thanks so much for working on this! Really fantastic stuff. A REST API is something that's always been in the back of my mind, but I wouldn't have had time to get to it.

I don't see any problem with using UUIDs. And since we don't have AWS working yet anyway, that's obviously not an issue.

I'll ask @kveerama, but I think this is something we want in the main repo. If you create a pull request, I'll try to review it this weekend and have some more feedback. It would also be very helpful if you could write a couple of tests for this.

Also, FYI: I was a full-time maintainer of this project for a while, but I've taken another job and will no longer be able to devote much time. I'll keep up with the repo, but updates may come much more sporadically than they did the past couple of months. I believe the lab is trying to hire another maintainer, but until that happens, feedback may be slow.

Thanks again for contributing!

jobliz commented 6 years ago

Hi! I recently looked at my code again and realized that, of all things, I had forgotten to open the PR here. Just did it. I'll look into adding some tests too.

RogerTangos commented 5 years ago

Hey there - I'm picking up this issue. I think @jobliz's design was pretty good, and I'm going to propose a few more endpoints and decouple the API from the database.

RogerTangos commented 5 years ago

Here are some more thoughts. I'd really appreciate feedback and other insights into this.

A. I started off using @jobliz's existing REST API. However, since I would very much like to force database access to go through the classes, this would require adding to_json methods to the following classes: Database.Dataset, Database.Datarun, Database.Hyperpartition, Database.Classifier, and Worker.

B. This would also mean making some edits to the worker for starting/stopping/inspecting.

C. This is a good option, and I can pursue it if that's what people want.

D. However, after mapping out the relationships between the four main tables, I'm beginning to think that using the GraphQL specification may be a better bet. It would certainly make for simpler frontend client development.

E. This would allow people to make queries like the following:

{
  datarun(id: 1) {
    id
    name
    dataset {
      name
    }
    classifiers {
      id
      cv_judgment_metric
    }
  }
}

F. I'm not very familiar with GraphQL, but a co-worker likes it, and it seems like it's sticking around and may replace REST. Given that I'm going to need to do work in the Database subclasses anyway, it doesn't seem unreasonable.

G. There are some other options, also detailed below. If you have thoughts on these, please share them. I'd like to start working on the API this week.

  1. GraphQL implementation (detailed above)
  2. REST implementation without direct database access (detailed above)
  3. REST implementation with direct database access
    • This would be roughly what's already on the rest-api branch, with just a few tweaks for starting/stopping workers
  4. Another option I haven't thought of

Please let me know what you think.
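To make option 1 concrete, here is a minimal sketch using the graphene library. The type and field names mirror the query above; the hard-coded resolver data is purely illustrative and not ATM's actual schema:

import graphene

class Dataset(graphene.ObjectType):
    name = graphene.String()

class Classifier(graphene.ObjectType):
    id = graphene.Int()
    cv_judgment_metric = graphene.Float()

class Datarun(graphene.ObjectType):
    id = graphene.Int()
    name = graphene.String()
    dataset = graphene.Field(Dataset)
    classifiers = graphene.List(Classifier)

class Query(graphene.ObjectType):
    datarun = graphene.Field(Datarun, id=graphene.Int(required=True))

    def resolve_datarun(root, info, id):
        # A real resolver would go through ATM's Database class;
        # hard-coded values keep the sketch self-contained.
        return Datarun(id=id, name='example',
                       dataset=Dataset(name='file'),
                       classifiers=[Classifier(id=1, cv_judgment_metric=0.9)])

schema = graphene.Schema(query=Query)
result = schema.execute('{ datarun(id: 1) { name dataset { name } } }')
print(result.data)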


If I did do a REST API, here are the endpoints I think would make sense

GET summarize datasets table
POST to upload a file
/datasets

GET information about a particular dataset
/datasets/<int:id>

GET info about all the times we ran ATM
/dataruns

GET info about one particular time we ran ATM
/dataruns/<int:id>

GET all classifiers that were trained in dataruns
/classifiers

GET a single classifier that was trained in a datarun
/classifiers/<int:id>

GET dataruns on a given dataset
POST to create a new datarun on a dataset
/dataruns/dataset/<int:dataset_id>

GET classifiers for a given datarun
POST to train a new classifier on the datarun (possibly)
/classifiers/datarun/<int:datarun_id>

GET information on all workers
POST to create a worker?
/worker

GET information on a single worker
POST to create a worker? (I'm not sure this makes sense here.)
PUT to pause/start the worker
DELETE to destroy?
/worker/<int:id>

GET a hyperpartition (note: values are currently base64 encoded, and the API should document this)
/hyperpartitions

GET a single hyperpartition
/hyperpartitions/<int:id>

GET the hyperpartitions used in a datarun
/hyperpartitions/datarun/<int:datarun_id>

GET the classifiers using a hyperpartition
/classifiers/hyperpartition/<int:hyperpartition_id>

Changed from @jobliz's PR: /enter_data, /simple_worker
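
For option 2 above (REST without direct database access), the to_json methods could look something like this SQLAlchemy sketch; the Dataset columns are abbreviated and the example data is hypothetical:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Dataset(Base):
    __tablename__ = 'datasets'
    id = Column(Integer, primary_key=True)
    name = Column(String(100))
    train_path = Column(String(200))

    def to_json(self):
        # Serialize every mapped column, so endpoints return plain dicts
        # instead of exposing SQLAlchemy objects directly.
        return {c.name: getattr(self, c.name) for c in self.__table__.columns}

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Dataset(name='file', train_path='file.csv'))
session.commit()
print(session.query(Dataset).first().to_json())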

RogerTangos commented 5 years ago

I'm posting discussion from our Slack here so that it's available to public contributors.

from @csala to @RogerTangos: Given that you are the only one who seems to have any experience with GraphQL, and that you said you are not very familiar with it, I would discard GraphQL.

I think that REST will completely satisfy our needs, and it's something that most people are already familiar with.

RogerTangos commented 5 years ago

from @csala

Apart from that, a couple of comments about the API design:

In general, when building a REST API I follow the following pattern for each possible "object" (e.g. dataruns, datasets, etc.)

  1. Getting all the objects is done on the base path: GET /object_name
  2. Getting multiple objects, but not all, is done on the base path, filtering with the query string: GET /object_name?condition1=value1&condition2=value2
  3. Getting a single object by ID is done on the base path + id: GET /object_name/<object-id>
  4. PUT (which creates a NEW object) is done on the base path: PUT /object_name + Object Body
  5. POST (which updates an object) is done on the base path + id: POST /object_name/<object-id> + Object Body
  6. DELETE is done on the base path + id: DELETE /object_name/<object-id>

So, for example, I would implement this as follows:

  1. Getting all the dataruns: GET /dataruns
  2. Getting all the dataruns for a given dataset: GET /dataruns?dataset_id=<dataset-id>
  3. Getting a single datarun, by id: GET /dataruns/<datarun-id>
  4. Creating a new datarun: PUT /dataruns + JSON indicating the datarun details

Apart from that, I would also implement DELETE in some cases, where it makes sense:

  1. If adding "static" things like a dataset is possible, I think that deleting them should also be. So, I would also implement DELETE /datasets/<dataset-id>.
  2. Similarly, should we also allow deleting other things like dataruns as a "cleanup measure"?

Another example of the schema explained above: to "train a new classifier on a datarun", instead of posting to /classifiers/datarun/<int:datarun_id>, I would use:

POST /classifiers
{
    "datarun_id": "datarun_id",
    ...
}
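
A minimal Flask sketch of this pattern, with an in-memory dict standing in for ATM's database; this is illustrative only, not the rest-api branch code:

from flask import Flask, jsonify, request, abort

app = Flask(__name__)
dataruns = {1: {'id': 1, 'dataset_id': 1, 'status': 'pending'}}

@app.route('/dataruns', methods=['GET'])
def list_dataruns():
    # Points 1 and 2: list everything, optionally filtered by the query
    # string, e.g. GET /dataruns?dataset_id=1
    results = list(dataruns.values())
    for key, value in request.args.items():
        results = [r for r in results if str(r.get(key)) == value]
    return jsonify(results)

@app.route('/dataruns/<int:datarun_id>', methods=['GET'])
def get_datarun(datarun_id):
    # Point 3: fetch a single object by id.
    if datarun_id not in dataruns:
        abort(404)
    return jsonify(dataruns[datarun_id])

@app.route('/dataruns', methods=['PUT'])
def create_datarun():
    # Point 4: PUT on the base path creates a new object.
    body = request.get_json(force=True)
    new_id = max(dataruns, default=0) + 1
    dataruns[new_id] = dict(body, id=new_id)
    return jsonify(dataruns[new_id]), 201

@app.route('/dataruns/<int:datarun_id>', methods=['DELETE'])
def delete_datarun(datarun_id):
    # Point 6: delete by id.
    dataruns.pop(datarun_id, None)
    return jsonify({'success': True})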
RogerTangos commented 5 years ago

from @RogerTangos

Good critique. Thanks @csala!

I think we should allow deleting dataruns. TBH, I haven't spent as much time on the methods as I'd like to (I was trying to get that out before leaving work). Tomorrow [Wednesday, October 4] is a holiday in Germany. I may do some more work then, but will push out a new draft by Thursday.

@micah - responding to the ATM-Vis stuff first: I've taken a look at their API, and although it does cover more use cases, I don't think it's nearly as well thought out as what @jobliz put together in the PR.

RogerTangos commented 5 years ago

from @RogerTangos: The ATM-Viz group does have some other shim code, which is good to look at, but which I think should be integrated directly into the classes themselves. For GraphQL, my reasoning is that it's generally a much more expressive syntax. Instead of having to hit multiple endpoints, the client can specify exactly what they want and (in most cases) receive exactly that. This lends itself to better stats on what clients actually want and to faster development.

I went through their getting-started guide. Once under the hood, it seems surprisingly simple. There are methods that would need to be defined on the classes, which need to have some relationships to each other. Those relationships are already explicit thanks to SQLAlchemy, so my impression is that it's an equal or slightly larger amount of work for a significantly easier-to-use client experience.

RogerTangos commented 5 years ago

from @RogerTangos to @csala: I see your reasoning on going with something familiar. (REST has been around forever and is also quite familiar to me.) Will you have time to look at the GraphQL docs sometime soon? I'd be interested to hear your impression after taking a look.

Because of your feedback, I’m leaning towards just building a REST api, but I’d like to make that decision based on more than familiarity.

(I'm really unhappy with the fickleness of frontend development, and actually dragged my feet on looking at GraphQL because of that. After having a look, though, I'm realizing it's a specification, not an implementation, and I think it's likely to stick around.)

RogerTangos commented 5 years ago

from @micahjsmith

albert [@RogerTangos] I looked into GraphQL based on your recommendation, and it does look really great. If I understand correctly, the main use case is the endpoint where someone wants to get all classifiers associated with a datarun. Then, we would specify an allowed query using GraphQL on the server that returns, for a given datarun, any info about the dataset and classifiers that the client requires. The REST alternative would be to first make a GET /dataruns/1 and then make a series of GETs to /classifiers/1, etc.?


</copypasta>

RogerTangos commented 5 years ago

Hey @micahjsmith , thanks for taking a look. That's correct. It would also allow the client to extend that query to give metrics about the classifier.

The alternative is something similar to the endpoints that @csala and I have been discussing above. (I'm editing those today.)

@csala have you had a chance to look at GraphQL?
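
To make the trade-off concrete, here is a hedged sketch using the requests library; the /graphql endpoint and the query shape are assumptions:

import requests

BASE = 'http://localhost:5000'

# GraphQL: one round trip; the client names exactly the fields it wants,
# including nested classifier metrics.
query = '{ datarun(id: 1) { description classifiers { id cv_judgment_metric } } }'
graphql_resp = requests.post(BASE + '/graphql', json={'query': query})

# REST: fetch the datarun first, then its classifiers in a second request.
datarun = requests.get(BASE + '/dataruns/1').json()
classifiers = requests.get(BASE + '/classifiers',
                           params={'datarun_id': 1}).json()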

micahjsmith commented 5 years ago

I'll let @csala weigh in, as it seems like the two of you have more expertise than I do on this type of question. But if it seems that both approaches will work, and you are the one putting in the time to implement this feature, I'd think you could use your best judgment.

RogerTangos commented 5 years ago

New endpoints. I'm working on implementing these:

GET summarize datasets table
/datasets

GET information about a particular dataset
POST to create a new dataset
DELETE to remove a dataset
PUT to replace a dataset with the current one
/datasets?id=<int:id>&name=<str:name>&description=<str:description>

note: Further down, this is abbreviated as follows:

/datasets?<int:id>&<str:name>&<str:description>

GET info about all the times we ran ATM
/dataruns

GET info about one or more dataruns
DELETE to remove a datarun
/dataruns?<int:id>&<int:dataset_id>&<str:description>&<str:tuner>&<str:status>

PATCH to stop a datarun
/dataruns?<int:id>&<int:dataset_id>&<str:description>&<str:tuner>&<str:status>/status

GET all classifiers that were trained in dataruns
/classifiers

GET one or more classifiers (returns link(s) to download the model(s))
DELETE one or more classifiers
/classifiers?<int:id>&<int:datarun_id>&<int:hyperpartition_id>

GET information on all workers
POST to create a worker
/worker

GET information about a single worker, including its status
DELETE a worker
PUT to replace a worker
PATCH to update a worker
/worker?<int:id>&<int:dataset_id>

GET /hyperpartitions?<int:id>&<int:datarun_id>&<str:method>&<str:status>

PATCH to disable/enable for further search
/hyperpartitions?<int:id>&<int:datarun_id>&<str:method>&<str:status>
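
One possible shape for the "PATCH to stop a datarun" route above, sketched in Flask. The /dataruns/status path and the JSON body are assumptions; every datarun matching the query-string filters gets its status updated:

from flask import Flask, jsonify, request

app = Flask(__name__)
dataruns = [{'id': 1, 'dataset_id': 1, 'tuner': 'uniform', 'status': 'running'}]

@app.route('/dataruns/status', methods=['PATCH'])
def patch_datarun_status():
    new_status = request.get_json(force=True).get('status', 'stopped')
    matched = []
    for run in dataruns:
        # A run matches when every supplied query parameter equals its
        # field, e.g. PATCH /dataruns/status?tuner=uniform
        if all(str(run.get(k)) == v for k, v in request.args.items()):
            run['status'] = new_status
            matched.append(run)
    return jsonify(matched)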

micahjsmith commented 5 years ago

A couple questions:

GET info about one or more dataruns
DELETE to remove a datarun
/dataruns?<int:id>&<int:dataset_id>&<str:description>&<str:tuner>&<str:status>

PATCH to stop a datarun
/dataruns?<int:id>&<int:dataset_id>&<str:description>&<str:tuner>&<str:status>/status

Why not just provide the datarun id? Why is tuner a query parameter? Would we stop dataruns with a tuner matching some string?

PUT to update or replace a worker
PATCH to update/modify a worker

What is the difference here?

PATCH to activate or deactivate the worker
/worker?<int:id>&<int:dataset_id>/status

I thought we discussed that this is not currently possible or relevant.

RogerTangos commented 5 years ago

Thank you for taking a look!

GET info about one or more dataruns
DELETE to remove a datarun
/dataruns?<int:id>&<int:dataset_id>&<str:description>&<str:tuner>&<str:status>

PATCH to stop a datarun
/dataruns?<int:id>&<int:dataset_id>&<str:description>&<str:tuner>&<str:status>/status

Why not just provide the datarun id? Why is tuner a query parameter? Would we stop dataruns with a tuner matching some string?

I'm trying to provide a maximal ability for users to search for dataruns here. I'd also like to include <str:selector> and other parameters from the dataruns table. I should have noted that users could stop or start dataruns based on the query string.

Do you think it'd be better to limit searching to just the ID?

PATCH to activate or deactivate the worker
/worker?<int:id>&<int:dataset_id>/status

I thought we discussed that this is not currently possible or relevant.

Good catch. We had. I just forgot to remove those lines when copy-pasting. Edited above.

micahjsmith commented 5 years ago

Thanks for your explanations.

Do you think it'd be better to limit searching to just the ID?

Based on your explanations, I don't think it makes sense to limit searching. Though I wonder if it makes sense to limit deletion; DELETE /dataruns?tuner=gp could be overly destructive.

RogerTangos commented 5 years ago

Note: There are requests from Kalyan that:

  1. We be able to provide API keys
  2. We make datasets storable off the main server (probably in s3 buckets) with supplied links.

Both seem doable. I'll make a separate issue for no. 2 if it's not already possible.
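
A minimal sketch of point 2 using boto3; the 'atm-datasets' bucket name is hypothetical, and the dataset row would store the resulting link instead of a local train_path:

import boto3

s3 = boto3.client('s3')

# Upload a local CSV off the main server.
s3.upload_file('file.csv', 'atm-datasets', 'file.csv')

# Generate a time-limited download link to hand back to clients.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'atm-datasets', 'Key': 'file.csv'},
    ExpiresIn=3600)
print(url)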

takabayashi commented 5 years ago

Hey Guys,

I am the architect and main contributor of [marvin](https://github.com/apache/incubator-marvin). Marvin is a platform that makes it easy to create, deploy, and manage models. We already have a lot of things that deliver APIs and address other architectural concerns. This example shows how I integrated another AutoML library. The project is very new as an Apache project.

My idea is to start helping you guys, because I believe AutoML is the future.

This is a paper that, more or less, shows the idea that supports Marvin.

Summarising the benefits of this integration:

What do you guys think about this?

csala commented 5 years ago

The first read-only version of the REST API is already in master, so I'm closing this issue now so that future enhancements can be added as new issues.