markpyzhov opened this issue 6 years ago
It is a GraphQL request sent by a client. The client provides two inputs: token and details. The token grants access to the Python API (to fetch user details), and the details describe the request itself. For example, if you want to create a notebook, you send the details accordingly (language, ensemble, etc.). The fields that can be sent with the query depend on what you are creating (current choices: Notebook, Dataset, APIEndpoint) and are listed here:
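As a rough illustration, such a request could be assembled like this (the mutation name, field names, and input type below are hypothetical stand-ins, not the platform's actual schema):

```python
# Sketch of assembling the client-side GraphQL payload with the two
# inputs described above: `token` (Python API access) and `details`
# (what to create). Mutation and field names are hypothetical.
def build_create_request(token: str, details: dict) -> dict:
    query = """
    mutation Create($token: String!, $details: CreateDetailsInput!) {
      create(token: $token, details: $details) { id status }
    }
    """
    return {"query": query, "variables": {"token": token, "details": details}}

# Example: creating a Notebook with language-specific details.
payload = build_create_request(
    "user-token-123",
    {"type": "Notebook", "language": "python", "ensemble": False},
)
```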
Thanks Kapil, sorry for the confusing question. I understand how to make the request; I just want to figure out what "create API" means.
The API is another feature that will be added. With it, a buyer can try the final model of an algorithm by sending requests to an API, without having to perform all the training and testing before using it. A buyer can also combine multiple algorithms into a pipeline using this feature. The "create API" call is the way to create such a pipeline; the edit and other calls perform their corresponding functions.
Each DS (or a group of DS) creates a model. How does a customer interested in that model try it before buying (they aren't interested in running code, only in running the trained model on some datapoints)?
A customer needs a full pipeline for their data. They find multiple algorithms covering all stages of the output they need. For example, they find 2 algos each for cleaning up data, reducing the number of parameters, and performing classification. Maybe they need more. How do they create a pipeline and test a few datapoints on it? They wouldn't want to keep contacting the original DS for each algo (this is a very small pipeline, and it already has 8 DS involved).
The customer doesn't care about the language used, but our platform does. How do we let a customer try any algo without worrying about the language in which it was written, i.e., how do we make everything language-agnostic?
How do we help customers deploy the final pipeline they chose, so that they have to do the least work and we provide everything out of the box?
Create a REST API for each model and a proxy system (like GraphQL) to support them.
API support for various languages:
REST APIs may be difficult to create by hand: I've found several tools that automate the process of API development.
* TensorFlow Serving (tf-serving) can be connected to GraphQL.
* The Spark framework provides API support.
* The R language has out-of-the-box API development support.
* C++ can be connected using gRPC or a REST framework.
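A minimal sketch of the proxy idea: every model, whatever language it was written in, sits behind a predict endpoint, and a single proxy routes requests by model name. Here the endpoints are stand-in Python callables rather than real REST/gRPC services, and the model names are invented:

```python
# Language-agnostic proxy sketch: the registry maps model names to
# predict endpoints; a real deployment would map names to REST/gRPC
# URLs instead of local callables.
MODEL_ENDPOINTS = {}  # model name -> callable taking a datapoint

def register_model(name, predict_fn):
    MODEL_ENDPOINTS[name] = predict_fn

def proxy_predict(name, datapoint):
    """Route a request to the endpoint registered for `name`."""
    if name not in MODEL_ENDPOINTS:
        raise KeyError(f"no endpoint registered for model {name!r}")
    return MODEL_ENDPOINTS[name](datapoint)

# Hypothetical models written in different languages, all reachable
# the same way through the proxy.
register_model("r-classifier", lambda x: "spam" if x["score"] > 0.5 else "ham")
register_model("tf-regressor", lambda x: 2.0 * x["score"])

print(proxy_predict("r-classifier", {"score": 0.9}))  # prints "spam"
```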
Will this lead to everything slowing down, and is it scalable: It shouldn't slow things down; with GraphQL, each API is contacted separately, without any dependencies. As far as I have thought it through, it is completely scalable.
Time taken for this solution: A basic solution can be created within 2 weeks. I developed an MVP over the weekend, and it works.
Basic architecture info: It follows a serverless architecture.
This is not the final design. I am iterating over various options with Xing. I will update when I finalize it & add it into the docs.
The API is another feature that will be added. With it, a buyer can try the final model of an algorithm by sending requests to an API, without having to perform all the training and testing before using it. A buyer can also combine multiple algorithms into a pipeline using this feature. The "create API" call is the way to create such a pipeline; the edit and other calls perform their corresponding functions.
Does API mean chain of training/testing (or something before/after)?
Solution:
I don't understand why we need this if we have already configured the Jupyter Notebook Server, which allows us to trigger any algorithm in any language (we just need to add the installation of a new language or library to the Dockerfile).
UPD: The "template" approach will be eliminated soon. After that we will receive the algorithms inside *.ipynb files, and the semantic assignment will be prepended to them as a code cell, written in the language used inside the ipynb.
No, the API is not for chaining the training/testing of algorithms. The API is used to directly invoke the models that are created once training/testing is complete.
I don't understand why we need this if we have already configured the Jupyter Notebook Server, which allows us to trigger any algorithm in any language (we just need to add the installation of a new language or library to the Dockerfile).
Consider a DS who has written a neural-network-based algorithm. Such algorithms have long training/testing times. A buyer is not interested in watching an algorithm train; they just want to send a datapoint and see whether the algorithm meets their requirements.
Similarly, for all algorithms that use large amounts of data, training/testing time can be huge. Triggering Jupyter Notebooks means running that code, and we don't want to run the code every time someone wants to try something.
We will also be able to chain multiple models, so a buyer can create and test a pipeline of algorithms covering multiple stages and pay per query, instead of having to buy each algorithm or ask the author for access rights and wait for it to become available.
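A minimal sketch of chaining trained models into a pipeline with per-query metering, as described above. The stage functions and the price are illustrative assumptions, not the platform's real API:

```python
# Pipeline sketch: pass each datapoint through an ordered chain of
# trained-model predict functions, counting queries for billing.
class Pipeline:
    def __init__(self, stages, price_per_query=0.01):
        self.stages = stages          # ordered list of predict callables
        self.price_per_query = price_per_query
        self.queries = 0

    def predict(self, datapoint):
        """Run the datapoint through every stage and meter the query."""
        self.queries += 1
        for stage in self.stages:
            datapoint = stage(datapoint)
        return datapoint

    def bill(self):
        return self.queries * self.price_per_query

# Hypothetical stages: clean the data, reduce parameters, classify.
clean = lambda x: [v for v in x if v is not None]
reduce_params = lambda x: x[:2]
classify = lambda x: "positive" if sum(x) > 0 else "negative"

pipe = Pipeline([clean, reduce_params, classify])
print(pipe.predict([1, None, -3, 5]))  # prints "negative"
```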
Well, you are talking about a trained and saved model (for example, Rdata for R), right?
Yes. Example API : http://api.cortical.io
Then we already have the process.
Please explain the solution.
Here are two issues:
Not all steps are completed, but I don't see a problem in doing them. Since we have the "template" approach we need slightly different logic, but it mostly follows the flow.
I don't want to trigger training. And there are other beneficial things there too. I will check with you regarding security when working with APIs, though, as I don't know much about that.
I don't want to trigger training. And there are other beneficial things there too.
Yes, but before we can get a trained model we need to run training. My example flow probably confuses you because it describes the training flow. For testing, or for using the model to get future results, the process is quite similar, except for these points:
Before the point:
2.iii.a. Django sends a command to the Jupyter Notebook Server and creates everything it needs in GraphDB.
There will be a step in which an administrator or customer uploads the trained model into GCS via the Django API (we already have the logic implemented).
Instead of:
2.iii.b. Jupyter Notebook Server downloads ipynb and dataset
It will be "Jupyter Notebook Server downloads the ipynb, dataset, and trained model."
Before the:
2.iii.d. Jupyter Notebook Server appends code with saving behavior.
It will be "Jupyter Notebook Server prepends code that loads the trained model."
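The prepend step above could be sketched as follows, operating directly on the notebook's JSON structure (the nbformat schema). The model path and the loading snippet are illustrative assumptions:

```python
# Sketch of prepending a "load trained model" code cell to an ipynb,
# as in the modified flow above. Cells follow the nbformat JSON shape;
# the pickle path is a hypothetical example.
LOAD_CELL_SOURCE = [
    "import pickle\n",
    "with open('trained_model.pkl', 'rb') as f:\n",
    "    model = pickle.load(f)\n",
]

def prepend_load_cell(notebook: dict) -> dict:
    """Insert a code cell that loads the trained model before all others."""
    cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": LOAD_CELL_SOURCE,
    }
    notebook["cells"].insert(0, cell)
    return notebook

# Minimal notebook with a single cell that already uses `model`.
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": ["print(model.predict(x))\n"]},
    ],
}
nb = prepend_load_cell(nb)
```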
Here are the points that the proposal addresses, and I am not sure the current solution reflects them.
How does it do pay per query?
If I need a pipeline of multiple algos (one for cleaning data, one for reducing the parameter list, one for training, and more intermediate algos for other tasks), how do I build it? And how do I do this without having to ask the DS to run the code?
Can a DS upload different versions of the same algorithm (with different data/more or less training/different kinds of datasets)? It would be kinda confusing to have to run multiple trainings before being able to use the models.
If I don't want to train/test an algorithm but just want to use the model, is going through that channel worth it? Shouldn't there be a simpler and faster channel that lets people simply use the final models created by the DS (like a final app instead of a beta version in testing)?
If a user wants to use the api in production as they have achieved the required result after talking with DS to modify their algo and then create a final model (or they just liked the model to begin with), can the current solution be scaled for their needs? Essentially, if say, I have an android app that needs an ML algorithm. Would I want to use the current solution for accessing the ML algorithm or an api similar to cortical?
The API is for creating a final endpoint, once all training/testing iterations are complete and the model is ready for use. If the answers to some or all of these questions point toward the API, I think it's useful to have. Also, I think the current solution is great, but the flow for a DS would be just a bit different when using both the current solution and the API:
We do not want to duplicate what has already been implemented with notebooks but a simple trained model test through an API makes sense. I do not want an elaborate development effort for the API. Mark/Kapil, let's make sure you both are in good communication about what needs to be developed and what has already been developed. @Teskuroi @daemonslayer
- How does it do pay per query?
I think that is what the C-API is meant for.
- If I need a pipeline of multiple algos (if there is one for cleaning data, one for reducing parameter list, one for training and more intermediate algos for other tasks), how do I make it? How do I do this without having to ask to run the code by DS.
Presently we have chains in the D-API. It is possible to let an Admin choose the components of a chain and execute it (the components just need to be compatible with each other).
- Can a DS upload different versions of the same algorithm (with different data/more or less training/different kinds of datasets)? It would be kinda confusing to have to run multiple trainings before being able to use the models.
For the Jupyter Notebook Server it doesn't matter whether these are different versions or entirely different algorithms. All it needs is the trained model (if one exists), the algorithm, and everything required as input.
- If I don't want to train/test algorithm but just want to use the model, is going through that channel worth it? Shouldn't there be a simpler and faster channel that allows just usage of final models created by the DS for people to use (like a final app instead of a beta version in testing).
There are only three phases:
A saved model is just a data file (in some languages, such as R); it cannot do anything on its own without the algorithm. When someone has a trained model, they can run the original algorithm with that model to predict on new inputs (of the same structure). In other words, the "test" and "future" phases are very similar.
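A minimal Python illustration of this point, using pickle as a stand-in for language-specific formats like R's Rdata: the saved file is inert until the algorithm's own code loads it and calls predict. The toy model below is an invented example:

```python
# A saved model is just a data file; the algorithm's code must load it
# before it can predict anything.
import os
import pickle
import tempfile

class ThresholdModel:
    """Toy 'algorithm': the trained state is just a threshold value."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, x):
        return "positive" if x > self.threshold else "negative"

# Training phase: produce the model and save it to a file.
model = ThresholdModel(threshold=0.5)
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Test / future phase: load the file and predict on a new input. The
# file alone is useless without the ThresholdModel class definition.
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(0.9))  # prints "positive"
```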
- If a user wants to use the api in production as they have achieved the required result after talking with DS to modify their algo and then create a final model (or they just liked the model to begin with), can the current solution be scaled for their needs? Essentially, if say, I have an android app that needs an ML algorithm. Would I want to use the current solution for accessing the ML algorithm or an api similar to cortical?
Regarding the current architecture, there are two options. When a customer has bought the saved model, they need to:
The docs explain how to use CRUD for the API; what does "API" mean here?
https://github.com/MallCloud/contracts-api/wiki/api-access#create-api