Closed maxdml closed 6 years ago
What is the ML module? I don't think the machine learning code will provide any APIs; it will be consuming them. From what it sounds like, the backend group will need to figure out what the input to some function like `run` will be. I think we need to write something that will feed off the task queue and run a function, perhaps loading a class/module specific to an algorithm.
I think the input to this function could be the full classifier object. It would expect a JSON serializable object or map in response.
results = svg_classifier.run(task.data)
or even
svg_classifier.run(task)
Where task could even have functions like task.progress(80, "Fold 4/5 completed")
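The handoff sketched above could look something like the following. To be clear, every name here (`Task`, `run`, the return shape) is a placeholder for discussion, not an agreed API:

```python
# Hypothetical sketch of the worker-side handoff. Task and run are
# placeholders, not a settled interface.

class Task:
    """Wraps a queued classifier job and reports progress back to the queue."""

    def __init__(self, data):
        self.data = data  # JSON-serializable classifier input

    def progress(self, percent, message):
        # In a real worker this would call back to the task-service;
        # printing keeps the sketch runnable.
        print(f"[{percent}%] {message}")


def run(task):
    """Placeholder algorithm entry point: consumes a task, returns a JSON-able dict."""
    task.progress(80, "Fold 4/5 completed")
    return {"status": "done", "n_inputs": len(task.data)}


result = run(Task({"classifier": "elastic-net-logistic-regression"}))
```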
The ML module is any library used to implement machine learning algorithms. Actually, if I am not mistaken, this library is pandas (please confirm/correct :).
It will receive requests from the Asynchronous Task Queue, as you mentioned. However, from the (outdated) schema in the meta cognoma repository, and in the schema @cgreene drew last night (if someone could upload it somewhere, that would be awesome), we see the need to decouple the ML component from the django backend.
Even if we just forward the full classifier object, each component has specific APIs, which explains why I opened the issue.
@maxdml : I think the machine learning group is primarily using sklearn, though others may use something else. I think that we should define what gets provided to these methods, and each one should get passed the same information. Maybe some of the implementers can let us know what type of information they use. We should also define what we want the algorithm to report at the conclusion of the run.
I think this is more of: what is the schema of the input/output?
If the input is a classifier object, then the only thing not defined yet is the `algorithm_parameters`. The parameters will vary from algorithm to algorithm. On the algorithm object, I put a `parameters` field to store the parameter schema for the algorithm as JSON schema. JSON schema allows us to validate the parameters in both JS and python. There are also angular libs to automatically generate forms and display views based on JSON schema. Swagger, Google, and other API spec formats also use JSON schema.
We should create algorithm parameter schemas for each algorithm. Creating them and storing them as JSON schema inside the algorithms repo would be ideal. We could also have algorithm implementors write out the schema as tables in markdown if JSON schema is too complex.
Here is an example:
{
    "title": "SVG classifier parameters",
    "type": "object",
    "properties": {
        "threshold_a": {
            "type": "number",
            "title": "Threshold A",
            "description": "Threshold A controls yada yada",
            "minimum": 0.0,
            "maximum": 2.0
        },
        "category_example": {
            "type": "string",
            "title": "Category Example",
            "description": "Category Example yada yada",
            "enum": ["blue", "green"]
        }
    }
}
Here's a good guide https://spacetelescope.github.io/understanding-json-schema/
Creating JSON schemas for the output could be useful as well.
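The Python-side validation mentioned above can be done with the third-party `jsonschema` package (one common choice, not something the thread has settled on). A minimal sketch against the example schema:

```python
# Validate user-supplied parameters against the example schema above,
# using the third-party `jsonschema` package (pip install jsonschema).
import jsonschema

schema = {
    "title": "SVG classifier parameters",
    "type": "object",
    "properties": {
        "threshold_a": {"type": "number", "minimum": 0.0, "maximum": 2.0},
        "category_example": {"type": "string", "enum": ["blue", "green"]},
    },
}

# Valid parameters pass silently.
jsonschema.validate({"threshold_a": 1.5, "category_example": "blue"}, schema)

# Out-of-range parameters raise jsonschema.ValidationError.
try:
    jsonschema.validate({"threshold_a": 3.0}, schema)
except jsonschema.ValidationError as err:
    print("rejected:", err.message)
```

The same schema document could be shipped to the frontend and validated there with a JS library, which is the portability argument made above.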
Right, it is a question of "what is the schema of the input/output". The reason why I am mentioning an API is that, from my understanding, we are trying to setup a modular infrastructure where the machine learning code is decoupled from the django backend, and the link between them is the ATQ.
Ah ok. So that is a good question. I think we'll need some sort of client daemon to run the machine learning code in. The client will need to pull classifier tasks off the queue via RESTful HTTP calls to the task-service.
I don't know what the handoff code will look like. Right now, these are all scripts. We could keep them as scripts; the client could pipe in the JSON classifier and expect JSON output when done.
They could be wrapped into python modules, which I think is more ideal. The public functions that the client hits should be standard across all the algorithm modules. I think this is what @maxdml is interested in. The API could be fairly simple. Like I suggested above, just a `run` function.
We may also need to write a wrapper lib for the Cognoma API or pass a module as an argument. This is if they need to access data from the primary database directly.
I would like to see a `task.progress`-like function to report progress, since these will be long running. The progress function could also touch the task in the task-service so that it knows the worker is alive.
Right, I think the public functions should be consistent across all ML modules. Once we are fixed on a first API proposal, I would like to implement a simple container setup, which would look like the following:
The goal is to start thinking "deployment". Having independent containers will greatly simplify automated testing (with Jenkins for example), and help providing mock containers to each team (e.g provide a mock ML container to the backend team, and a mock ATQ container to the ML team).
@maxdml Sure. I don't know if the daemon code will live in this repo or another. I think it could be in another repo or be a directory within this one. From a deployment standpoint the daemon will probably be a python script running inside of some sort of process manager like pm2. We could do something like make a job JSON blob or file path an optional argument to the daemon script to run jobs manually in the terminal for dev/testing.
@dhimmel Are the scripts like https://github.com/cognoma/machine-learning/blob/master/algorithms/scripts/SGDClassifier-master.py the final deliverable from the ML team? These are written for ipython notebook and write output and graphs to the shell. Can we ask the ML team to write these, or a version of them, as python modules meant to be run as application tasks? The backend team could provide some scaffolding. I think it could be boiled down to just a `run` function.
Quick question while we're on the topic - does it make sense to use something like Celery for this? It does integrate with django and then we could require each method to provide a task definition with standard parameters & behavior expectations.
@cgreene It might, but that would mean an architectural change. I would have to stop working on the task service and we may need to collapse all the backend code to a single repo.
I do like the idea of storing the parameters in a portable data structure like JSON schema more. Then we would be able to:
- Validate the algorithm parameters in JS and server-side in the REST API call
- Generate input fields automatically https://github.com/json-schema-form/angular-schema-form
- Generate the parameters section of the job summary, and any other part that needs to display them.
- Might also be able to generate docs.
The hand off from the ML implementer may not need to be JSON. YAML would work, or even something as simple as a markdown table that gets manually translated, or that a script converts.
Gotcha - as long as we have considered it and had use cases that it doesn't solve. :)
Ok. I'm proposing this.
Each algorithm would need a module that exposes:
- `definition` - a dictionary defining the algorithm with name, title, description, and parameters schema.
- `run` - a function that takes a task instance. This will contain the classifier input data. We could also pass the classifier object and/or algorithm parameters separately if that makes it easier. I'd also like the task instance to contain that `progress` function I mentioned before. It may not be necessary, but could really help communicate to the user what's going on, other than that it's been running for X amount of time. Though, run histories from the task table may eventually be able to be used to predict how long it will take.
A script would exist in this repo that could generate the `algorithms` table using the info from each module.
The daemon for ML tasks would live in a directory in this repo. It could be run as a daemon connected to the task-service, or as a script in the terminal that does one-off jobs manually for development purposes.
We could create an example module and documentation to assist in creating a module.
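An example module under this proposal might look like the sketch below. All names and the parameter schema are illustrative, and the `progress` call pattern follows the (not yet agreed) task interface discussed above:

```python
# Illustrative per-algorithm module: exposes `definition` and `run`.
# Nothing here is a settled API; names are placeholders.

definition = {
    "name": "example-classifier",
    "title": "Example Classifier",
    "description": "Illustrative algorithm module.",
    "parameters": {  # JSON schema for this algorithm's parameters
        "type": "object",
        "properties": {"threshold_a": {"type": "number"}},
    },
}


def run(task):
    """Train on the classifier input carried by `task`, reporting progress."""
    for fold in range(1, 6):
        # ... fit/evaluate one cross-validation fold here ...
        task.progress(fold * 20, f"Fold {fold}/5 completed")
    return {"status": "done"}


class _DemoTask:
    """Stand-in for the task-service task object, for local testing."""

    def progress(self, percent, message):
        print(f"[{percent}%] {message}")


result = run(_DemoTask())
```

A registry script could then import each module and collect the `definition` dicts to build the algorithms table.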
@awm33, Currently, the notebooks we're creating are solely for prototyping (although we may actually want to provide users their custom notebook). The scripts exported from the notebooks are solely for tracking, not for execution.
The machine learning team will write a Python module with a `run` function. The `run` function will have a parameter for a JSON input (either the raw JSON text, the loaded object, or the loaded object split into `**kwargs`). The input JSON will contain a sample subset array, a gene subset array, a mutation status array, and an algorithm string. Here's an example payload:
{
    "classifier": "elastic-net-logistic-regression",
    "expression_subset": [1421, 5203, 5818, 9875, 10675, 10919, 23262],
    "mutation_status": [0, 1, 1, 0, 0],
    "sample_id": [
        "TCGA-22-4593-01",
        "TCGA-2G-AALW-01",
        "TCGA-3G-AB0O-01",
        "TCGA-3N-A9WD-06",
        "TCGA-49-4487-01"
    ]
}
For `expression_subset`, `"all"` is also a valid value, corresponding to when the user does not want to subset the genes whose expression is used as features.
I made a few design choices above which I'll explicitly state now.
- `ridge-logistic-regression`, `lasso-logistic-regression`, and `elastic-net-logistic-regression` may all use the `SGDClassifier` in `sklearn`, but will appear as three algorithm options. The machine learning team will thus create an algorithms table/list that the frontend/django-cognoma can consume.
- The frontend/django-cognoma will not be able to pass algorithm hyperparameters.
- The frontend/django-cognoma will be required to compute the outcome array of mutation status.
- In the future, we may want to add more options to the payload, such as `transformation` or `variable_selection` arguments.
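A `run` function consuming this payload could start along these lines. Only the JSON handling reflects the thread; the dispatch and return shape are placeholders:

```python
# Sketch of run() over the example payload above. The classifier dispatch
# is deliberately left as a placeholder; only the JSON parsing and the
# "all" convention come from the discussion.
import json


def run(payload_json):
    payload = json.loads(payload_json)
    expression_subset = payload["expression_subset"]
    if expression_subset == "all":
        expression_subset = None  # meaning: use all genes as features
    # ... dispatch on payload["classifier"], build X from expression data
    # and y from payload["mutation_status"], indexed by payload["sample_id"] ...
    return {
        "classifier": payload["classifier"],
        "n_genes": None if expression_subset is None else len(expression_subset),
        "n_samples": len(payload["sample_id"]),
    }


example = """
{
  "classifier": "elastic-net-logistic-regression",
  "expression_subset": [1421, 5203, 5818, 9875, 10675, 10919, 23262],
  "mutation_status": [0, 1, 1, 0, 0],
  "sample_id": ["TCGA-22-4593-01", "TCGA-2G-AALW-01", "TCGA-3G-AB0O-01",
                "TCGA-3N-A9WD-06", "TCGA-49-4487-01"]
}
"""
summary = run(example)
```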
CCing @RenasonceGent who was interested in these topics at the last meetup.
I was waiting on more examples to be completed before moving on. I added a gist here with the code I have so far. I broke up the example script into functions. It should make it easier to change things later. I can see having a default set of hyperparameters to start, but isn't that something that we should make optional for the user later? I wouldn't expect the optimal set of hyperparameters for one set of data to even be near optimal for another.
@dhimmel I am a bit of a layperson when it comes to the actual meaning of the ML computations. Do you know where I can learn more about the "outcome array of mutation status"?
The reason I want to understand this is that I am concerned about business layer related computation being implemented in the frontend. If some outcome of the ML algorithm has to be further refined before being delivered to a user, shouldn't it be kept in the ML module?
Sorry if the question looks dumb.
Do you know where I can learn more about the "outcome array of mutation status"?
@maxdml, All this means is whether a sample is mutated (`0` coded) or normal (`1` coded) for a user-specified gene (or set of genes). The goal of the machine learning model is to learn how to classify samples as either mutated or not-mutated. Outcome is maybe a confusing term here -- it refers to what the model is trying to predict, and must be available before the model can be trained.
Just so we're clear, the goal is to use gene expression (also referred to as features/X/predictors) to predict mutation status (also referred to as outcome/y/status). If you have any general questions about machine learning, perhaps ask them at https://github.com/cognoma/machine-learning/issues/7.
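To make the X/y framing concrete, here is a toy sketch with random data. The numbers are meaningless and only illustrate the shapes involved; `SGDClassifier` is the sklearn estimator mentioned elsewhere in the thread:

```python
# Toy illustration: gene expression features (X) predicting mutation
# status (y). The data is random, purely to show shapes and the
# fit/predict flow, not a real analysis.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 7)               # 100 samples x 7 gene-expression features
y = rng.randint(0, 2, size=100)    # mutation status (outcome) per sample

model = SGDClassifier(random_state=0).fit(X, y)
predictions = model.predict(X)     # predicted mutation status labels
```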
@dhimmel
The frontend/django-cognoma will not be able to pass algorithm hyperparameters.
Is that because of a limitation of the frontend/backend, or does it just not seem necessary for the user?
Looking at the classifier object in https://github.com/cognoma/django-cognoma/blob/master/doc/api.md:
- `genes` maps to `expression_subset`. Is `expression_subset` a better name? I'd rather use one field name consistently. Couldn't you infer that an empty `[]` equals "all"? It seems like bad data modeling practice to mix them; we could add a boolean field to mean "all" if you need something explicit.
- `tissues` is only used by the result viewer?
- `mutation_status`: This is the outcome array generated from the sparse matrix? Not the full one?
- `sample_id`: Is this just a list of samples that connect to the outcome array? Not all (7.5k) sample ids?
The machine learning team will thus create an algorithms table/list that the frontend/django-cognoma can consume.
How do they want to maintain this? Would the table be a SQL table? We could just write a script to generate/update it if we add a couple of fields to the modules, like a user-friendly name/title and description.
The frontend/django-cognoma will be required to compute the outcome array of mutation status.
Can you or someone from the Greene Lab create an issue describing how to calculate / create the outcome array?
Another thing is logging. Will the ML group be logging using the python logging module? We may want to send the logs to disk and/or to something like logstash.
If the module code could periodically hit a `progress` function, that would be great. It could report progress, and most importantly we could know if it's stuck; otherwise it will have to wait for a long timeout.
Is that because of a limitation of the frontend/backend, or that just doesn't seem necessary for the user?
Choosing hyperparameter values is a great hindrance. If the machine learning team does our job, we can hopefully not subject the user to this nuisance. In the future if some users want direct access to setting hyperparameters, perhaps we can expand functionality.
Regarding the schema, which I hadn't actually seen yet (nice work) -- I think we're on the same page, I am just envisioning the simplest possible system.
- `tissues` is used to identify the relevant `sample_id` set, so we will need either the `sample_id` array or the `tissues` array.
- `genes` is the same thing as `expression_subset`. I wanted to be more specific than just "genes" since mutations also are in genes. `expression_genes` is a possibility.
- `mutation_status` -- it looks like the current api docs are missing a field for the outcome (y) for the machine learning classifier.
How do they want to maintain this?
For the list of algorithms, we'll export either a TSV or JSON file with the `algorithm`, `name`, and `description` fields.
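The export described here is small enough to sketch. The entries below are illustrative placeholders, but the three fields match the ones named above:

```python
# Sketch of exporting the algorithm list as both JSON and TSV with
# algorithm, name, and description fields. The entries are illustrative.
import csv
import io
import json

algorithms = [
    {
        "algorithm": "elastic-net-logistic-regression",
        "name": "Elastic Net Logistic Regression",
        "description": "Logistic regression with elastic net regularization.",
    },
]

json_blob = json.dumps(algorithms, indent=2)

tsv_buffer = io.StringIO()
writer = csv.DictWriter(tsv_buffer,
                        fieldnames=["algorithm", "name", "description"],
                        delimiter="\t")
writer.writeheader()
writer.writerows(algorithms)
tsv_blob = tsv_buffer.getvalue()
```

Either blob would be written to a file in the repo, so the frontend/django-cognoma can consume it without a SQL table.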
We can log and hit the progress function. @awm33 -- let's deal with these two issues later.
@dhimmel Cool.
- I think I'd like to keep it as `tissues` on the classifier model, since we will want to store that as the user selection. It sounds like the sample id list is based on something like `select sample_id from samples where tissue in (tissues)`. That could be done during task queueing, inside the job outside the ml.run, or inside the ml.run function, perhaps using some shared function. I'm leaning towards doing it in the task.
- `expression_genes` or `gene_expression_set`?
- `mutation_status`: yep, that's not there, I didn't know what it would look like or how to construct/calculate it.
I think I'd like to keep it as `tissues` on the classifier model, since we will want to store that as the user selection.
Sounds good. In the future, however, users may want to select samples based on other criteria than tissue. We can always change this then.
`expression_genes` or `gene_expression_set`?
Don't care, but if we use "set", then should we use `tissue_set` as well?
`mutation_status`: yep, that's not there, I didn't know what it would look like or how to construct/calculate it.
If we don't store the `sample_id` array, we will need to store a formula for how to compute mutation status. Will start a separate issue for this.
@dhimmel
Sounds good. In the future, however, users may want to select samples based on other criteria than tissue. We can always change this then.
Ok, it sounded to me like tissues is a filter on the samples. If there are more filter criteria, I think we should still store it for the state of the UI, and for knowing how the user generated the list.
If we don't store the sample_id array, we will need to store a formula for how to compute mutation status.
I think we need more clarification on where `mutation_status` and `sample_ids` are coming from. Maybe that can be in your issue. The model does have mutations, which connects genes to samples on a many-to-many. Are `mutation_status` and `sample_ids` just entries for each selected gene's mutation status on each sample? So if you choose 10 genes, you would have 10 * number-of-samples entries? If that's true, then is number-of-samples the full 7.5k, or the number of samples matching the tissue filter?
For the minimum viable product (i.e. the first release of the machine-learning package), I'm thinking we can have a simplified input to machine-learning (ML). The main ML function would consume a JSON file with a `mutation_status` and `sample_id` array. Do people prefer `mutation_status` or `mutation_statuses` / `sample_id` or `sample_ids`? Also, we could encode the information as a `sample_id_to_mutation_status` object.
{
    "mutation_status": [0, 1, 1, 0, 0],
    "sample_id": [
        "TCGA-22-4593-01",
        "TCGA-2G-AALW-01",
        "TCGA-3G-AB0O-01",
        "TCGA-3N-A9WD-06",
        "TCGA-49-4487-01"
    ]
}
Based on this design choice, the ML module never gets passed information on which sample filters were applied (such as disease type, gender, or age). While this information should be stored, the ML portion of the project won't actually need to know this information.
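On the ML side, one way to load this MVP payload is into a pandas Series keyed by sample id. Pandas is mentioned earlier in the thread, but this particular mapping is just a sketch:

```python
# Load the MVP payload into a pandas Series: sample_id -> mutation_status.
# Illustrative only; the payload shape comes from the example above.
import json

import pandas as pd

payload = json.loads("""
{
  "mutation_status": [0, 1, 1, 0, 0],
  "sample_id": ["TCGA-22-4593-01", "TCGA-2G-AALW-01", "TCGA-3G-AB0O-01",
                "TCGA-3N-A9WD-06", "TCGA-49-4487-01"]
}
""")

y = pd.Series(payload["mutation_status"],
              index=payload["sample_id"],
              name="mutation_status")
```

The Series can then be aligned against the expression matrix by index before training.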
@awm33 I know I didn't answer your questions, just let me know which ones are still outstanding.
@dhimmel Looking at your original example from above
{
    "classifier": "elastic-net-logistic-regression",
    "expression_subset": [1421, 5203, 5818, 9875, 10675, 10919, 23262],
    "mutation_status": [0, 1, 1, 0, 0],
    "sample_id": [
        "TCGA-22-4593-01",
        "TCGA-2G-AALW-01",
        "TCGA-3G-AB0O-01",
        "TCGA-3N-A9WD-06",
        "TCGA-49-4487-01"
    ]
}
Is this data related/tabular? So the above could be written as:
[
    [1421, 0, "TCGA-22-4593-01"],
    [5203, 1, "TCGA-2G-AALW-01"],
    [5818, 1, "TCGA-3G-AB0O-01"],
    [9875, 0, "TCGA-3N-A9WD-06"],
    [10675, 0, "TCGA-49-4487-01"]
]
or
[
    {
        "expression_subset": 1421,
        "mutation_status": 0,
        "sample_id": "TCGA-22-4593-01"
    },
    {
        "expression_subset": 5203,
        "mutation_status": 1,
        "sample_id": "TCGA-2G-AALW-01"
    },
    {
        "expression_subset": 5818,
        "mutation_status": 1,
        "sample_id": "TCGA-3G-AB0O-01"
    },
    {
        "expression_subset": 9875,
        "mutation_status": 0,
        "sample_id": "TCGA-3N-A9WD-06"
    },
    {
        "expression_subset": 10675,
        "mutation_status": 0,
        "sample_id": "TCGA-49-4487-01"
    }
]
I'm just assuming since a sample is mutated for a specific gene, that's what we are trying to pass, a row per sample.
`sample_id` and `mutation_status` are two columns of a single table. `expression_subset` is of a different nature (and will not be included in the MVP implementation). Therefore, we could use any method for representing the sample/observation table (containing `sample_id` and `mutation_status` and possibly more columns in the future). Let me know what you think is best.
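The two layouts being compared are just different orientations of the same table, and pandas can convert between them; this is a sketch of the equivalence, not a decision on the wire format:

```python
# The column-oriented payload and the row-per-sample payload are two
# orientations of one table; pandas converts between them.
import pandas as pd

column_oriented = {
    "mutation_status": [0, 1, 1, 0, 0],
    "sample_id": ["TCGA-22-4593-01", "TCGA-2G-AALW-01", "TCGA-3G-AB0O-01",
                  "TCGA-3N-A9WD-06", "TCGA-49-4487-01"],
}

df = pd.DataFrame(column_oriented)
row_oriented = df.to_dict(orient="records")
```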
Looking at https://github.com/cognoma/machine-learning/pull/51
Would it make sense to have the worker/ task runner code in this repo or in a separate one?
I was thinking of exposing it as a CLI like:
python ./task-runner.py run-task ./some/path/task.json
for local machine testing/development, and:
python ./task-runner.py worker
which would start a worker and be run as a daemon in prod.
The worker code would `from cognoml.analysis import classify` and run the classify fn in the task process, passing the specific task data.
Would it make sense to have the worker/ task runner code in this repo or in a separate one?
My preference is a separate repo. This repo already contains multiple things. Also, I think people may be interested in using the `cognoml` package without the task runner.
The task runner environment can install the `cognoml` package using `pip install`, either from specific commits on GitHub, or from PyPI if we upload `cognoml` there.
@dhimmel That sounds good, do you want to create the repo? Trying to think of a good name, maybe "task-workers" or "ml-workers" in case we have other background tasks.
That sounds good, do you want to create the repo? Trying to think of a good name, maybe "task-workers" or "ml-workers" in case we have other background tasks.
@awm33, you pick the name and I'll create. I like both the suggestions. How will this repo be different than task-service?
@dhimmel `ml-workers` works then. The repo will house code consuming that service and the core API, but is not part of the service itself. It's the "Machine Learning Worker(s)" in the architecture diagram.
@awm33 I created `ml-workers` -- see https://github.com/cognoma/ml-workers. You're the maintainer.
@dhimmel Thanks! 🌮
Hello,
In an effort to build the global cognoma architecture, it would be very useful to determine an API which defines exactly what is given to the ML module (and incidentally what it will return).
As an example of strong API documentation, I believe OpenStack is a good start. Note how every module's API is listed, and how each route is described for each of those modules.
Some direct example for a cognoma API can be found here. This is a first specification for the frontend module.