FireCARES / research


ML Questions and Thoughts #3

Open chopchop505 opened 5 years ago

chopchop505 commented 5 years ago
  1. Which AWS account should we deploy to (FireCARES or StatEngine)?

  2. What is the preferred dump format for training data from Elasticsearch? The easiest is a line-delimited JSON file via elasticdump. It would be an entire dump (all fields, all departments). In your notebook (in the "Upload the data for training" section), you could then retrieve this data dump and do the pre-processing/subsetting for the model in question.

  3. Do you want continuously updated training data? If you'll be tweaking models frequently, is it best practice to use the same static training dataset or to continuously add to it? It doesn't matter to me; we can export as often as daily, but that might be overkill.

  4. Do you plan on using a different model for each department, or a single model that takes the FireCARES ID as a heavily weighted feature? A single model obviously makes deployment easier, but probably complicates the model significantly (I don't know enough about ML).

  5. For deployment, it's cheaper to do batch predictions, but easier to do on-demand (especially for future models). Not a question, just something we should chat about.

  6. We probably want to think about how to manage multiple experiments sooner rather than later. See example here: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/search/ml_experiment_management_using_search.ipynb

This was a great example, and we could literally have this in production tomorrow!

Deploying your custom model is going to take a bit of lifting (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html), but should be doable. Would it be easier to translate your model to a supported framework like scikit-learn or TensorFlow instead of building a custom container?
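For question 2, here is a rough sketch of what the pre-processing/subsetting step could look like once the line-delimited dump is available. It assumes each line is a full Elasticsearch hit with the document under `_source`, and it uses a made-up `firecares_id` field and made-up sample records; the real field names in our index may differ.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sources(lines: Iterable[str], department_id: Optional[str] = None) -> Iterator[dict]:
    """Yield the document from each line of a line-delimited JSON dump."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        hit = json.loads(line)
        # Assumption: each line is a full hit with a `_source` key;
        # fall back to the whole object if the dump holds bare documents.
        doc = hit.get("_source", hit)
        # `firecares_id` is a hypothetical field name for the department ID.
        if department_id is not None and doc.get("firecares_id") != department_id:
            continue
        yield doc


# Made-up sample lines standing in for the real dump file.
sample_dump = [
    '{"_id": "a1", "_source": {"firecares_id": "93345", "incident_type": "EMS"}}',
    '{"_id": "b2", "_source": {"firecares_id": "81147", "incident_type": "FIRE"}}',
    '',
    '{"_id": "c3", "_source": {"firecares_id": "93345", "incident_type": "FIRE"}}',
]

records = list(iter_sources(sample_dump, department_id="93345"))
print(len(records))
```

With a real dump, an open file handle can be passed in place of `sample_dump`, and the resulting list of dicts can go straight into `pandas.DataFrame(records)` in the notebook.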

chopchop505 commented 5 years ago
  1. @garnertb

  2. Joe will load the dump into a pandas DataFrame.

  3. Retrain when enough new data is available. For incidents per day, maybe every couple of months. It depends on the model.

  4. Assume 1 model per department for simple models like this one

  5. Probably best to assume batch jobs to save money.
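To make answers 3 to 5 concrete, a minimal sketch of how the pieces might fit together: one model per department keyed by FireCARES ID, a retrain check based on how much new data has arrived, and predictions produced in a single batch pass. The weekday-mean baseline is only a placeholder, not Joe's actual model, and the function names, the 10% threshold, and the sample numbers are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def fit_per_department(rows):
    """Fit one placeholder baseline per department: mean incidents per weekday.

    rows: iterable of (department_id, day, incident_count) tuples.
    Returns {department_id: {weekday_index: mean_count}}.
    """
    counts = defaultdict(lambda: defaultdict(list))
    for dept, day, n in rows:
        counts[dept][day.weekday()].append(n)
    return {
        dept: {wd: mean(vals) for wd, vals in by_weekday.items()}
        for dept, by_weekday in counts.items()
    }


def batch_predict(models, requests):
    """Predict for many (department_id, day) pairs in one pass."""
    results = []
    for dept, day in requests:
        model = models.get(dept, {})
        # None means the department or weekday was never seen in training.
        results.append((dept, day, model.get(day.weekday())))
    return results


def should_retrain(n_new_rows, n_trained_rows, min_fraction=0.1):
    """Retrain once new rows reach a fraction of the rows already trained on."""
    return n_new_rows >= min_fraction * n_trained_rows


# Made-up history: two Mondays for one department, one Monday for another.
history = [
    ("93345", date(2019, 1, 7), 10),
    ("93345", date(2019, 1, 14), 14),
    ("81147", date(2019, 1, 7), 3),
]
models = fit_per_department(history)
predictions = batch_predict(
    models,
    [("93345", date(2019, 1, 21)), ("81147", date(2019, 1, 22))],
)
```

Keeping the models in a plain dict keyed by department means a batch job can loop over departments without any per-request endpoint, which is where the cost saving in answer 5 would come from.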