CodeSpaceHQ / MENGEL

A framework that applies machine learning algorithms and automates the process of finding the right algorithm for the job.

Separation of Responsibilities for Data Prep #98

Closed isaac-gs closed 8 years ago

isaac-gs commented 8 years ago

Hey everyone, so for the most part we've decided that data cleaning and prep tasks should go in the DMZ for each worker to use. However, there is one more issue that I'd like to talk about.

What about basic cleaning and splitting of the data for training? Did we cover this? On the one hand, we don't want to repeat work. On the other, we don't want to take too much away from worker flexibility.

Example,

One last thing. If we have workers do their own data prep and cleaning, should we also have it so they can request tickets with specific models or data configurations for less repeated work? I realize that is a long-term concern, but it should be considered.

Thoughts? @asclines @RyanMcBerg @ZakeryFyke @telelu03

asclines commented 8 years ago

I think the hub should handle initial data modifications, as there might be some preprocessing that needs to be done. In addition to what @ASAAR has mentioned, another thing the hub could handle is defaults for empty data.

isaac-gs commented 8 years ago

Yeah, I can agree with that. We'll just need to remain careful. Do we also want to split the data on the Hub or no? It would mean that every model is getting the same split. I don't know if that's a good or bad thing, but we could always change it later (as long as we make it modular enough).
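
To make the trade-off concrete, here is a minimal sketch of a split that stays modular enough to run on either side. The names `split_rows`, `test_fraction`, and `seed` are hypothetical and not taken from the MENGEL codebase. If the Hub fixes one seed, every model sees the same split; if each worker picks its own seed, the splits differ.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=None):
    """Shuffle a copy of `rows` and return (train, test).

    Hypothetical helper for illustration only. The same `seed`
    always yields the same split; `seed=None` gives a fresh one.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)      # private RNG, global state untouched
    shuffled = list(rows)          # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    data = list(range(10))
    hub_split = split_rows(data, seed=42)     # shared split: fixed seed
    worker_split = split_rows(data, seed=7)   # per-worker split: own seed
    print(len(hub_split[0]), len(hub_split[1]))
```

Because the seed is just a parameter, moving the split from the Hub to the workers (or back) would not require changing the function itself.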

asclines commented 8 years ago

@RyanMcBerg thoughts? Whenever I see data splitting I think of you.

asclines commented 8 years ago

Proposed data prep responsibilities (from the 10/30 meeting, in the order they should be handled):

Hub:

  1. Read data in from file
  2. Merge files if needed
  3. Image prep

Worker:

  1. Missing data
  2. Dimensionality reduction
  3. Splitting data
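
A rough sketch of how that ordering could look in code, assuming CSV input and plain Python lists of dicts. Every function name here is hypothetical and not part of MENGEL. Image prep is left out, and the "reduction" step is only a stand-in that keeps a chosen subset of columns rather than a real dimensionality reduction method.

```python
import csv
import io
import random


# --- Hub side (hypothetical helpers) ---

def hub_read(source):
    """Hub step 1: read CSV rows from a file-like object into dicts."""
    return list(csv.DictReader(source))


def hub_merge(tables):
    """Hub step 2: concatenate the row lists from several files."""
    merged = []
    for table in tables:
        merged.extend(table)
    return merged


# --- Worker side (hypothetical helpers) ---

def worker_fill_missing(rows, default="0"):
    """Worker step 1: replace empty or absent values with a default."""
    return [
        {key: (value if value not in ("", None) else default)
         for key, value in row.items()}
        for row in rows
    ]


def worker_reduce(rows, keep):
    """Worker step 2 (stand-in): keep only the columns named in `keep`."""
    return [{key: row[key] for key in keep} for row in rows]


def worker_split(rows, test_fraction=0.25, seed=0):
    """Worker step 3: shuffle a copy and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    file_a = io.StringIO("x,y,label\n1,,a\n2,5,b\n")
    file_b = io.StringIO("x,y,label\n3,6,a\n4,,b\n")
    rows = hub_merge([hub_read(file_a), hub_read(file_b)])   # Hub
    rows = worker_fill_missing(rows)                          # Worker
    rows = worker_reduce(rows, keep=["x", "label"])
    train, test = worker_split(rows)
    print(len(train), len(test))
```

Keeping each step as its own function means a responsibility can move between Hub and Worker later without rewriting the others.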