CodeSpaceHQ / MENGEL

A framework that applies machine learning algorithms and automates the process of finding the right algorithm for the job.

Getting input from user #103

Closed asclines closed 8 years ago

asclines commented 8 years ago

Scenario

Just to make sure everybody is on the same page, here is how I'm imagining the program working from the user's perspective.

While there are many modules & files in this project, as a user, I expect there to be a single point of entry where I can go and take care of everything.

Problem

As of now, we get input from the user on the command line. Personally, I do not think this is the best approach in the long run, for a couple of reasons:

  1. It requires the user to manually type, copy-paste, etc. the same input each time they run the program.
  2. Data validation happens at runtime while the user is entering the data, and if anything is wrong they have to start all over again. Any way of handling this will leave some, if not all, of the parties involved frustrated.

Possible Solution

I propose a different way of getting the user input. Instead of having the user go through a prompt or something equivalent, the user could keep all the settings, configurations, etc. in a single file and feed that file into the program. From this file and this file alone, the program would be able to get all the information it needs to work, without any further prompting of the user. The only time the user would need to get involved after that is for certain errors that occur and for the final output.

Proposed Implementation

Note: This is a work in progress and it is here that I really want to hear your input.

There are several approaches and several file formats that could be used for a configuration file; I think it should be done in XML.

Why XML as opposed to other file formats like JSON or ini?

Mostly for readability and ease of manually editing.
By having it as XML, the user can easily go through the config file and change the info as needed without getting too worked up about formatting or , : " { } etc. There are a couple more things, but those are going to be mentioned in the "Moving Forward" section.

Example

A first draft example I have come up with is this:

<MLTF-Configuration>
  <Project-Name>SE2-KaggleComp</Project-Name>
  <User-Name>asclines</User-Name>  
  <Firebase>
    <Database-URL>FirebaseDatabaseURL</Database-URL>
    <Service-Account>FullPathToServiceAccount</Service-Account>
  </Firebase>
  <Files>
    <File>
      <Path>/path/to/file</Path>
      <Split>
        <Train>100</Train>
        <Test>0</Test>
      </Split>
    </File>
    <File>
      <Path>/path/to/another/file</Path>
      <Split>
        <Train>60</Train>
        <Test>40</Test>
      </Split>
    </File>
  </Files>
  <Prediction>
    <Target>TargetColumn</Target>
    <Type>Regression</Type>
  </Prediction>  
  <Models>
    <!-- Model overrides go here -->
  </Models>
</MLTF-Configuration>
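
To make the single-file idea concrete, here is a minimal sketch of how the program might read this draft layout. It assumes Python and the standard library's `xml.etree.ElementTree`; the `load_config` helper and the keys of the dictionary it returns are hypothetical names chosen for illustration, not part of the framework.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the first-draft layout, embedded so the sketch runs on its own.
SAMPLE = """
<MLTF-Configuration>
  <Project-Name>SE2-KaggleComp</Project-Name>
  <Files>
    <File>
      <Path>/path/to/file</Path>
      <Split><Train>100</Train><Test>0</Test></Split>
    </File>
    <File>
      <Path>/path/to/another/file</Path>
      <Split><Train>60</Train><Test>40</Test></Split>
    </File>
  </Files>
  <Prediction>
    <Target>TargetColumn</Target>
    <Type>Regression</Type>
  </Prediction>
</MLTF-Configuration>
"""


def load_config(xml_text):
    """Parse the draft layout into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    files = []
    for node in root.findall("./Files/File"):
        files.append({
            "path": node.findtext("Path"),
            "train": int(node.findtext("Split/Train")),
            "test": int(node.findtext("Split/Test")),
        })
    return {
        "project": root.findtext("Project-Name"),
        "files": files,
        "target": root.findtext("Prediction/Target"),
        "type": root.findtext("Prediction/Type"),
    }


config = load_config(SAMPLE)
```

Everything the rest of the program needs would then come from `config`, with no prompts along the way.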

Moving Forward

Again, this is a first draft that I threw together just to give an idea. There are a couple concerns I have with it but before I go forward with this idea, I wanted to run it by the team and get y'all's input first.

That being said, there are a couple more things I want to say.

Namespaces & Schemas

If we move forward with this XML plan, we can bring in things like XML namespaces and schemas to help not only with readability, but also with enforcing certain requirements on various attributes & elements. This will allow the user to do a lot of the data validation themselves before the program is even run, saving everyone some time and pain. In addition, using something like XSLT, we can quickly load, validate and edit the configuration file during runtime.
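
As a rough illustration of validating before the program runs, the sketch below checks a config and reports every problem at once, so the user does not have to start over after the first mistake. It is a hand-written stand-in, not a real schema: Python's standard library has no XSD validator, so an actual schema would need a third-party library. The `validate_config` name, the list of required elements, and the rule that Train and Test must add up to 100 are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def validate_config(xml_text):
    """Return a list of every problem found; an empty list means the file is usable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    errors = []
    if root.tag != "MLTF-Configuration":
        errors.append(f"unexpected root element <{root.tag}>")

    # Assumed set of required elements, taken from the first-draft example.
    for required in ("Project-Name", "Files", "Prediction/Target", "Prediction/Type"):
        if root.find(required) is None:
            errors.append(f"missing <{required}>")

    for index, node in enumerate(root.findall("./Files/File"), start=1):
        try:
            train = int(node.findtext("Split/Train", default=""))
            test = int(node.findtext("Split/Test", default=""))
        except ValueError:
            errors.append(f"file {index}: Train and Test must be integers")
            continue
        # Assumed rule: the two percentages must cover the whole file.
        if train + test != 100:
            errors.append(f"file {index}: Train + Test is {train + test}, expected 100")
    return errors
```

A caller would print the returned list and stop if it is not empty, before any data is loaded.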

<Files> ... </Files>

In the example implementation I have a files section. One thing I like about it is the way it's generic, allowing the user to simply put all the data files they wish to use here.
The way it's currently set up, the ordering is "File" -> "Test/Train Split". This could mean that the files get split first, then the training parts are merged and the testing parts are merged.
Another way it could be done is "Test/Train Split" -> "Files" which could mean all the files get merged first then split. This clearly needs to be hashed out more, but I wish to defer that until input from people like @RyanMcBerg is taken in. AKA, someone with a better understanding of how this could work out.
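
To show that the two orderings really do give different results, here is a small sketch in Python. The function names are hypothetical, rows are plain list items, and the split simply takes the leading rows with no shuffling; all of that is assumed for illustration only.

```python
def split_rows(rows, train_percent):
    """Take the first train_percent of rows for training, the rest for testing."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def split_then_merge(files):
    """'File' -> 'Split': each file is split with its own ratio, then parts are merged."""
    train, test = [], []
    for rows, train_percent in files:
        file_train, file_test = split_rows(rows, train_percent)
        train += file_train
        test += file_test
    return train, test


def merge_then_split(files, train_percent):
    """'Split' -> 'Files': all rows are merged first, then one ratio is applied."""
    merged = [row for rows, _ in files for row in rows]
    return split_rows(merged, train_percent)


# Two toy files: the first is 100% training, the second is split 60/40.
files = [(["a1", "a2", "a3", "a4"], 100), (["b1", "b2", "b3", "b4", "b5"], 60)]
per_file = split_then_merge(files)
whole = merge_then_split(files, 60)
```

With per-file ratios, every row of the first file stays in training; with a single ratio applied after merging, that guarantee is gone, which is the trade-off to settle.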

<Prediction> ... </Prediction>

I'm not sure this is the best way to group the data here, but this is what I came up with. @ASAAR Thoughts?

<Models> ... </Models>

I expect this to be set up very dynamically. Basically the idea I have for this section can be thought of as "overriding" or "setting initial parameters" for various models. Here the user can add their own settings to a model that they feel might help the testing get off to the right start. Or, and what I feel will be the more common case, the information here can be the parameters from the last time the model was run, allowing the framework to sort of pick up where it left off, or at least save some time by not starting over. Again, my ideas for this are pretty rough and as a team we should hash this out.

Missing Data

I like the idea of giving the user the option to specify what ought to happen when it comes across missing data.
The first question I think should be answered is where should this be specified? At the project level or the model level? Personally, I'm toying with the idea of both where the user can specify at project-level (like a default option) and then still have the option to override at a model level if needed for a few situations.
The second question I think should be answered is what options we give the user. Of course there should be the obvious one of simply giving the program a value to replace the missing data with. But we could expand on this by also giving the user the option of having the program fill the missing data with the mean/mode/median value of that column (where applicable).

How would this look?

Here's an example of how I think this should look, regardless of the answers to the previous questions.

<Missing-Data>
  <Variable>
    <Name>Column Name</Name>
    <Value>0</Value>
  </Variable>
</Missing-Data>

This will allow the user to specify the default value per variable.
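
A minimal sketch of the two questions above, assuming Python: a rule is either a literal replacement value or one of the strategy names mean/median/mode, and a model-level rule overrides the project-level one for the same column. The helper names, the use of `None` to mark a missing entry, and the fallback of 0 when no rule is given are assumptions for illustration.

```python
from statistics import mean, median, mode

# Strategy names the user could write instead of a literal value (assumed spelling).
STRATEGIES = {"mean": mean, "median": median, "mode": mode}


def fill_missing(column, rule):
    """Replace None entries using a strategy name or a literal replacement value."""
    present = [value for value in column if value is not None]
    if rule in STRATEGIES:
        replacement = STRATEGIES[rule](present)
    else:
        replacement = rule
    return [replacement if value is None else value for value in column]


def resolve_rule(column_name, project_rules, model_rules):
    """Model-level rules override project-level defaults for the same column."""
    return model_rules.get(column_name, project_rules.get(column_name, 0))
```

For example, `fill_missing(ages, resolve_rule("Age", project_rules, model_rules))` would apply whichever rule wins for that column. A column with no present values would need extra handling, since the statistics functions raise an error on empty input.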

And as a final note, all naming is still subject to change. I am not really attached to the naming convention used here and if you have a better idea, please say so.

Update: The current SampleConfig.xml is:

<MLTF-Configuration
  name="SE2-KaggleComp"
  user="asclines" >
  <Firebase
    url="FirebaseDatabaseURL"
    account="FullPathToServiceAccount"/>
  <Files>
    <File
      type="test"
      path="/path/to/file"/>
    <File
      type="train"
      path="/path/to/file"/>
  </Files>
  <Prediction
    target="TargetVariable"
    type="Regression"/>
  <Models>
    <Model name="SomeModel">
      <Param name="Param1"
        numeric="true"
        defaultValue="0"
        delta="2"
        rangeStart="0"
        rangeEnd="10" />
      <Param name="Param2"
        numeric="false"
        defaultValue="value">
        <Value> value1 </Value>
        <Value> value2 </Value>
        <Value> value3 </Value>
      </Param>
    </Model>
  </Models>
</MLTF-Configuration>

ZakeryFyke commented 8 years ago

I don't have too much to comment as far as naming conventions go, however I definitely like the idea of using a file over constantly getting input from the command line since it's so cumbersome.

isaac-gs commented 8 years ago

Yeah, I absolutely agree with this idea. I just wanted to have Zak work on something which we could use immediately, rather than two weeks from now (since that would act as a road block for other work).

So yeah, I really like this idea.

asclines commented 8 years ago

@ZakeryFyke @ASAAR, I agree that this shouldn't immediately replace the user input as handled by #102. I'm just getting the discussion rolling on the next stage of user input retrieval as this project grows. Clearly this idea isn't fully worked out and will take time to create. That's why I am asking you guys for input on Moving Forward.

ZakeryFyke commented 8 years ago

To be certain I understand, we would have an XML file that the user could edit in order to change their desired settings such that they don't have to interact constantly with the command line? Is XML the easiest way for the user to specify these settings? I don't have an alternative in mind, just curious.

isaac-gs commented 8 years ago

I'd also suggest having some code that runs and forces the user to provide input if none has been given yet. That way we guarantee that we have configurations to use.
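
One way this could look, as a minimal Python sketch: if no usable configuration file has been supplied, keep asking until one is. The `get_config_path` name and the prompt text are hypothetical, and the `ask` parameter exists only so the prompt can be swapped out (for example in tests).

```python
import os


def get_config_path(path, ask=input):
    """Return a path to an existing config file, prompting until one is given."""
    while not (path and os.path.isfile(path)):
        path = ask("Path to configuration file: ").strip()
    return path
```

This keeps the config file as the normal route while still guaranteeing that the program has configurations to use.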

asclines commented 8 years ago

@ZakeryFyke Yes you are correct in how it will be used. As for whether or not XML is the easiest way; IMHO I believe it to be the best way to meet all our requirements while not making it too difficult for the user. That being said, there could be other ways that haven't come to mind and if you find any, now is the time to speak up.

isaac-gs commented 8 years ago

@asclines Continuing from prior conversation. Sorry about not doing this sooner.

Files

I like the idea but I could see that getting overly complicated. I suggest we either support single file or directory style training/testing data. Maybe even within their own subgroups like so,

Prediction

Could also call it "Objective", but I like it. We can also include stuff in there like the "evaluation metric" in case there's a specific kind of scoring being used.

Models

I don't know if this is what you had in mind, but this is a rough idea.

The ideas regarding hyperparameters are guided by this.

Thoughts?

asclines commented 8 years ago

Files

So we do not want a way to determine testing/training split? Or is that something we want the works to handle per model? Also, I may be mixing up terminology here, in this scenario do we define testing and training differently than when we are evaluating models?

Prediction

Objective makes more sense actually.

Models

Yes, that's more or less what I was looking for. Could you elaborate on model name and how that comes into play? As for increment/range, we may need to work out more how to define the "limits" of a parameter with respect to whether or not it's a numerical value. Would it be better to just initially describe a value as numeric or not and branch the limiting options off of that idea? I.e., having completely different options for limiting a parameter depending on whether or not it is numeric?
This might be better than trying to hybrid this.

isaac-gs commented 8 years ago

@asclines

Files

Right, yeah I forgot to include that. I'd make that a bullet underneath "Train" since it doesn't apply to test. In this case, training is labeled data, testing is unlabeled.

Models

I was thinking we could try and guarantee a unique human friendly ID, which would be the name. The function name would be what is used to "find" the algorithm in the framework.

Can you explain your second point more? I don't feel like I totally understand what you mean. I feel like I agree but I want to make sure.

asclines commented 8 years ago

@ASAAR

Files

Okay, that makes me think we should make a terminology section of the wiki that has project-wide definitions for terms used as well as the scoped definitions of those same terms. Thoughts?

Models

Is this human friendly ID similar to the "label" idea that MLTA-record uses?

As for my second point, so to recap; we are trying to find a way for a user to override default values for a certain hyper-parameter (HP) used by a model. In a previous comment by @ASAAR on this issue, there were three variables a user could define that would help them define a HP's value; default value, increment value, range. That makes sense for numeric values and @ASAAR even tried to define the terms in a way that a non numeric value could be used.

Before I continue I want to clarify what I mean by limits. A limit for a HP is all the user-defined properties used to define things such as the HP's default value, and under what requirements a model's HPs can be altered by. What I'm proposing is we don't do it quite like the way @ASAAR states. The terms @ASAAR used make perfect sense for numeric data but not so much with non-numeric data. I think it would be easier in the long run to base what the user can use as limits on whether or not the data is numeric.

If the data is numeric, the limits can be: default value, increment/decrement delta, range.
If the data is not numeric, the limits can be: default value, and some other properties. I don't have any really good ones in mind right now; I'm just pointing out a different idea in the hope that somebody else might be able to carry it forward. I'm looking at you: @ZakeryFyke, @RyanMcBerg, @telelu03, @ASAAR

asclines commented 8 years ago

Also, does anyone have any preferences on element tags vs. element attributes? I'm starting to think some things might be better off as tags. Here is an example that uses element attributes and also reflects the changes discussed on this issue so far:

<MLTF-Configuration
  name="SE2-KaggleComp"
  user="asclines" >
  <Firebase
    url="FirebaseDatabaseURL"
    account="FullPathToServiceAccount"/>
  <Files>
    <File
      type="test"
      path="/path/to/file"/>
    <File
      type="train"
      path="/path/to/file"/>
  </Files>
  <Prediction
    target="TargetVariable"
    type="Regression"/>
  <Models>
    <Model name="SomeModel">
      <Param 
        numeric="true"
        defaultValue="0"
        delta="2"
        rangeStart="0"
        rangeEnd="10" />
      <Param 
        numeric="false"
        defaultValue="value"/>        
    </Model>
  </Models>
</MLTF-Configuration>

isaac-gs commented 8 years ago

@asclines sorry, forgot to respond to the one that's two up.

Files

That sounds like a good idea.

Models

Yeah, I agree that the original one assumed numeric. Although my idea was that if it was not recognized as a numeric, then the range would just be all values that the parameter could be.

Update

I like tags, it makes it a lot cleaner! One comment though, do we want param to have a name tag? Otherwise it might be hard to match the param values with the name of the hyperparameter or other value.

I hope I covered it all.

asclines commented 8 years ago

@ASAAR Yes, the param tag should have a name attribute. Thanks for pointing that out.

As for the ranges, let me work on that and get back to you. But while I do, do you have any specific way in mind on how to actually write the non-numeric values down in configuration?

Here is an updated sample of the model XML element. Thoughts?

  <Model name="SomeModel">
      <Param name="Param1"
        numeric="true"
        defaultValue="0"
        delta="2"
        rangeStart="0"
        rangeEnd="10" />
      <Param name="Param2"
        numeric="false"
        defaultValue="value">
        <Value> value1 </Value>
        <Value> value2 </Value>
        <Value> value3 </Value>
      </Param>
    </Model>
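
To illustrate branching on the numeric flag, here is a minimal Python sketch that turns each Param into the list of values it may take. The `candidate_values` helper is hypothetical, and treating rangeEnd as inclusive is an assumption; the thread has not settled that.

```python
import xml.etree.ElementTree as ET

# The Model element from the sample, embedded so the sketch runs on its own.
MODEL_XML = """
<Model name="SomeModel">
  <Param name="Param1" numeric="true" defaultValue="0"
         delta="2" rangeStart="0" rangeEnd="10" />
  <Param name="Param2" numeric="false" defaultValue="value">
    <Value> value1 </Value>
    <Value> value2 </Value>
    <Value> value3 </Value>
  </Param>
</Model>
"""


def candidate_values(param):
    """List the values a hyperparameter may take, branching on the numeric flag."""
    if param.get("numeric") == "true":
        start = float(param.get("rangeStart"))
        end = float(param.get("rangeEnd"))
        delta = float(param.get("delta"))
        if delta <= 0:
            raise ValueError("delta must be positive")
        # Assumption: rangeEnd is inclusive when it falls exactly on a step.
        count = int((end - start) // delta) + 1
        return [start + step * delta for step in range(count)]
    # Non-numeric: the allowed values are the <Value> children, whitespace trimmed.
    return [value.text.strip() for value in param.findall("Value")]


model = ET.fromstring(MODEL_XML)
search_space = {p.get("name"): candidate_values(p) for p in model.findall("Param")}
```

Here `search_space` maps each parameter name to its candidates, which also shows why the name attribute on Param is needed.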

isaac-gs commented 8 years ago

I would do something like this. Does that help? I don't know how annoying it is to have a list for the valueSet....

numeric="false" defaultValue="default" valueSet="default, optionOne, option_two, stuff"
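
For what it's worth, reading a list like that back is not much work. A minimal Python sketch, assuming commas separate the options and surrounding spaces are not significant (the `parse_value_set` name is hypothetical):

```python
def parse_value_set(raw):
    """Split a comma-separated valueSet attribute into trimmed, non-empty options."""
    return [option.strip() for option in raw.split(",") if option.strip()]


options = parse_value_set("default, optionOne, option_two, stuff")
```

The catch is that an option containing a comma could not be written this way without some escaping rule.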

asclines commented 8 years ago

The valueSet way that @ASAAR suggests would be adequate for a list of small size. But remember, the reason for using XML at all is readability from the user's perspective. With that in mind, if the valueSet were rather large, it might be harder on the user than the list of elements I have in the comment above.

isaac-gs commented 8 years ago

Right, but in the case I was talking about, that wouldn't be for numbers. It would be for strings. Numbers should be covered with ranges and deltas.

asclines commented 8 years ago

@ASAAR Sorry I mis-typed. I meant a list of small size.

isaac-gs commented 8 years ago

Right, but we're unlikely to have a valueSet of a large size with strings. We also don't need it at all with numerics.

isaac-gs commented 8 years ago

We also need to include a "unifying Id" for the different datasets. For example with the titanic dataset, you are given a test dataset with passengerId and you must predict if they survived or not. In order to provide results, you need that Id ahead of time.
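
A minimal sketch of what that could mean in practice, assuming Python and rows held as dictionaries: the Id column named in the configuration is set aside and paired with each prediction in row order. The `attach_ids` name and the sample rows are made up for illustration.

```python
def attach_ids(rows, id_column, predictions):
    """Pair each row's identifier with its prediction, preserving row order."""
    if len(rows) != len(predictions):
        raise ValueError("one prediction is required per row")
    return [(row[id_column], prediction) for row, prediction in zip(rows, predictions)]


# Made-up rows standing in for a test dataset that has an Id column.
test_rows = [{"PassengerId": 1, "Age": 30.0}, {"PassengerId": 2, "Age": 45.0}]
results = attach_ids(test_rows, "PassengerId", [0, 1])
```

The configuration would then only need to name the Id column, for example alongside the prediction target.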

asclines commented 8 years ago

So like IDs for time-series stuff?