Data handling - Githubissues

freivalds1 commented 8 years ago

As near as I can tell, the actual "classification" for a single entry in our data set arrays may sit at the first or last index (maybe elsewhere). That means we need to do one of two things: either edit the read-in step (say, when we fill in missing values) so the classification is always at the same index (probably index 0), or amend the template method with an additional variable that says which index holds the classification, so we know what to reassign/classify and what to compare against.
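A rough sketch of the first option, assuming the rows are plain Python lists. The function name `move_class_to_front`, the `class_index` parameter, and the sample values are hypothetical and only for illustration; they are not taken from the project.

```python
def move_class_to_front(rows, class_index):
    """Return new rows with the value at class_index moved to index 0.

    class_index may be negative (e.g. -1 for the last column).
    """
    normalized = []
    for row in rows:
        row = list(row)               # copy so the caller's data is not mutated
        label = row.pop(class_index)  # pull the classification out of the row
        normalized.append([label] + row)
    return normalized


# Made-up example: the classification is the last value in each entry.
raw = [[5, 3, "A"], [6, 2, "B"]]
data_entry = move_class_to_front(raw, class_index=-1)
```

The second option would instead leave the rows alone and hand `class_index` to every algorithm, which avoids the copy but means each algorithm has to honor the parameter.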

freivalds1 commented 8 years ago

I'm currently programming under the assumption that an entry's classification will be index 0 of its attributes.

gneznanski commented 8 years ago

Isn't each column a new category? If it is, we want to keep all the values in a column together rather than keeping each line together. That would need separate arrays for each column. Am I wrong on this?

In reply to freivalds1 (Thursday, November 10, 2016 6:53 PM):

> Further thought indicates it may be exceedingly useful to pass to the algorithm how many classifications/categories there are for a given entry, rather than determining at run time?
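One possible reading of that suggestion is to count the distinct classifications once at read-in and pass the number along, instead of having each algorithm rediscover it. A small sketch, assuming the classification sits at index 0; `count_classes` and the sample rows are hypothetical.

```python
def count_classes(data_entry, class_index=0):
    """Return the number of distinct classification values in the data set."""
    return len({row[class_index] for row in data_entry})


# Made-up rows with the classification at index 0.
data_entry = [["yes", 1, 0], ["no", 0, 0], ["yes", 1, 1]]
num_classes = count_classes(data_entry)
```

The resulting `num_classes` could then be passed to each algorithm as an ordinary argument.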


freivalds1 commented 8 years ago

Kinda lost me with your question. The way I currently interpret it, for any dataEntry[e][a], [e] is the entry and [a] is the attribute, so a table of 600 entries with 20 attributes each would be dataEntry[600][20]. I think we should always store the classification as the first attribute, for ease of testing/training.
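To make the layout concrete, here is a sketch of an entry-major table of that shape in Python, plus a helper that splits an entry under the "classification first" convention. The name `split_entry` and the placeholder zeros are assumptions for illustration.

```python
NUM_ENTRIES = 600
NUM_ATTRIBUTES = 20

# data_entry[e][a]: e selects the entry (row), a selects the attribute (column).
data_entry = [[0] * NUM_ATTRIBUTES for _ in range(NUM_ENTRIES)]


def split_entry(entry):
    """Split one entry into (classification, remaining attributes).

    Assumes the classification is stored at index 0.
    """
    return entry[0], entry[1:]


label, features = split_entry(["A", 3, 7])
```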

gneznanski commented 8 years ago

I'm asking if the 20 should be separate instead of bound to an entry. As in, column 1 for every entry is kept together, likewise column 2, 3, etc. I don't see why they need to be attached to the entry. The algorithms don't even use the entry value, only the values in each category. For example, the ID number in the breast cancer data has no impact on anything, since we don't care who it is, only their category information. For the voting data, dem/repub is probably important. Right?
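For comparison, a sketch of the column-per-array layout described here, built by transposing entry-major rows and dropping a column the algorithms do not need (such as an ID). The function `to_columns`, its `skip` parameter, and the sample rows are hypothetical.

```python
def to_columns(data_entry, skip=()):
    """Transpose entry-major rows into a list of columns.

    Column indices listed in skip (for example an ID column) are omitted.
    """
    columns = [list(col) for col in zip(*data_entry)]
    return [col for i, col in enumerate(columns) if i not in skip]


# Made-up rows: ID, classification, then two attributes.
rows = [[1001, "benign", 5, 1], [1002, "malignant", 8, 7]]
columns = to_columns(rows, skip={0})
```

Since either layout can be derived from the other this cheaply, the choice mostly affects which loops read more naturally.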


freivalds1 commented 8 years ago

Originally that was the plan with data objects, but we decided this implementation would be simpler to handle. For example, for my alg I can make multiple comparisons across entries for a single attribute using for loops, e.g. dataEntry[0][5] through dataEntry[599][5] for 600 entries. In the end, though, I think this is one of those things that would work fine either way; we just need to choose an implementation and stick with it.
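A sketch of that kind of loop over one attribute across all entries, using the entry-major layout. The function `attribute_range` and the sample data are made up for illustration and are not from the project.

```python
def attribute_range(data_entry, a):
    """Return (min, max) of attribute a across all entries."""
    lowest = highest = data_entry[0][a]
    for e in range(1, len(data_entry)):
        value = data_entry[e][a]  # same attribute, different entry
        lowest = min(lowest, value)
        highest = max(highest, value)
    return lowest, highest


# Made-up rows with the classification at index 0.
data_entry = [["x", 4, 9], ["y", 7, 2], ["x", 1, 5]]
low, high = attribute_range(data_entry, 1)
```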

LizzieHerman / MachineLearningExperimentation

Data handling #1