FieldDB / FieldDB

An offline/online field database which adapts to its user's terminology and I-Language. http://fielddb.github.io
http://lingsync.org

Get some articles from the deep learning talk #1131

Closed cesine closed 10 years ago

cesine commented 10 years ago

U de M

deep learning and its applications (games, player matchmaking)

  1. intro to machine learning
  2. deep learning paper
  3. more recent paper on autoencoders
  4. future work

AI definition (a moving definition): "the study of how to make computers do things at which, at the moment, people are better"

rapid progress in the past few years

since the 90s, statistical ML

represent knowledge and reason with it.

Newton ~ equations; then rules, logic, expert systems; now statistics from data

shows examples of knowledge representation (with plots)

2d space

digits can be separated if you represent them in the right way. Use the gap to build a classifier.

frequentist learning (focus of the talk):

Model is P(D|θ), the probability of the data D given the parameters θ

maximize the likelihood function

equivalently, minimize the divergence between the unknown underlying distribution and our model
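
In symbols (my reconstruction, not verbatim from the slide):

```latex
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)
  \;\approx\; \arg\min_{\theta} D_{\mathrm{KL}}\!\big(p_{\mathrm{data}} \,\|\, P(\cdot \mid \theta)\big)
```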

i.i.d.: independently and identically distributed

another estimator, regularized: add a lambda term to penalize the complexity of the model (a function of theta)

MLE_reg

MAP (maximum a posteriori) can be similar, because the log of the prior can play the role of the regularization term
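
Roughly, in my notation (the two coincide when the log-prior plays the role of minus lambda times the penalty):

```latex
\hat{\theta}_{\mathrm{MLE\text{-}reg}} = \arg\max_{\theta}\; \log P(D \mid \theta) - \lambda\, \Omega(\theta)
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; \log P(D \mid \theta) + \log P(\theta)
```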

"curse of dimensionality" bellman 61

as dimensionality increases, the volume becomes sparse and you will never have enough data

data set size needed has to grow exponentially

solution: we can exploit the structure of the data.

learn a distributed and hierarchical representation

showed an example of a deep neural net with binary inputs (with h_1 etc. hidden units)

2^4 input configurations map to 2^5 hidden representations, but you don't need that many parameters: 4x5 is only 20 parameters, and you get more with more layers

more layers = more abstract representations
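
A minimal numpy sketch of the parameter-counting point; the 4-input / 5-hidden sizes are just the example from the slide, the weights and threshold activation are my placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 binary inputs -> 5 hidden units: a 4x5 weight matrix is only 20 parameters,
# yet the hidden layer has 2^5 = 32 possible binary configurations.
W = rng.normal(size=(4, 5))
b = np.zeros(5)

def layer(x, W, b):
    """One feed-forward layer: affine map followed by a threshold."""
    return (x @ W + b > 0).astype(int)

x = np.array([1, 0, 1, 1])        # one of the 2^4 = 16 binary inputs
h = layer(x, W, b)                # a distributed 5-bit hidden representation
print(W.size, "parameters ->", h)
```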

2: Video game matchmaking

stacked calibration for an off-policy problem; looked at the old literature, which didn't correspond to video game matchmaking, so they proposed stacked calibration

today will focus on the 2nd part, how to train the model

"what makes a game fun?" Malone 1981, Yannakakis 2007

First person shooter:

the 3rd measure works best for prediction of player enjoyment

compared the result with Microsoft's TrueSkill measure; Ubisoft developers agree that the 3rd measurement is an objective measure of player enjoyment(?)

regression problem: 1 - |log(how many kills by A / how many kills by B)|

the log penalizes outliers when the ratio is too big or too small

take the absolute value of the log to ignore whether A or B was winning.

how far the ratio is from 1 measures how unbalanced the match was (that's why it's 1 - c)
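
A small sketch of the balance measure as I understood it from the talk; the exact form and the epsilon for zero kills are my guesses:

```python
import math

def balance(kills_a: int, kills_b: int, eps: float = 1e-6) -> float:
    """1 - |log(kills_a / kills_b)|: near 1 for even matches, smaller when lopsided.

    eps avoids division by zero / log(0) when one side has no kills (my addition).
    """
    ratio = (kills_a + eps) / (kills_b + eps)
    return 1.0 - abs(math.log(ratio))

print(balance(10, 10))   # 1.0, perfectly balanced
print(balance(20, 5))    # ~-0.39, very one-sided
```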

inputs to the equation

pairwise ranking (player skill as one number)

Elastic Net (Zou & Hastie 2005)

the target can be predicted by a linear combination of the inputs, penalized by L1 (variable selection) and L2 (to avoid overfitting)
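
A hedged sklearn sketch of the elastic net baseline; the synthetic data and hyperparameters are placeholders, not values from the paper:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 620))   # stand-in for the 620-dim match features
y = rng.normal(size=1000)          # stand-in for the balance target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# alpha controls overall penalty strength, l1_ratio the L1 vs L2 mix
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_tr, y_tr)
mse = np.mean((model.predict(X_te) - y_te) ** 2)
print("test MSE:", mse, "non-zero coefficients:", np.sum(model.coef_ != 0))
```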

leaving the linear model, they tried regression trees (Quinlan 1993)

start with the root and distribute the training set into the (blue) leaf nodes

prediction is finding a path down the tree for any point in the test set and returning the average of that leaf node.

instead, can use a random forest (Breiman 2001)

  1. bootstrapping randomness
  2. growing randomness in how many features to split on

prediction is based on averaging the k trees in the forest (bagging)
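
Another hedged sketch, this time of the regression tree and random forest baselines; the toy data and hyperparameters are arbitrary:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=1000)   # toy non-linear target

tree = DecisionTreeRegressor(max_depth=5).fit(X, y)   # a single regression tree
forest = RandomForestRegressor(
    n_estimators=100,        # k trees, predictions are averaged (bagging)
    max_features="sqrt",     # growing randomness: features considered per split
    bootstrap=True,          # bootstrapping randomness
).fit(X, y)

print(tree.predict(X[:3]), forest.predict(X[:3]))
```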

READ Breiman

can do boosting: gradient boosted trees (Friedman 1999)

have a loss function, defined as the mean squared error between the label and the function's prediction, with step size alpha,

don't take the gradient with respect to the parameter theta, but instead with respect to the function.

then the gradient is fit with a regression tree.

update the model by a gradient descent step

all the steps combined form the model
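
A minimal sketch of the functional-gradient idea for squared loss, where the negative gradient with respect to the function's outputs is just the residual, so each step fits a small tree to the residuals; the data, step size alpha, and tree depth are placeholders:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

alpha = 0.1                      # step size
F = np.full_like(y, y.mean())    # start from a constant model
trees = []

for _ in range(100):
    residual = y - F                                  # negative gradient of 0.5*(y - F)^2 w.r.t. F
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    F += alpha * t.predict(X)                         # gradient step in function space

def predict(X_new):
    """The model is the sum of all the steps."""
    return y.mean() + alpha * sum(t.predict(X_new) for t in trees)

print(np.mean((predict(X) - y) ** 2))                 # training MSE
```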

now maxout networks, a feed-forward neural network (Goodfellow et al. 2013)

h1 and h2 are maxout units, which each see a subset of the layer below and take the max over their pool. Repeat that structure across layers.

train with dropout to avoid overfitting
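
A rough numpy sketch of a single maxout layer: each maxout unit computes its own pool of linear feature detectors and outputs the max (sizes are arbitrary; dropout is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(x, W, b):
    """x: (n_in,), W: (n_in, n_units, pool_size), b: (n_units, pool_size).

    Each unit computes pool_size linear functions of x and keeps the max.
    """
    z = np.einsum("i,iup->up", x, W) + b   # (n_units, pool_size) pre-activations
    return z.max(axis=1)                   # max over each unit's pool

n_in, n_units, pool_size = 620, 100, 5
W = rng.normal(scale=0.01, size=(n_in, n_units, pool_size))
b = np.zeros((n_units, pool_size))

x = rng.normal(size=n_in)
h1 = maxout_layer(x, W, b)    # stack more of these for a deep maxout net
print(h1.shape)               # (100,)
```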

READ Goodfellow 2013

436,000 matches, 160,000 players

inputs in R^620 (620 dimensions)

model was selected by cross validation

showed the best models of all the algorithms,

can see that the maxout performed best

mean squared error (MSE) on the test set was the measure; a tighter, smaller mean is better, i.e. around 0.06 to 0.065

did a t-test to see if the algorithms were significantly different

MLP and random forest were not significantly different
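
A hedged sketch of how such a comparison could be run, as a paired t-test over per-fold test MSEs; the numbers below are made up, not from the talk:

```python
from scipy import stats

# made-up per-fold test MSEs for two models
mse_maxout        = [0.060, 0.061, 0.059, 0.062, 0.060]
mse_random_forest = [0.064, 0.065, 0.063, 0.066, 0.064]

t, p = stats.ttest_rel(mse_maxout, mse_random_forest)   # paired t-test across folds
print(f"t = {t:.2f}, p = {p:.4f}")   # small p => the difference is significant
```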

3: autoencoders / generative stochastic networks

use corruption to build a corrupted data set. How and why does it work?

ergodic: a probability of going to any other state.

the only way to make it random is the theta parameter; can use a non-parametric model like the Parzen density estimator (a mixture of Gaussians, each centered around a training point)

can use it for the conditional distribution of x given the corrupted ~x
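
A small numpy sketch of a Parzen (kernel density) estimator, a mixture of isotropic Gaussians with one component centered on each training point; the bandwidth and data are arbitrary:

```python
import numpy as np

def parzen_logpdf(x, train, sigma=0.2):
    """Log density of a mixture of Gaussians centered on the training points."""
    train = np.atleast_2d(train)   # (n, d)
    x = np.atleast_1d(x)           # (d,)
    d = train.shape[1]
    sq = np.sum((train - x) ** 2, axis=1)
    log_k = -sq / (2 * sigma ** 2) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return np.logaddexp.reduce(log_k) - np.log(len(train))

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))          # stand-in training set
print(parzen_logpdf(np.zeros(2), data))   # higher near the data, lower far away
```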

another choice for the denoising model is parametric: denoising autoencoders. Take x as input, corrupt it with the corruption function, then feed the corrupted x into the model.

from y, the encoding, we decode to get the reconstructed x̂ and measure the error against the real x, with mean squared error or cross entropy.
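
A compact PyTorch sketch of the parametric denoising autoencoder described here: corrupt x, encode, decode to x̂, and minimize the reconstruction error against the clean x. The layer sizes and the Gaussian corruption level are my placeholders:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=256):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decode = nn.Sequential(nn.Linear(n_hidden, n_in), nn.Sigmoid())

    def forward(self, x_tilde):
        y = self.encode(x_tilde)   # y: the hidden code of the corrupted input
        return self.decode(y)      # x_hat: the reconstruction

model = DenoisingAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)            # stand-in batch; real MNIST digits would go here

for step in range(200):
    x_tilde = (x + 0.3 * torch.randn_like(x)).clamp(0, 1)   # C(x~|x): Gaussian corruption
    x_hat = model(x_tilde)
    loss = nn.functional.binary_cross_entropy(x_hat, x)     # error against the *clean* x
    opt.zero_grad()
    loss.backward()
    opt.step()
```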

experiment: a toy example to push the asymptotic limit of the theorem. Training set of dimension 1, x has 10 discrete values (0-9); generate 5000 training examples from a multinomial distribution with parameter theta

(the probabilities of generating each of the discrete values)

then corrupt by sampling from a normal around x, rounding to an integer, and taking mod 10: you move away from the original value and still get a value that is in the data set

basically a 10x10 probability table filled with the counts of frequencies in the data set

essentially generating white noise.

sigma is the corruption noise.
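
A sketch of the toy setup as described above; the particular theta values and the exact rounding rule are my guesses at what was on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# true distribution over the 10 discrete values 0..9 (made-up theta, peaked at 6)
theta = np.array([.02, .03, .05, .07, .10, .15, .30, .15, .08, .05])
x = rng.choice(10, size=5000, p=theta)   # 5000 training examples

def corrupt(x, sigma):
    """Add Gaussian noise, round to an integer, wrap mod 10."""
    return np.round(x + rng.normal(scale=sigma, size=x.shape)).astype(int) % 10

x_tilde = corrupt(x, sigma=2.0)

# 10x10 table of counts: rows = corrupted value, columns = clean value
table = np.zeros((10, 10))
np.add.at(table, (x_tilde, x), 1)
cond = table / table.sum(axis=1, keepdims=True)   # empirical P(x | x~), the denoising table
print(np.round(cond, 2))
```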

showed a bar graph of the values and their counts; it has a peak at 6. If the noise is 0.1 the samples don't estimate the empirical distribution well, they concentrate at 9: the Markov chain never mixes and can't get from 9 to the other numbers.

if the noise is increased to 2, the samples are better.

if you add lots of noise, the bars will meet; the denoising will not be different from the actual distribution.

still using the toy dataset.

a 1D manifold in 10D space, a single curve in 10D space.

used normal Gaussian noise centered around x.

for the denoising model they used the conditional Parzen estimator

hard to visualize a 10D set, so they plotted pairs of axes

axes i and j: samples generated by the Markov chain, and the original data. The range is not the same on the graphs, but the pictures should be similar.

Y says it would be better if the units were the same. More realistic setting: MNIST data,

0.5 salt and pepper noise.

the model can do a good job of denoising.
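
A quick numpy sketch of 0.5 salt-and-pepper corruption on an image batch (just the corruption step, not the model):

```python
import numpy as np

def salt_and_pepper(x, p=0.5, rng=np.random.default_rng(0)):
    """Pick a fraction p of the pixels and set each of them to 0 or 1 at random."""
    mask = rng.random(x.shape) < p                    # which pixels get corrupted
    noise = (rng.random(x.shape) < 0.5).astype(x.dtype)
    return np.where(mask, noise, x)

x = np.random.default_rng(1).random((64, 784))        # stand-in for a batch of MNIST digits
x_tilde = salt_and_pepper(x, p=0.5)
print(np.mean(x != x_tilde))                          # roughly half the pixels were touched
```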

now talk about the walkback procedure.

normal training is in the rectangle, with encode and decode; in walkback you add 2 more boxes,

now you can have 2 reconstruction errors; you can think of it as running a Markov chain during the training, since the goal is to use the Markov chain for the decoding

backpropagation can go through to the sample, or is blocked (depending on the sampling: if continuous, it can still backpropagate through the sampling; they did binary sampling, so they couldn't backpropagate through it)

what does walkback do? Bengio et al 2013

blue manifold of the data; denoising training is wandering around and trying to go back to the data manifold. The arrow is the direction a data point follows to get back to the manifold, and there are sparse spots that the x's never visit: anything that goes into those spots doesn't go back to the manifold. Minimizing the reconstruction error takes you from x back to the manifold; more walkback steps, more space visited.
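
A sketch of the sampling chain this is building toward, plus the walkback idea of also corrupting the model's own reconstructions during training; the `model`, `corrupt`, and `reconstruction_loss` names are placeholders for whatever denoiser is being trained, not the authors' code:

```python
def sample_chain(model, corrupt, x0, n_steps=100):
    """Alternate corrupt -> denoise; the visited x's are samples from the learned joint."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x_tilde = corrupt(x)   # C(x~ | x)
        x = model(x_tilde)     # P(x | x~): denoise back toward the manifold
        samples.append(x)
    return samples

def walkback_losses(model, corrupt, reconstruction_loss, x, k=3):
    """Walkback training: also corrupt the model's own reconstructions and ask it
    to map those back to the clean x, so spurious off-manifold regions get visited."""
    losses = []
    x_current = x
    for _ in range(k):
        x_tilde = corrupt(x_current)
        x_hat = model(x_tilde)
        losses.append(reconstruction_loss(x_hat, x))   # always target the original x
        x_current = x_hat                              # wander further, then walk back
    return losses
```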

shows generated samples with no walkback, and generated samples with walkback.

lots of 9s, very few 4s, some 7s, some 1s

with walkback we have far more representation of the digit space.

how to quantify the effect of walkback?

part 4

future work

we need multimodal reconstruction

the current denoiser assumes independence between the reconstructed dimensions; that can be relaxed. Shows an example using the probability chain rule

the reason we need multimodal reconstruction is because of "mean" digits: you can see combinations of digits piling up together. With multimodal reconstruction you can generate the 2nd bit conditioned on the 1st bit, the 3rd bit based on the 2nd and 1st bits, etc.

that's what probability chaining is...
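
The chain rule being referred to, in my notation: instead of reconstructing each dimension independently given x̃, condition each one on the previously reconstructed dimensions:

```latex
P(x_1, \ldots, x_d \mid \tilde{x})
  = \prod_{i=1}^{d} P\big(x_i \mid x_1, \ldots, x_{i-1}, \tilde{x}\big)
```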

NADE: some have done this using the neural autoregressive distribution estimator (see other students' work)

shows NADE with the corrupted x feeding into the NADE

adds ~x to be conditioned upon, which becomes the bias of the NADE model.

can model a multimodal distribution with that equation

a special case of generative stochastic networks: if you add hidden layers above, you can have more hidden units with more noise to mix well.

the goal: to use the GSN to pretrain an MLP, which is standard for U de M; something about classification and reconstruction trained together.

using GSN for classification, for exploration.

question period

idea to use more hidden layers to manage dimensionality.

Li says that the number of hidden nodes (small or large) isn't the main way to fight dimensionality; the idea is to reduce parameters.

forward to deep learning for video games: 1 - |log(A/B)|

isn't it non-differentiable? Yes, not everywhere, because there is an absolute value. Isn't it very unsmooth? If you fix the kills of B, doesn't it look like ^? Wasn't that a problem? The audience says you predict y, you don't backpropagate through the...

the cost function is just mean squared error.

Elastic net: why did you try it rather than just linear regression? Li says they wanted to incorporate L1 and L2 penalization, so they used it; but if you tune the regularization coefficients it converges to a linear regression

the features you feed in are 620 dimensions, from concatenating the features? 8 players per team, each has x features; sum up all players in a team for the team feature, and concatenate team A and team B to give the dimensions

trying to predict the kill feature, based on the other stuff (bullets, weapons, critical shots which hit the body)

strange to concatenate for each team and feed it into the network, because the network should compare the squared distance between critical shots between teams, i.e. those comparisons should be features in the network, not just concatenations of the features as the base data

Y mentions: what about networks that have one part per team? They tried a network that has a unit with weights connected to the features of a team; it can train weights to distinguish teams.

Delalleau has another model that is close to what Roland is suggesting, to predict player fun (not the kill ratio): another model of features per team, with the difference of the features fed to another model. It was for visualization.

the value is going to be in the difference between the teams; could do feature engineering to improve things, and run better than lots of different models.

Yann's question: concatenation, can the features be recovered through the concatenation? The point is that if you know a good feature, it's better to put it in. Previous games, and the number of critical shots against different teams, might not be that predictive, so that could weaken the value, which might mean it's not that awesome as a feature.

MSE around 0.08 at worst and 0.06 at best: can you give an intuition for whether 0.06 is good? Would it be much better if you used better features than concatenating all the features for the team?

now to GSN denoising

when you defined P(x | x~)

the corruption distribution C(x~ | x)

shouldn't it be possible to infer theta given the corruption? If the autoencoder model is linear (i.e. Gaussian noise), can you characterize it: is it easy to write down P(x | x~) if P(x) is Gaussian and C(x~ | x) is Gaussian? Can you use Bayes' rule to derive P(x)? But Y says we don't have P(x), so it's moot

When you do the Parzen estimator, what happens when you just sample from it? For the 10D worms, what would a Parzen estimator do, and a sample from it, wouldn't it look similar too? So why use 10D rather than 2D? Was it to make the dimension high enough that a simple density estimator couldn't do well? Why did we do this experiment? Verification of a theory: we are interested in the conditional distribution; the joint is complicated, multimodal, and has a complicated shape. We want to model that distribution, so we found a proxy, the conditional distribution: noise, denoise, noise, denoise, and you can sample from the joint. A workaround for the joint distribution, that's why 10D and not 2D. Why not just use the Parzen density estimator? It's non-parametric, can't scale, and has to memorize all training examples; you would need exponentially many Gaussians to model local patches in high-dimensional space.

just to prove that if you follow the Markov chain you can...

Last question: walkback training looks like PCD, tell me the differences? Li says it's an analogy, not mathematical; they borrowed the idea from contrastive divergence. Negative examples in PCD push up, and in the autoencoder walkback goes back to the manifold.

Sebastien, outside of the field, but digits come back to computer vision.

About the matchmaking, the 620 features etc.: is every player an independent entity, only related to his own past? What about the relation between players playing together with microphones? How do you tackle that relationship? Li says this is what the model is doing: in the non-linear model, it tries to establish a non-linear relationship between player features, other player features, and the targets; that's why the MLP outperforms the linear models

the player interaction is implicit in the model, so it's in there.

Y: did we use embeddings? A mapping of all player interactions to a hidden layer, meaning if a good sniper teams up with an aggressive player you get a balanced game; you could say "player 5 plays with player 7, this is what is going to happen". But that didn't help for that model

Ground truth: for vision you go out and take pictures; in the case Li used, what does it mean? Sebastien tried to find out if a survey could be the ground truth. No, humans don't answer truthfully. But if you make a ground truth won't it introduce a bias? Introducing a regularity in the data that will cause some models to fare better than others.

balance is hard to measure; we only have proxies. So try...

regarding salt and pepper noise applied to images:

but there are deterministic methods that can do that (filters to remove noise), so why did you use salt and pepper, and what is noise in that context? We assume a low-dimensional manifold of x in the high-dimensional space

salt and pepper noise moves you off the manifold?

700 image intensities, and salt and pepper is a rapid intensity change, moving really far, while some dimensions are not moving at all. Very different from Gaussian noise, which moves a little bit in every dimension; that's how you stay close to the original. Because Gaussian doesn't move away enough, we need something dramatic but not too far. How about rotating the pictures by 90 degrees? If you randomly rotate, is it noise, and will it work for the GSN? Don't know; it may be hard to train a model to do that. If you have a conditional model of the rotation, it should give back the non-rotated image. Y says it wouldn't work: the chain stays on x's, so you aren't visiting off-manifold points, you stay on the manifold.

Why that sample size? Picked a number to get a good estimate: the more data we have, the better the estimator of theta. In a toy example it's not that hard, you don't need that many samples; the variation is not a lot, not like object recognition with rotation, viewing angle, and lighting.

Why do 10D, why not reduce to 5 and take the most indicative dimensions?

If you do PCA/SVD it would be the same. You lose information; you don't know if it's different or not, it's lost. You'd need a score for relevancy: how do you know that all 10 dimensions are being used?

The task here is to show it's exactly the same, so it wouldn't matter in that case.

Y's questions:

Curse of dimensionality: is it really the number of dimensions that matters? Li: it's actually not, when the dimensions can be linearly expressed with a small part of the data set

Explanation of the p-values table; Li explained it using the whiteboard to recover from his flub earlier.

slide in second paper

For the theoretical justification, what is the meaning of n? The number of training examples.

Then a question from Jason: why is walkback better than having more corruption, i.e. a bigger sigma in the corruption? If you have a bigger sigma, it's harder to get back; why isn't that true of walkback? (Because walkback is a probability chain, and if each step is recoverable you can get there, but if it's one big leap you don't have enough entropy.)

Because we train with walkback, the model contains the Markov chain intuition. Walkback helps focus on the x~ that have spurious modes; without walkback you can't zap them as well, you need more corruption to get to the spurious modes. Walkback is more efficient: it goes to the places where it matters. And that's why there is an analogy with CD?

If the manifold is a circle, or U-shaped, and the density is too high in the center, is adding noise going to interfere with the other side of the cycle? It will start reconstructing the ones on the other side.

Two extreme cases: noise is 0 and noise is infinite. The theorem goes wrong in both, but what else?

With no noise, it fixes on one sample (the 9). With too much noise, the model can't recover the original manifold: from one digit it should be able to go back to all the other digits. To fix that, we can ask the model to go to one digit with some probability and the others with other probabilities, with a multimodal distribution.

cesine commented 10 years ago

http://www.iro.umontreal.ca/~bengioy/yoshua_en/Highlights.html

get references for:

hisakonog commented 10 years ago

dropbox > inuktitut_materials > machine learning

cesine commented 10 years ago

Here is some more stuff on gamification in data entry, from Stack Overflow's "then a miracle occurred":

http://blog.stackoverflow.com/2013/09/five-years-ago-stack-overflow-launched-then-a-miracle-occurred/

Five years ago, Stack Overflow launched. Then, a miracle occurred. 09-16-13 by Jay Hanlon. 36 comments

Stack Overflow officially launched on September 15, 2008. In five short years, you’ve answered over 5 million questions on more than 100 sites, and helped hundreds of millions of people find the answers they needed. Today, we want to celebrate how, together, we changed one small corner of the Internet for the better.

We want to hear your stories about how someone on Stack Exchange helped you.

“Then, a Miracle Occurs”

Before it went into beta, stackoverflow.com had a comic on the landing page that came to symbolize what we were setting out to do:

We knew what our goal was, and we had some idea how to start, but the entire thing working was predicated on that middle step: “then a miracle occurs”. The original vision statement was ambitious:

It is by programmers, for programmers, with the ultimate intent of collectively increasing the sum total of good programming knowledge in the world. No matter what programming language you use, or what operating system you call home. Better programming is our goal. (from Introducing Stack Overflow, emphasis added) It was a gamble: would people really take time out of their busy lives to answer other people’s questions, for nothing more than fake internet points and bragging rights?

It turns out that people will do anything for fake internet points.

Just kidding. At best, the points, and the gamification, and the focused structure of the site did little more than encourage people to keep doing what they were already doing. People came because they wanted to help other people, because they needed to learn something new, or because they wanted to show off the clever way they’d solved a problem.

Which was lucky for us. Because here’s the crazy secret about gamification: In the history of the world, gamification has never gotten a single person to do anything they didn’t already basically like to do.

In the midst of everyone’s individual reason for coming, somewhere among the hundreds, and then thousands of people who showed up to answer each other’s questions and hammer out how the site should actually work, the miracle actually occurred.

An incredible number of people jumped at the chance to help a stranger

So far, you’ve provided helpful answers to over five million questions. Those answers are seen by forty-four million people looking for help each month.

To put those numbers in perspective:

- That’s more people helped each month than visit the New York Times, Bank of America, or Apple.com.
- If the people helped each month were a US state, it’d be bigger than California and almost twice as big as Texas.
- If they were a country, it’d be in the top 15% of nations in the world, with more people than Canada, Argentina, or Poland. It’d be practically two Yemens.
- If you put one frog in a football stadium for each of the 44MM people who get help here each month, that would be forty-four MILLION frogs. Think about that. But don’t say it out loud. People are quick to judge.

Making the Internet a Better Place

The next chapter of Stack Exchange is still being written. A few years ago, we widened our vision beyond programmers. Our new goal was simple, if a bit daunting:

Make the Internet a better place to get expert answers to your questions.

We asked people what other sites they wanted, and carefully started launching them, one at a time. Each time, we were counting on a group of experts to come together and start asking and answering each other’s questions. There have been a few failures along the way, but overall, the successes have been amazing.

We’re now up to 106 sites, including some outstanding ones on System Administration, Computers, Mathematics, Ubuntu, Video Games, and Cooking, and some young upstarts like our site for English Language Learners. If there’s a site you want to see that doesn’t exist yet, you can still propose it on Area 51.

At the same time, Stack Overflow is continuing to grow, and we are doing our best to keep it healthy. The short history of the internet is littered with communities that started out great, but slowly petered out under the weight of flame wars, mass-n00bocide, funny cat pictures, or just boredom waiting for the next big thing. We still need your help to keep Stack Overflow focused on its core mission: collectively increasing the sum total of good programming knowledge in the world.

Tell Us Your Story

We want to hear your stories. Looking at numbers is one thing, but hearing from real, live people about how someone’s effort here helped them is entirely different. So, if someone’s post here ever saved your day at work, or convinced you to buy your daughter an SLR and learn photography together, take a minute to recognize the person who wrote the answer that mattered to you.

If you’re somebody who mostly answers questions, share how you got involved and what keeps you coming back. Or tell us about someone who taught you something before we even existed. They deserve to be recognized for the way their investment in you is getting passed on to others here today. If Stack Exchange got you interested in a new topic or taught you a new trick for an old one, we want to hear about it.

Stack Exchange has always been about a community of people helping each other out. It was a long shot when it launched, but you made it work. Now, let’s take a few minutes to recognize everything that we’ve achieved together.

gretchenmcc commented 10 years ago

Another source for gamification of language stuff (here, language learning) might be Duolingo, which is an app for language learning although at the moment they have mostly Romance languages. The overall structure and lesson styles might be inspiring for the Learn [Language] app though. More on their blog: http://blog.duolingo.com/