Closed YunchaoYang closed 1 year ago
Game Theory
Repeated Game Strategies
Finding the best response to a finite-state strategy in a repeated game is equivalent to solving an MDP
Tit-for-tat
Grim Trigger
Pavlov
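The three strategies above can be sketched in a simulated iterated Prisoner's Dilemma. This is a minimal illustration; the payoff values (3/0/5/1) are the standard textbook choices, not taken from these notes.

```python
# Iterated Prisoner's Dilemma with the three finite-state strategies above.
PAYOFF = {  # (my_move, their_move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(my_hist, their_hist):
    # Cooperate first, then copy the opponent's previous move.
    return their_hist[-1] if their_hist else "C"

def grim_trigger(my_hist, their_hist):
    # Cooperate until the opponent defects once, then defect forever.
    return "D" if "D" in their_hist else "C"

def pavlov(my_hist, their_hist):
    # Win-stay, lose-shift: keep the last move if the last payoff was
    # high (3 or 5), otherwise switch.
    if not my_hist:
        return "C"
    last_payoff = PAYOFF[(my_hist[-1], their_hist[-1])]
    if last_payoff >= 3:
        return my_hist[-1]
    return "C" if my_hist[-1] == "D" else "D"

def play(strat_a, strat_b, rounds=20):
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strat_a(hist_a, hist_b)
        b = strat_b(hist_b, hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, grim_trigger))  # mutual cooperation: (60, 60)
```

Any pair of these strategies cooperates forever when started against each other, which is what makes them candidates for supporting equilibrium cooperation.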
Folk Theorem: any feasible payoff profile that strictly dominates the minmax/security-level profile can be realized as a Nash equilibrium payoff profile, given a sufficiently large discount factor. (Translation of the Chinese note: if the players are sufficiently patient about the future, any degree of cooperation that is feasible and individually rational can be achieved through a subgame-perfect Nash equilibrium.)
In repeated games, the possibility of retaliation opens the door for cooperation.
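A quick numeric check of this intuition: against Grim Trigger in the Prisoner's Dilemma (standard payoffs C/C = 3, D/C = 5, D/D = 1, assumed here for illustration), cooperating forever beats a one-shot defection exactly when the discount factor is large enough.

```python
# Discounted value of cooperating forever vs. defecting once against
# Grim Trigger, as a function of the discount factor gamma.

def cooperate_value(gamma):
    # 3 + 3*gamma + 3*gamma^2 + ... = 3 / (1 - gamma)
    return 3 / (1 - gamma)

def deviate_value(gamma):
    # One-shot gain of 5, then mutual defection (payoff 1) forever.
    return 5 + gamma * 1 / (1 - gamma)

for gamma in (0.3, 0.5, 0.7):
    print(gamma, cooperate_value(gamma) >= deviate_value(gamma))
```

Solving 3/(1-γ) ≥ 5 + γ/(1-γ) gives γ ≥ 1/2: cooperation is self-enforcing once the players value the future enough, which is the folk-theorem condition in miniature.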
Feasible Region
MinMax Profile
(the worst case for one player): the payoff a player can guarantee even when the opponents act to minimize it — the best outcome achievable under the worst possible conditions.
Why is it a profile? A strategy profile is a set of strategies, one for each player, specifying how every player will act in the game; here the strategies may be mixed (randomized).
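The security level is easy to compute for a small matrix game. A minimal sketch, restricted to pure strategies for brevity (the full definition allows mixed strategies, which generally requires a linear program); the payoff matrix below is invented for illustration.

```python
# Pure-strategy security level of the row player in a matrix game.

def security_level(payoffs):
    # payoffs[i][j] = row player's payoff for row action i, column action j.
    # For each row action, assume the opponent picks the worst column;
    # the row player then takes the best of these worst cases.
    return max(min(row) for row in payoffs)

# Example: Prisoner's Dilemma payoffs for the row player.
A = [
    [3, 0],   # cooperate
    [5, 1],   # defect
]
print(security_level(A))  # defecting guarantees at least 1 -> prints 1
```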
Subgame Perfect
Plausible Threats: is there a formal definition in English with good intuition? Intuitively, a threat is plausible only if carrying it out would still be in the threatener's own interest at the moment it must be executed — which is exactly the requirement subgame perfection imposes.
Zero Sum Stochastic Games
Value Iteration works!
Minimax-Q converges
Unique solution to Q*
Policies can be computed independently
Update efficient
Q functions sufficient to specify policy
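The "Value Iteration works" claim can be sketched on a toy example. Below is a simplified minimax value iteration for a zero-sum stochastic game using the backup V(s) = max_a min_o Q(s, a, o); the real Minimax-Q algorithm solves a linear program for the mixed minimax value at each state, and the tiny two-state game here is invented for illustration.

```python
# Minimax value iteration on an invented two-state zero-sum game with
# deterministic transitions, restricted to pure strategies for brevity.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]    # maximizing player's actions
OPPONENT = [0, 1]   # minimizing player's actions

# reward[s][a][o]: immediate payoff to the max player.
reward = [
    [[1, 0], [0, 2]],
    [[0, -1], [3, 1]],
]
# next_state[s][a][o]: deterministic transitions.
next_state = [
    [[0, 1], [1, 0]],
    [[1, 1], [0, 1]],
]

V = [0.0, 0.0]
for _ in range(200):
    V = [
        max(
            min(reward[s][a][o] + GAMMA * V[next_state[s][a][o]]
                for o in OPPONENT)
            for a in ACTIONS
        )
        for s in STATES
    ]
print([round(v, 2) for v in V])  # converges to [9.0, 10.0]
```

Because the minimax backup is a contraction (just like the ordinary Bellman backup), the iterates converge to the unique V*, which is why Q* is unique in the zero-sum case.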
General Sum Stochastic Games
Value Iteration doesn't work
Minimax-Q doesn't converge
No unique solution to Q*
Policies cannot be computed independently
Update not efficient
Q functions not sufficient to specify policy
Reinforcement Learning: A Survey
Put an agent into a world (make sure you can describe it with an MDP!), give it rewards and penalties, and hopefully it will learn.
Markov Decision Processes
Model-Based vs. Model-Free
Three types of RL
Q Learning
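A minimal tabular Q-learning sketch on a toy chain MDP — the environment, hyperparameters, and episode count are all invented for illustration, not taken from any specific source.

```python
import random

# Chain MDP: states 0..4, actions left (-1) / right (+1);
# reward 1 only for reaching the terminal state 4.
random.seed(0)
N_STATES, ACTIONS = 5, [-1, +1]
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

for _ in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # off-policy TD target: r + gamma * max_a' Q(s', a')
        target = r + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# The greedy policy should be "go right" in every non-terminal state.
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
print(policy)
```

Note that Q-learning is model-free: the agent never sees the transition or reward functions, only sampled (s, a, r, s') transitions.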
What is entropy, and how is it calculated?
What is joint entropy, and how does it relate to mutual information?
What is K-L divergence?
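Sketch answers to the three questions above for discrete distributions, with the standard definitions; the example joint distribution is invented for illustration.

```python
from math import log2

def entropy(p):
    # H(X) = -sum_x p(x) * log2 p(x)
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def joint_entropy(pxy):
    # H(X, Y) over a joint probability table pxy[x][y]
    return entropy([p for row in pxy for p in row])

def mutual_information(pxy):
    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return entropy(px) + entropy(py) - joint_entropy(pxy)

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x p(x) * log2(p(x) / q(x))
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pxy = [[0.25, 0.25], [0.25, 0.25]]   # two independent fair coins
print(entropy([0.5, 0.5]))           # 1.0 bit
print(mutual_information(pxy))       # 0.0 -- independence means no shared info
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

Mutual information is exactly the gap between the sum of marginal entropies and the joint entropy, so it vanishes iff X and Y are independent; K-L divergence is asymmetric and nonnegative.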
Information Theory
Feature Selection
Feature Transformation
2 Unsupervised Learning
UL consists of algorithms that are meant to "explore" on their own and provide the user with valuable information concerning their dataset/problem
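A classic instance of this "explore on your own" idea is k-means clustering, which groups points without any labels. A minimal sketch; the two-blob dataset and k = 2 are invented for illustration.

```python
import random

# Two well-separated Gaussian blobs, no labels given to the algorithm.
random.seed(1)
data = ([(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(20)] +
        [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(20)])

def kmeans(points, k, iters=50):
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

centers = sorted(kmeans(data, 2))
print(centers)  # two centers, one near (0, 0) and one near (5, 5)
```

With no supervision at all, the algorithm recovers the two groups — the kind of "valuable information about the dataset" the note refers to.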