Title: Computational Performance of Deep Reinforcement Learning to find Nash Equilibria
Summary: In this study, the researchers test the performance of deep deterministic policy gradient (DDPG), a deep reinforcement learning algorithm, at finding Nash equilibria in settings where firms set prices for their products. The motivation is to compute counterfactual outcomes within a pre-defined economic mechanism; here, the goal is to simulate “strategic bidding in capacity constrained uniform price auctions”. The researchers consider only the case of exactly two competing firms. DDPG is used to learn a bidder’s actions, and its hyperparameters are tuned. Because benchmark solutions are available, the DDPG results can be measured against them. The researchers find that DDPG performs well for studying competition in capacity constrained uniform price auctions without strong behavioral assumptions on the agents. In addition, the normalization method used in a DDPG model affects its convergence, which, together with the memory model, also influences the agents’ willingness to participate in the auction.
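To make the method concrete, here is a minimal sketch of a single DDPG update for one bidder with a continuous bid/price action. The environment interface, network sizes, hyperparameters, and the per-batch normalization are illustrative placeholders, not the paper's actual implementation.

```python
# Hedged sketch of one DDPG actor-critic update for a continuous bid action.
# Dimensions, networks, and normalization are illustrative only.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 1  # e.g., (own capacity, rival's last bid, demand, time) -> bid in [-1, 1]

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def normalize(x):
    # The paper notes that the normalization scheme matters for convergence;
    # per-batch standardization is just one simple choice.
    return (x - x.mean(0, keepdim=True)) / (x.std(0, keepdim=True) + 1e-8)

def ddpg_update(obs, act, rew, next_obs, gamma=0.99, tau=0.005):
    # obs/next_obs: (batch, obs_dim); act: (batch, act_dim); rew: (batch, 1)
    obs, next_obs = normalize(obs), normalize(next_obs)

    # Critic step: regress Q(s, a) onto the bootstrapped target.
    with torch.no_grad():
        next_q = critic_targ(torch.cat([next_obs, actor_targ(next_obs)], dim=1))
        target_q = rew + gamma * next_q
    critic_loss = ((critic(torch.cat([obs, act], dim=1)) - target_q) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor step: move the policy toward actions the critic scores highly.
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging keeps the target networks slowly tracking the online ones.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_t in zip(net.parameters(), targ.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)
```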
Expansions to social science analysis: I found this paper very interesting mainly for two reasons:
New dataset exploration: I think economists could do more to apply advanced computational techniques (including but not limited to deep learning) to microeconomic questions, such as price-setting problems and quantifying consumers' consumption habits in different markets.
Title: A Study of Reinforcement Learning for Neural Machine Translation
Summary: In this paper, taking several large-scale translation tasks as testbeds, the authors conduct a systematic study of how to train better NMT models using reinforcement learning. They provide a comprehensive comparison of several important factors (e.g., baseline reward, reward shaping) in RL training. Furthermore, to address the open question of whether RL is still beneficial when monolingual data is used, the authors propose a new method that leverages RL to further boost the performance of NMT systems trained with source/target monolingual data. By integrating all of their findings, they obtain competitive results on the WMT14 English-German, WMT17 English-Chinese, and WMT17 Chinese-English translation tasks, and in particular set state-of-the-art performance on the WMT17 Chinese-English task.
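For intuition, here is a hedged sketch of the kind of policy-gradient (REINFORCE) objective with a baseline reward that the paper studies. The `model.sample` interface and the `sentence_reward` function (e.g., sentence-level BLEU) are hypothetical stand-ins, not the authors' code.

```python
# Illustrative REINFORCE-with-baseline loss for sequence generation.
import torch

def rl_loss(model, src, ref, sentence_reward, baseline=0.0):
    sample, log_probs = model.sample(src)   # hypothetical API: sampled tokens + their log-probs
    reward = sentence_reward(sample, ref)   # scalar reward, e.g., sentence-level BLEU
    # Subtracting a baseline (e.g., a running average of rewards, or the reward
    # of a greedy decode) reduces the variance of the policy gradient.
    advantage = reward - baseline
    return -(advantage * log_probs.sum())
```

Reward shaping, one of the factors compared in the paper, would instead distribute partial rewards across intermediate decoding steps rather than assigning a single sentence-level reward at the end.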
Possible Application: My first instinct is to apply such translation technology to the understanding of ancient languages. Many translations of ancient texts have been produced by a very small number of experts with a great deal of linguistic knowledge; if we could apply this technology to converting ancient languages into modern ones, human use and understanding of ancient texts could be enhanced even further.
Database: Lexicity. It is the first and only comprehensive index of ancient-language resources on the internet. It contains an index of online resources for the study of Akkadian, Aramaic, Coptic, Egyptian, Ge’ez, Old Georgian, Gothic, Greek, Hebrew, Latin, Old Church Slavonic, Old English, Old Norse, Sanskrit, Syriac, and Ugaritic, with relevant links to new material, out-of-copyright grammars and texts on archive.org, and so forth.
Title: Deep Reinforcement Learning for Trading
Summary: The researchers applied state-of-the-art reinforcement learning algorithms (deep Q-learning networks (DQN), policy gradients (PG), and advantage actor-critic (A2C)) to algorithmic trading, maximizing a utility function of the expected cumulative trade returns. The reinforcement learning models were tested on 50 futures contracts from 2011 to 2019 and were found to outperform classical time-series momentum strategies.
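One simple way to express the kind of per-step reward such agents maximize is position times next-period return, net of transaction costs and optionally scaled by recent volatility so that risk is penalized. This is a generic sketch of the idea, not the authors' exact reward specification; the cost and volatility-target numbers are made up.

```python
# Generic per-step trading reward: position * next return, minus transaction
# costs, optionally volatility-scaled. Constants are illustrative only.
import numpy as np

def step_reward(position, next_return, prev_position, recent_returns,
                cost_per_unit=1e-4, target_vol=0.10):
    vol = np.std(recent_returns) * np.sqrt(252) + 1e-8   # annualized recent volatility
    scaled_position = position * target_vol / vol          # trade smaller when markets are noisy
    cost = cost_per_unit * abs(scaled_position - prev_position)
    return scaled_position * next_return - cost
```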
Expansion to social science: The authors discuss extending the reinforcement learning approach to trading by taking other objectives into account, such as minimizing risk or volatility.
Data: Researchers could experiment with reinforcement learning on other prediction games (e.g., other financial products, sports betting (FanDuel, DraftKings), or even casino games).
Collaborative Multi-Agent Dialogue Model Training Via Reinforcement Learning (Uber AI)
Summary: The authors build a model that concurrently trains conversational agents that communicate only via self-generated language. They train two networks - natural language understanding (NLU) and generation (NLG) for each agent and let the agents interact online. The interactions are modeled as a stochastic collaborative game where each agent (player) has a role (“assistant”, “tourist”, “eater”, etc.) and their own objectives, and can only interact via natural language they generate. Each agent needs to learn to operate optimally in an environment with multiple sources of uncertainty (its own NLU and NLG, the other agent’s NLU, Policy, and NLG). They demonstrate that their stochastic-game agents outperform deep learning-based supervised baselines. Each agent learns in a decentralized setting, only observing the other agent’s language output and a reward signal.
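The decentralized setup can be sketched as a simple turn-taking loop in which each agent only ever sees the other agent's generated utterance plus a reward signal. The `Agent` components (nlu, policy, nlg, observe) and the `env` interface below are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Sketch of the decentralized interaction loop: each agent observes only the
# other agent's utterance and a shared reward. Agent/env are hypothetical.
def play_dialogue(agent_a, agent_b, env, max_turns=10):
    utterance = env.opening_prompt()
    speaker, listener = agent_a, agent_b
    for _ in range(max_turns):
        state = speaker.nlu(utterance)        # noisy understanding of the other's words
        action = speaker.policy(state)        # dialogue act chosen under uncertainty
        utterance = speaker.nlg(action)       # noisy realization in natural language
        reward, done = env.step(utterance)    # reward from the collaborative game
        speaker.observe(reward)               # each agent learns from reward + language only
        listener.observe(reward)
        if done:
            break
        speaker, listener = listener, speaker  # turn-taking
```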
Social science application: Applying RL-based techniques to model agents conversing in natural language is interesting because it might be able to model discourse online. There are some salient similarities between the RL paradigm set up in this paper and real-world discourse. First, the modeling involves multiple agents conversing only through language. Second, the agents learn by observing language output and a reward, which in an online setting could be measured through increased positive interaction with the agent's output, such as 'likes' or 'upvotes'. We might be able to work backwards and parameterize agents' 'policies' by observing how real-world users alter their language after receiving social feedback on their speech.
Possible dataset: How do Twitter users evolve and learn from their engagement with each other? We could model the evolution of agents' Twitter statuses in response to the feedback (likes/retweets) they receive on their statuses. We could parameterize a user's 'policy' (which could be a political position or a conversational-state parameter) by fitting an RL model that best matches the evolution of actual speech observed online (this idea is far out given the current state of RL, but might be feasible!).
Title: Social Diversity and Social Preferences in Mixed-Motive Reinforcement Learning
Summary: This research studied the effect of population heterogeneity on mixed-motive reinforcement learning. Reinforcement learning has a natural connection with multi-agent models because the learning algorithms can serve as rules for agents' actions. The researchers drew on interdependence theory from social psychology and used Social Value Orientation (SVO) to reflect the preferences of different types of agents. The diversity of SVO was explored via two Markov games. Specifically, mixed-motive means that the group's incentives are sometimes aligned and sometimes in conflict. An agent's social value orientation was defined as the tradeoff between its own reward and the rewards of the other agents in the environment, represented by a reward angle; agents with different SVOs therefore have different utility functions for responding to the environment. Training was carried out with the advantage actor-critic algorithm (A2C) and a neural network, in two temporally and spatially extended mixed-motive games: Harvest Patch and Cleanup.
They found that heterogeneity in SVO generates meaningful and complex behavioral variation among agents, similar to that suggested by interdependence theory, and that agents trained in heterogeneous populations developed particularly generalized, high-performing policies for dealing with mixed-motive dilemmas, relative to those trained in homogeneous populations.
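The reward-angle idea is concrete enough to write down. A minimal sketch, assuming the angle mixes an agent's own reward with the mean reward of the other agents through cosine and sine weights; the population draw at the end is purely illustrative.

```python
# Social Value Orientation as a reward angle: utility blends an agent's own
# reward with the mean reward of the other agents in the environment.
import numpy as np

def svo_utility(own_reward, others_rewards, theta_degrees):
    """theta = 0 -> purely selfish; theta = 45 -> weighs self and group equally;
    theta = 90 -> purely altruistic."""
    theta = np.radians(theta_degrees)
    group_reward = np.mean(others_rewards)
    return own_reward * np.cos(theta) + group_reward * np.sin(theta)

# A heterogeneous population is then just a draw of reward angles, e.g.:
population_svo = np.random.uniform(0, 90, size=12)   # one angle per agent
```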
Extension to Social Research: One contribution of this study is its exploration of agent heterogeneity in reinforcement learning systems. Here the authors used SVO to characterize the diversity of the agents; other possibilities might include Big Five personality traits or risk-aversion levels for analyzing different social science questions. The A2C algorithm is also a potential choice for other researchers who want to observe policy and strategy formation among societal agents. Additionally, the evaluation process provides a valuable example of how to interpret the results of reinforcement learning methods in terms of economic return, equality, and prosocial traits of the agents.
New dataset exploration: According to the papers I have looked through about DRL, many of them actually use simulated data during training as the information from the environment, which is quite different from other neural networks performing supervised or unsupervised tasks. One possibility I am considering is that we could use real datasets that trace the evolution of people's decisions, reactions, or policy-making to train a DRL model and approximate the strategies that real people use, and then compare that to a model based on a pre-defined optimal strategy for the agents. In this case, we might identify the differences (such as the rationality or irrationality of the agents) between the real world and the optimized world, and draw instructive lessons for socioeconomic development.
AlphaPortfolio: Direct Construction Through Reinforcement Learning and Interpretable AI https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3554486
Summary: In a nutshell, the portfolio allocation problem is to find weights on individual stocks that lead to optimal performance (e.g., in terms of the Sharpe ratio) subject to certain constraints. The classical paradigm has its limitations: expected returns and (large) covariance matrices are difficult to estimate. This paper adopts a reinforcement learning approach to explore and exploit the parameter space. Specifically, the state space is a wide array of past stock characteristics, the action the agent takes is its choice of portfolio weights, and each episode is a year of the state-action trajectory (with an effective training sample of 25 years). The final reward is the average Sharpe ratio across episodes. With this setup, the authors construct a deep learning framework with three components: (1) a sequence representation extraction module (SREM); (2) a cross-asset attention network (CAAN); and (3) a portfolio generator. In the first step, the authors use a Transformer encoder and a long short-term memory network to transform the recent history of stock characteristics into a latent state variable for each asset. They then use the cross-asset attention network to compute a winner score for each asset from the learned latent state. In the final step, they go long the stocks with the highest winner scores and short the ones with the lowest winner scores. Overall, the algorithm provides a flexible mapping from characteristics and their recent history to portfolio weights. Note that this is a huge state space: 51 characteristics times roughly 3,600 stocks per month times a 12-month history. The authors rely on reinforcement learning to solve this optimization problem.
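A sketch of the last two steps described above: turning cross-sectional winner scores into long-short weights and using the realized Sharpe ratio as the reward. The decile cutoff, equal weighting, and monthly annualization are illustrative choices; the paper's portfolio generator is more involved.

```python
# Sketch: map winner scores to a long-short portfolio and compute a
# Sharpe-ratio reward. Cutoffs and weighting are illustrative.
import numpy as np

def long_short_weights(winner_scores, frac=0.1):
    n = len(winner_scores)
    k = max(1, int(frac * n))
    order = np.argsort(winner_scores)
    weights = np.zeros(n)
    weights[order[-k:]] = 1.0 / k     # long the highest-scoring assets
    weights[order[:k]] = -1.0 / k     # short the lowest-scoring assets
    return weights

def sharpe_reward(monthly_portfolio_returns):
    r = np.asarray(monthly_portfolio_returns)
    return np.sqrt(12) * r.mean() / (r.std() + 1e-8)   # annualized Sharpe ratio as the RL reward
```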
Extension to Social Science Research: RL's signature is its exploration-exploitation trade-off, as in a multi-armed bandit problem. Exploitation means repeating decisions that have worked well so far; exploration means making novel decisions in the hope of even greater rewards. It would be interesting to study the influence of the agent's actions on the states of the problem and to use a probabilistic policy to strike a desirable trade-off. In trading, it would be interesting to examine the feedback effect of prices resulting from trading, which applies especially to high-frequency trading or market making.
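A minimal epsilon-greedy bandit loop makes the exploration/exploitation trade-off concrete; the arm payoff distributions below are invented purely for illustration.

```python
# Epsilon-greedy multi-armed bandit: exploit the best arm so far, but explore
# a random arm with probability epsilon. Arm payoffs here are made up.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.3])          # unknown to the agent
counts, estimates = np.zeros(3), np.zeros(3)

for t in range(10_000):
    if rng.random() < 0.1:                      # explore
        arm = int(rng.integers(3))
    else:                                       # exploit
        arm = int(np.argmax(estimates))
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean
```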
New Dataset Exploration: As I mentioned in the section above, data on high-frequency trades and quotes, as well as market order books, are worth looking into. Reinforcement learning can be applied to better execute trading decisions.
Dabney et al. (2020) A distributional code for value in dopamine-based reinforcement learning, Nature 577, 671-675
Brief summary of the article: This article shows that a distributional representation of reward (value), rather than a scalar representation, can be found in dopamine neurons. Distributional reinforcement learning is a recent framework in deep reinforcement learning in which models predict the expected reward of an action as a distribution rather than a scalar. This change improved the performance of deep reinforcement learning agents by a large margin (e.g., on the Atari benchmark). Inspired by this development, the researchers built a new model of reward prediction in the brain in which dopamine neurons differ in their level of optimism (e.g., an "optimistic" neuron has a high learning rate for positive prediction errors and a low learning rate for negative ones). This leads to a "distributional" representation of reward. The authors showed that this model closely resembles real data collected from rodent experiments.
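The optimism mechanism can be sketched directly: each value unit uses different learning rates for positive and negative prediction errors, so a population of such units ends up encoding a range of the reward distribution rather than a single mean. The learning rates and the reward distribution below are made up for illustration.

```python
# Sketch of distributional value learning with asymmetric learning rates:
# "optimistic" units update more on positive prediction errors, "pessimistic"
# units more on negative ones, so the population spans the reward distribution.
import numpy as np

rng = np.random.default_rng(1)
alphas_pos = np.linspace(0.02, 0.18, 9)   # learning rate for positive errors, per unit
alphas_neg = alphas_pos[::-1]             # pessimists mirror optimists
values = np.zeros(9)

for _ in range(50_000):
    reward = rng.choice([0.0, 1.0, 10.0], p=[0.5, 0.4, 0.1])   # made-up reward distribution
    delta = reward - values                                     # per-unit prediction errors
    step = np.where(delta > 0, alphas_pos, alphas_neg)
    values += step * delta

print(values)   # spreads from low (pessimistic) to high (optimistic) reward predictions
```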
Suggestion on how its method could be used to extend social science analysis: The article does not generalize directly to other social science fields, but I think the general lesson is that developments in deep learning can influence models in social science domains. Individuals and social structures often have features that are extremely efficient, even if no "logical" or "conscious" effort went into building them that way. By bringing in performance-driven innovations from the deep learning literature, we can sometimes find a model that explains social or individual phenomena well. That is the general lesson here, rather than the results themselves.
Describing what social data you would use to pilot such a use: Again, this study does not apply directly to social data per se, but there will likely be innovations in deep learning that can be applied to building models of diverse systems, especially complex systems. I am afraid I do not have a concrete social-data example for this right now.
Summary: Exploratory data analysis (EDA) is a critical component of quantitative research, but it is often a time-intensive task. El et al. leverage deep reinforcement learning to auto-generate EDA notebooks from input datasets, introducing a system they call ATENA. EDA is generally useful to the extent that it produces insightful results; the authors conceptualize EDA as a control problem and devise measures of notebook “interestingness”, diversity, and coherence, which are combined as a weighted sum into a reward signal. The authors evaluate their results against the “gold standard” of human-created EDA notebooks and conduct a survey to gather ratings of the auto-generated notebooks. While the ATENA notebooks do not reach the same levels of informativeness, comprehensibility, and expertise as human notebooks, they received an average rating of 5.4/7 versus 6.8/7 for the human notebooks.
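The reward described above is just a weighted combination of per-action scores. A schematic version, where the weights and the scoring functions are placeholders rather than the paper's exact definitions:

```python
# Schematic ATENA-style reward: a weighted sum of interestingness, diversity,
# and coherence scores for each EDA action. Weights/scorers are placeholders.
def eda_reward(action, session, interestingness, diversity, coherence,
               weights=(0.5, 0.3, 0.2)):
    w_i, w_d, w_c = weights
    return (w_i * interestingness(action, session)
            + w_d * diversity(action, session)
            + w_c * coherence(action, session))
```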
Social science extension: The ATENA framework is already designed with researchers in mind. In the social sciences, many publications hinge on the presentation of novel data. ATENA could serve as a means to augment data exploration and presentation in social science research. While it is evident that ATENA does not reach human standards of EDA quality, ATENA could certainly be leveraged to produce “idea generation” notebooks that can assist a researcher in EDA. This would be especially useful with high dimensional datasets where EDA is particularly time intensive.
Dataset: Research in international political economy (IPE) seeks to understand the political dimensions of the global economy. A volume of recent research in IPE has focused on explaining variation in firm participation in global trade. This results in high dimensional data sets that include both political and economic covariates at the firm, sector, and country levels — an example data set may include covariates regarding a firm’s balance sheet and market share in certain sectors, the magnitude of political events in countries around the world, and macro-level controls including aggregate trade flows, the presence of military alliances, shared languages, etc. EDA in various research projects has produced valuable insights for the field, and ATENA could aid researchers when exploring these complex data sets.
https://openreview.net/pdf?id=B1G6uM0WG
Tactical Decision Making for Lane Changing with Deep Reinforcement Learning
Summary: This paper uses a technique known as Q-masking in deep reinforcement learning to help automate lane changing in self-driving cars. Incorrectly timed lane changes naturally cause major problems on the highway, so this is an important problem. Q-masking examines only a subspace of the Q-values, selected by another module with access to prior information and specifications. The agent also takes information from a low-level controller that makes sure the car doesn't crash.
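The core of Q-masking is easy to show: before action selection, the Q-values of actions that violate a hard constraint (e.g., a lane change the low-level controller flags as unsafe) are masked out, so they can never be chosen or reinforced. A generic sketch, not the paper's implementation:

```python
# Generic Q-masking: invalid actions are removed before action selection, so
# the agent never picks (or learns from) actions that break hard constraints.
import numpy as np

def masked_greedy_action(q_values, valid_mask):
    """q_values: array of Q(s, a); valid_mask: boolean array, True where the
    action is allowed (e.g., a lane change judged safe by a low-level controller)."""
    masked_q = np.where(valid_mask, q_values, -np.inf)
    return int(np.argmax(masked_q))

# Example: three actions (keep lane, change left, change right); right change unsafe.
print(masked_greedy_action(np.array([0.2, 0.9, 1.5]), np.array([True, True, False])))  # -> 1
```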
Social Science Extension: This model doesn't seem directly generalizable to other social science fields. However, as-is, it could be useful for improving traffic prediction algorithms. Furthermore, the Q-masking technique seems useful for problems where one constraint absolutely NEEDS to be satisfied, so the Q-space is restricted to only those actions that satisfy it.
Dataset:
I want to try the Q-masking technique with Google Maps data (as that's what I'm working with for my final project), but I'm not confident how it would directly help. I think this is a higher-level technique: there isn't anything general it could be used for, but it is a very useful tool in the toolbox for specific higher-level tasks.
Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-Time Bidding by Reinforcement Learning in Display Advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). Association for Computing Machinery, New York, NY, USA, 661–670. DOI:https://doi.org/10.1145/3018661.3018702
Summary: This research devises an optimized sequential bidding strategy that allows the campaign budget to be dynamically allocated across all available impressions (usually known as budget pacing) on the basis of expected rewards. The authors formulate the bidding strategy as a reinforcement learning problem: the bidding environment is characterized by the multilateral auction information and the campaign's real-time status, and the bid price is the optimal reaction to the information available. The algorithm shows superior performance and high efficiency compared with state-of-the-art methods.
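To make the MDP formulation concrete, here is a hedged sketch of the bidding environment: the state tracks the auctions remaining, the budget left, and the predicted value of the current impression; the action is the bid price; the reward is the value of impressions won. The numbers, the second-price win model, and the naive linear-in-pCTR baseline policy are illustrative, not the paper's setup.

```python
# Hedged sketch of a real-time-bidding episode: state -> bid price -> reward.
# Distributions and the simplistic win model are illustrative only.
import numpy as np

rng = np.random.default_rng(2)

def episode(bid_policy, n_auctions=1000, budget=500.0):
    budget_left, clicks = budget, 0.0
    for t in range(n_auctions):
        pctr = rng.beta(2, 50)                      # predicted click-through rate of this impression
        state = (n_auctions - t, budget_left, pctr)  # auctions left, budget left, impression value
        bid = bid_policy(state)                      # the RL policy maps state -> bid price
        market_price = rng.gamma(2.0, 0.5)           # competing highest bid (unknown in advance)
        if bid >= market_price and budget_left >= market_price:
            budget_left -= market_price              # second-price payment
            clicks += pctr                           # expected clicks as reward
    return clicks

# A naive policy that bids proportionally to pCTR, as a baseline RL should beat.
print(episode(lambda s: 50.0 * s[2]))
```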
Social Science Extension: Using deep learning for real-time bidding opens the possibility of extending the canonical auction theory in microeconomics, especially in the public good provision mechanism, to a broader and digitalized domain.
Dataset: As auctions are usually examined from a theoretical perspective, it is possible to operationalize the research on simulation data. It is also possible to use real-world advertising bidding data.
More on mechanism design with insights from deep learning:
Learning to communicate with deep multi-agent reinforcement learning. Foerster, J. N., Assael, Y. M., De Freitas, N., & Whiteson, S. (2016). arXiv preprint arXiv:1605.06676.
Summary: The authors devise an experiment where autonomous RL agents are directed to maximize some notion of 'shared utility'. They design implicit frameworks of communication between agents and report on the types of behaviors 'learned' by RL agents.
They propose two approaches, Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL), to demonstrate different ways of using embeddings as information-sharing units. The former uses deep Q-learning, while the latter exploits the fact that agents can backpropagate error derivatives through (noisy) communication channels. A big contribution of this paper is the idea of 'centralised learning but decentralised execution'.
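A small sketch of DIAL's key idea: during centralised training, a real-valued message passes through a noisy, differentiable channel, so gradients from the listener's loss flow back into the speaker's network; at decentralised execution time the message is discretised. The noise level and the single-bit message are illustrative simplifications.

```python
# Sketch of a differentiable, noisy communication channel (DIAL-style).
import torch

def channel(message_logits, training=True, noise_std=0.5):
    if training:
        noisy = message_logits + noise_std * torch.randn_like(message_logits)
        return torch.sigmoid(noisy)              # differentiable: gradients cross agent boundaries
    return (message_logits > 0).float()          # decentralised execution: a discrete message

# Usage: the speaker emits message_logits, the listener consumes channel(logits)
# as part of its input, and the listener's TD error backpropagates through the
# channel into the speaker during centralised training.
```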
Social science extension: The agents are essentially attempting to learn the best information sharing metric that helps them vote on the right answer in various prediction or classification problems. A natural extension to this method would be an ensemble of knowledge graph embeddings models trained on varying datasets, attempting to learn how to make sense of a true/false prediction task on a test corpus of 'facts'. Each fact can be true or false, and the goal of the ensemble is to communicate their views (based on unequal training sets of facts) in order to correctly identify true facts from false ones.
Dataset: The FEVER dataset for fake news prediction is a social science dataset directly aligned with the fake news prediction problem. Other datasets like FB15k may be better suited to generalized multi-agent world-building experiments.
Post a reading of your own that uses deep learning for social science analysis and understanding, with a focus on deep reinforcement learning, deep agent based models, or related topics.