UChicago-Thinking-Deep-Learning-Course / Readings-Responses


Week 5 - Possibility Readings #11

Open bhargavvader opened 3 years ago

bhargavvader commented 3 years ago

Post a reading of your own that uses deep learning for social science analysis and understanding, with a focus on network, graph, or tabular data.

Raychanan commented 3 years ago

Title: DeepFool: a simple and accurate method to fool deep neural networks

Summary: In this work, the authors propose DeepFool, an algorithm for computing adversarial examples that fool state-of-the-art classifiers. It is based on an iterative linearization of the classifier to generate minimal perturbations that are sufficient to change classification labels. The authors provide extensive experimental evidence on three datasets and eight classifiers, showing that DeepFool outperforms existing methods for computing adversarial perturbations while remaining efficient. Because it estimates adversarial perturbations accurately, DeepFool offers an efficient and reliable way to evaluate the robustness of classifiers and to improve their performance through proper fine-tuning. It can therefore be used as a tool to accurately estimate minimal perturbation vectors and to build more robust classifiers.
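For intuition, below is a minimal sketch of the closed-form binary affine case that DeepFool builds on; for a deep network, the algorithm replaces w and b with the classifier's local gradient and value at the current point and repeats until the label flips. The function name and overshoot value are illustrative, not the authors' reference implementation.

```python
import numpy as np

def deepfool_linear_binary(x, w, b, overshoot=0.02):
    """Illustrative sketch: for an affine binary classifier f(x) = w.x + b,
    the minimal perturbation that flips the label is the orthogonal
    projection of x onto the decision hyperplane, r = -f(x) * w / ||w||^2."""
    f_x = np.dot(w, x) + b
    r = -(f_x / np.dot(w, w)) * w
    # a small overshoot pushes the point just past the boundary
    return x + (1 + overshoot) * r
```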

Suggesting how its method could be used to extend social science analysis: I believe this is a powerful tool for social scientists interested in exploring social networks. Because small changes to a few data points can substantially alter the results of social network research (in other words, some results are unstable), this technique could help researchers build and evaluate robust classifiers, so that only robust findings make it to publication.

Describing what social data you would use to pilot such a use: Many network datasets could be used for this test. I would apply the method to the face-to-face communication network datasets from the Stanford Large Network Dataset Collection.

k-partha commented 3 years ago

Modeling Tabular Data using Conditional GAN (NeurIPS 2019)

Summary: While traditional GANs have been successful at generating highly realistic images, they struggle to generate realistic tabular data. Generating synthetic tabular data is challenging because tables contain both continuous and discrete columns, and the continuous columns often have multiple modes. Discrete columns are often highly imbalanced across classes, and models may fail to capture sparsely distributed, low-frequency classes.

The authors devise a conditional tabular GAN that can model multimodal continuous data and discrete data in a single table. It outperforms traditional GANs as well as Bayesian methods at generating synthetic tables on two metrics: the likelihood fitness between the synthetic and real datasets, and machine learning efficacy (how well ML methods perform on real test data when trained on a synthetic dataset compared to when trained on the real training dataset).

The authors find that variational autoencoders outperform their model but highlight that VAEs have direct access to the dataset while the generator in the GAN does not - which is a salient feature when privacy is a priority.
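As a rough illustration of the conditional-generation idea (not the authors' released CTGAN code), a generator for mixed-type rows might look like the PyTorch sketch below; the class name, layer sizes, and Gumbel-softmax temperature are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalTabularGenerator(nn.Module):
    """Schematic generator in the spirit of a conditional tabular GAN: noise
    plus a one-hot condition vector (selecting a discrete category) is mapped
    to a row with both continuous and discrete (one-hot) columns."""
    def __init__(self, noise_dim, cond_dim, n_continuous, discrete_sizes):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.continuous_head = nn.Linear(256, n_continuous)
        self.discrete_heads = nn.ModuleList(
            [nn.Linear(256, k) for k in discrete_sizes])

    def forward(self, noise, cond):
        h = self.body(torch.cat([noise, cond], dim=1))
        cont = torch.tanh(self.continuous_head(h))            # scaled continuous columns
        disc = [F.gumbel_softmax(head(h), tau=0.2, hard=False)
                for head in self.discrete_heads]              # differentiable one-hot columns
        return torch.cat([cont] + disc, dim=1)
```

The conditional input is what allows training to oversample rare discrete categories, which is how the paper addresses class imbalance.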

Possible social science analysis: This method could be used to augment current datasets - increasing their size when needed and appropriate. An especially interesting application is creating highly realistic (but fake) datasets that mimic real-world social data containing sensitive personal information. These datasets could largely retain the generalizable social patterns of interest while adding protection to the anonymity of the data - making deanonymization harder and giving more room for industry-academia collaboration on sensitive social information.

Social data pilot: I would be very interested in whether augmenting real tabular datasets with synthetic data generated by this method improves performance on deep-learning-based prediction tasks, especially when we don't have massive amounts of data available - such as in the Twitter-based personality project I'm working on.

cytwill commented 3 years ago

Title: An Evaluation of Knowledge Graph Embeddings for Autonomous Driving Data: Experience and Practice

Summary: This paper uses knowledge graph embeddings (KGEs) to represent and better understand spatiotemporal data in autonomous driving. The authors use a scene ontology to structure the information in scene data and convert it into knowledge graphs. To evaluate the quality of different knowledge graph embeddings, they propose an evaluation framework with four components: datasets, KGE algorithms, knowledge graphs of varying informational levels, and quality metrics. For datasets, they use two large autonomous-driving datasets, NuScenes and Lyft. For KGE algorithms, they try three options, TransE, RESCAL, and HolE, which differ in how relations are represented in the training score function. They generate three knowledge graphs from the scene data: a base KG, a KG with inferred type relations, and a KG with additional "includes" relations, in ascending order of informational level; the higher-level KGs make more implicit information explicit and therefore contain more triples.

For quality metrics, instead of comparing embeddings via downstream tasks such as link prediction, as most research does, the authors borrow the idea of intrinsic evaluation from word embeddings and propose three metrics: a categorization measure, a coherence measure, and semantic transition distance. The underlying hypothesis is that entities with shared conceptual backgrounds should be positioned closer together in the embedding space. The results suggest that TransE performs best on the evaluation metrics and is stable across the two datasets. The coherence metric seems less helpful, since most methods score poorly on it across concept dimensions. Most importantly, the researchers find that embeddings generated from KGs with a higher informational level have better intrinsic quality and can be grouped into more meaningful clusters in visualization.
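To make the differences in the training score functions concrete, here are minimal sketches of the three scores; embedding dimensions, negative sampling, and the training loss are omitted, and the function names are my own.

```python
import numpy as np

def transe_score(h, r, t):
    # TransE: relations act as translations, score = -||h + r - t||
    return -np.linalg.norm(h + r - t)

def rescal_score(h, M_r, t):
    # RESCAL: each relation is a full matrix acting bilinearly on entity vectors
    return h @ M_r @ t

def hole_score(h, r, t):
    # HolE: the relation vector scores the circular correlation of head and tail
    corr = np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(t)).real
    return r @ corr
```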

Extension to Other Research: Several points from the process and outcomes of this research deserve attention. First, the evaluation framework and metrics proposed by the authors might be useful for evaluating embeddings generated by other deep learning methods, especially those analogous to word2vec (xx2vec). Second, the quality improvement brought by the added information points to the potential power of implicit information, which knowledge graph researchers should consider when building graphs. In social science, the intrinsic metrics above might be interpreted as measures of the consistency or diversity of concepts. If we had longitudinal KGs, it would be possible to show how people's ideas about certain concept entities, or even certain relations shared across these knowledge graphs, change over time.

New dataset exploration: Intuitively, these methods could be applied to more autonomous-driving scene data and other well-formed knowledge graphs. But I think there is also a chance to formalize image- and video-based data as tabular or structured textual data, which can then be represented as a knowledge graph - an idea inspired by the scene ontology used in this research. Similar scenarios might arise if we want to analyze people's behavior in video games, or to find trends and patterns in short-form video media (perhaps through entity-entity or entity-event links).

nwrim commented 3 years ago

Structural Deep Clustering Network. Bo et al. 2020. Proceedings of The Web Conference 2020.

  1. Brief summary of the article: Deep clustering, which typically focuses on learning an efficient and effective embedding/representation of the data through deep neural networks (e.g., autoencoders), performs far better than traditional clustering methods. Even though these methods already perform well, the authors suggest they can do even better by incorporating the structural information of the data alongside the powerful representations. To do this, they propose a method that combines autoencoders (which are great at learning representations of the data) and a Graph Convolutional Network (GCN; which is great at encoding structural, graph-like data) through a dual self-supervised mechanism (a schematic sketch of this idea appears after this list). By incorporating this structural information on top of the learned representation, they show that their model outperforms other (deep) clustering methods on six real-world datasets (three of them network-like).

  2. Suggestion on how its method could be used to extend social science analysis: Clustering is obviously an important topic in many social scientific inquiries; I've seen traditional methods like k-means in almost every social science discipline that does quantitative analysis. Even older deep clustering methods, which mainly focus on learning powerful embeddings and then applying more traditional clustering approaches to them, could be very useful for fields that do not use deep learning much. A method like this one, which additionally uses structural information, could be even more useful for social science problems, since most social scientific data have rich (whether explicit or not) structure within them. By using a more advanced clustering technique like this, we might find more nuance in the data (although it will be very, very hard to interpret).

  3. What social data I would use to pilot such a use: I would be very interested in applying this to a large-scale scholarship network database, perhaps citation network data. Such data innately has topological/structural properties, so the proposed technique might benefit more from that information. I am especially interested in whether this kind of clustering can detect differences between disciplines whose boundaries are "fuzzy" - for example, where do "computational social science" articles deviate from "computer science"?
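As referenced in item 1, here is a schematic sketch of the core idea of mixing the autoencoder's layer-wise representation into GCN propagation; the mixing weight and layer sizes are illustrative, and the paper's dual self-supervised loss is not shown.

```python
import torch
import torch.nn as nn

class AEGCNLayer(nn.Module):
    """Schematic layer in the spirit of SDCN: a GCN propagation step that
    mixes its own hidden state with the corresponding autoencoder layer's
    representation before propagating over the normalized graph."""
    def __init__(self, in_dim, out_dim, eps=0.5):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.eps = eps

    def forward(self, z_gcn, h_ae, adj_norm):
        # mix the structural (GCN) and representational (autoencoder) signals
        mixed = (1 - self.eps) * z_gcn + self.eps * h_ae
        # standard GCN propagation: ReLU(A_hat X W)
        return torch.relu(adj_norm @ self.weight(mixed))
```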

pcuppernull commented 3 years ago

Popov et al. 2019. Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data.

Summary: While deep neural networks are undoubtedly the state of the art for applications like computer vision and NLP, shallow models have continued to compete with deep networks on tabular data (at the time of writing). The authors propose a new deep architecture – Neural Oblivious Decision Ensembles (NODE) – which consistently outperforms shallow machine learning frameworks across a range of tabular data applications. NODE builds ensembles of oblivious decision trees, a type of restricted decision tree that is less prone to overfitting, and benefits from gradient-based optimization and hierarchical representation learning. NODE outperforms gradient-boosted decision trees in a variety of settings, and the authors have made their PyTorch implementation available online.
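As a rough sketch of what an oblivious tree looks like once it is made differentiable: every depth level applies one (soft) feature choice and threshold to all rows, giving 2^depth leaves. NODE itself uses entmax for sparse feature and threshold choice; the softmax/sigmoid relaxation below is a simplification of my own, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftObliviousTree(nn.Module):
    """Simplified sketch of one differentiable oblivious tree."""
    def __init__(self, n_features, depth, n_outputs):
        super().__init__()
        self.depth = depth
        self.feature_logits = nn.Parameter(torch.randn(depth, n_features))
        self.thresholds = nn.Parameter(torch.zeros(depth))
        self.leaf_values = nn.Parameter(torch.randn(2 ** depth, n_outputs))

    def forward(self, x):                                   # x: (batch, n_features)
        feat = torch.softmax(self.feature_logits, dim=-1)   # soft feature choice per level
        chosen = x @ feat.t()                               # (batch, depth)
        right = torch.sigmoid(chosen - self.thresholds)     # soft "go right" per level
        # probability of landing in each of the 2**depth leaves
        probs = torch.ones(x.size(0), 1, device=x.device)
        for d in range(self.depth):
            probs = torch.cat([probs * (1 - right[:, d:d + 1]),
                               probs * right[:, d:d + 1]], dim=1)
        return probs @ self.leaf_values                     # (batch, n_outputs)
```

NODE stacks many such trees into layers and ensembles them, so the whole model trains end-to-end with backpropagation.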

Social Science Extension: NODE presents a variety of opportunities for social science research. Most obviously, the majority of quantitative social science research is performed with tabular data. While advances in deep learning will undoubtedly encourage social science scholars to embrace alternative forms of data, tabular data will likely remain the dominant format for the foreseeable future. This means that NODE could become a mainstay deep architecture across a variety of social science research applications.

Data Proposal: Google Trends provides detailed tabular data on the volume of specific search terms. A researcher could leverage NODE to predict the implementation and removal of COVID-19 restrictions in U.S. states using data on the frequency of COVID-19 related search terms, such as “COVID test near me” or “COVID-19 symptoms”. A tabular data set could be constructed with observations at the state-day level, with the daily volume of a variety of relevant search terms populating the columns of the data set. A researcher could train the model to predict a variety of outcome variables based on historical data, including the implementation of state-wide restrictions, future case counts, or future positivity rates.
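A minimal pandas sketch of how such a state-day table could be assembled; the file names, column names, and label file are hypothetical, and the Google Trends exports would need to be downloaded and normalized separately.

```python
import pandas as pd

# Hypothetical long-format export of Google Trends volumes:
# columns: state, date, search_term, volume
trends = pd.read_csv("trends_covid_terms.csv", parse_dates=["date"])

# Pivot to one row per state-day with one column per search term
features = (trends
            .pivot_table(index=["state", "date"],
                         columns="search_term",
                         values="volume",
                         aggfunc="mean")
            .reset_index())

# Hypothetical outcome file: whether a state-wide restriction was in force that day
labels = pd.read_csv("state_restrictions.csv", parse_dates=["date"])
dataset = features.merge(labels, on=["state", "date"], how="inner")
```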

Yilun0221 commented 3 years ago

Title: Sequential Deep Learning for Credit Risk Monitoring with Tabular Financial Data

Summary: In this paper, the researchers employ a "sequential deep learning model" to predict credit risk, together with a new credit card transaction sampling strategy, as a foundation for improving credit scoring systems and reducing financial losses. They use tabular data organized from credit card transactions, where each observation/transaction has 127 features related to the account or the transaction itself; personal data are excluded due to privacy concerns. The data are split into training and validation sets. To reduce noise and build sequences in a time-series manner, the researchers select, for each cardholder, samples whose transaction records are not "sparse" at each timestamp. On the processed data, four deep learning models are used for classification: a multilayer perceptron, TabNet, a Recurrent Neural Network, and a Temporal Convolutional Network. The most recent month's transaction records are used, and performance is compared with the results from a GBDT model. Four aspects of hyperparameter tuning are covered, including architecture search, optimization, batch size and learning rate, and performance metrics. The models, and the ensemble of each model with GBDT, achieve high Gini scores, but recall is much lower. The models do better on the overall data than on the high-debt subset. The researchers plan to continue testing the models on sub-groups of cardholders to distinguish performance across different customers.
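A sketch of the kind of preprocessing this setup implies, i.e., turning per-cardholder transaction tables into fixed-length sequences that an RNN or TCN can consume; the column names and sequence length below are assumptions, not the paper's schema.

```python
import numpy as np
import pandas as pd

def build_sequences(df, account_col="account_id", time_col="timestamp",
                    feature_cols=None, seq_len=30):
    """Group transactions by account, sort by time, and keep the most recent
    `seq_len` rows per account (padding shorter histories with zeros), so the
    result has shape (n_accounts, seq_len, n_features)."""
    feature_cols = feature_cols or [c for c in df.columns
                                    if c not in (account_col, time_col)]
    sequences = []
    for _, g in df.sort_values(time_col).groupby(account_col):
        x = g[feature_cols].to_numpy(dtype=np.float32)[-seq_len:]
        pad = np.zeros((seq_len - len(x), len(feature_cols)), dtype=np.float32)
        sequences.append(np.vstack([pad, x]))
    return np.stack(sequences)
```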

Expansions to social science analysis: Scientifically, I think the models would perform better if demographic data were included, since demographics are closely related to people's consumption habits. This line of exploration could be used not only in other financial systems but also in other macroeconomic fields, so that governments could use these models to design more scientific and accurate policies.

New dataset exploration: I think the models could also be used to improve government macro-level management of public service funding. For example, data about people's willingness to have children, together with the ground-truth number of children in different families, could help policy-makers design better childcare and education policies.

bakerwho commented 3 years ago

Bianchi, Federico, Gaetano Rossiello, Luca Costabello, Matteo Palmonari, and Pasquale Minervini. 2020. “Knowledge Graph Embeddings and Explainable AI.” ArXiv:2004.14843 [Cs], April. https://doi.org/10.3233/SSW200011.

Summary This 2020 paper is an excellent summary of knowledge graph embedding (KGE) algorithms and the unique value they add to explainability in algorithmic models. The paper covers translational models like TransE, bilinear models like DistMult, HolE, ComplEx, and RESCAL, neural methods like ConvE, and many more. It also discusses how knowledge graphs operationalize relationships in very specific ways that are ripe for mathematical interpretation, and it has an extensive section on the limitations of such approaches.

The CoKE algorithm, which learns 'contextualized' embeddings for entities and relationships, is particularly interesting for its social science applications.

Social science extension It would be interesting to explore Knowledge Graph Embeddings in social use cases, such as ones trained on parsed news data. Can embeddings of facts reflect political bias? Can embeddings form their own echo chambers? Can embeddings be used for contextual fact checking?

New dataset exploration If we could use other methods like AutoKG to parse entity-relationship triples from unstructured news data, we would have the whole world of KGE methods available to experiment with. For now, we are restricted to subsets such as the Freebase dataset, or others that are parsed from Wikipedia.

jsoll1 commented 3 years ago

Transfer Learning with Graph Neural Networks for Short-Term Highway Traffic Forecasting. (Preprint). Tanwi Mallick, Prasanna Balaprakash, Eric Rask, and Jane Macfarlane. https://arxiv.org/pdf/2004.08038.pdf

Summary: This paper explores a new transfer learning method for using a model trained on one part of a highway network on other parts. Specifically, they use a Diffusion Convolutional Recurrent Neural Network (DCRNN) and adapt a transfer learning approach in which models trained on parts of the highway with plenty of traffic data are applied to other parts. This is an exciting extension of the method for extrapolation within a highway traffic forecasting system.
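A generic pretrain-then-adapt sketch of this kind of transfer setup; the paper applies a model trained on data-rich segments to other segments, and the fine-tuning step below is only one common variant. The model, data loaders, and head name here are placeholders, not the authors' code.

```python
import torch

def transfer(model, source_loader, target_loader, head_name="head", epochs=10):
    """Pretrain a forecaster on a data-rich region, then fine-tune only its
    output head on the data-poor target region."""
    loss_fn = torch.nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                        # pretrain on the source region
        for x, y in source_loader:
            opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
    for name, p in model.named_parameters():       # freeze everything but the head
        p.requires_grad = name.startswith(head_name)
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _ in range(epochs):                        # adapt to the sparse target region
        for x, y in target_loader:
            opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()
    return model
```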

Social Science Extension: In social science problems there are frequently limits to the data one can collect. I remember classmates working with Twitter profiles where a lot of data exists for users who list their MBTI type but not their Five-Factor Model (FFM) traits. Perhaps this kind of transfer learning, even if not with a convolutional architecture, could allow predictions for types of users with less content.

New dataset exploration: This is especially exciting for me since William and I are considering doing our final project in this class with traffic and Google Street View data. We could use this technique to severely cut down on the data we have to process if we choose to estimate expected traffic on routes. In that case we'd apply it to a dataset of Google Street View images linked to their locations and the number of cars that object detection algorithms find in them.

william-wei-zhu commented 3 years ago

Title: In-Depth Analysis of Railway and Company Evolution of Yangtze River Delta with Deep Learning

Summary: This project combines geographic company-registry data with railway construction data in the Yangtze River Delta to build deep learning models and conduct correlation analysis over time and space. The research found that the construction and renovation of railways and railway stations contributed greatly to increases in company density in the surrounding regions.

Social Science extension: It would be interesting to compare the impact of railway construction and highway construction on economic growth in local regions.

New data: data on highway construction in China. Compare the impact of railways and highways on economic activity in China with that in other countries.

hesongrun commented 3 years ago

The Knowledge Graph for Macroeconomic Analysis with Alternative Big Data https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3707964

Summary: This paper builds a knowledge graph of the linkages between traditional economic variables and alternative data variables. The RDF triples are extracted from academic literature and industry reports following a systematic approach: (1) make a list of aggregate variables of interest together with their variants; (2) find these variables and their variants in the documents via string matching; (3) for each aggregate variable detected, find all the other variables around it, as well as the relations among the aggregate variable and those variables; and (4) represent all extracted variables and relations as RDF triples. Because named entity recognition is very hard for economic variables, the authors developed an active learning algorithm with human involvement to extract variable entities and relation keywords from the text. Guided by the knowledge graph, the authors were better able to select variables for predicting China's monthly inflation rate and nominal investment time series over the long term.
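A toy sketch of steps (2)-(4), using a hand-made lexicon and simple sentence-level co-occurrence in place of the paper's active-learning entity and relation extraction; the variable names and relation label are illustrative.

```python
import re

# Illustrative variable lexicon: aggregate variables and their variants
LEXICON = {
    "inflation": ["inflation", "cpi", "consumer price index"],
    "investment": ["investment", "fixed asset investment"],
    "electricity_consumption": ["electricity consumption", "power consumption"],
}

def extract_cooccurrence_triples(sentences, relation="associated_with"):
    """Find aggregate variables by string matching and emit an RDF-style
    triple for every pair that co-occurs in a sentence. The paper's learned
    relation keywords are replaced here by a fixed relation label."""
    triples = set()
    for sent in sentences:
        low = sent.lower()
        found = [var for var, variants in LEXICON.items()
                 if any(re.search(r"\b" + re.escape(v) + r"\b", low) for v in variants)]
        for i, a in enumerate(found):
            for b in found[i + 1:]:
                triples.add((a, relation, b))
    return triples
```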

Social Science Extension: With the extracted knowledge graph, it would be interesting to study some of the mechanisms behind the transmission of certain economic policies and to cross-validate those findings with traditional model-based economic theories. Nobel laureate Robert Shiller put forward the idea of narrative economics; it would be interesting to look at the time-series dynamics of the knowledge graph to see whether it captures the evolution of narratives among economic agents, and whether there is a feedback loop between narratives and real economic outcomes.

New Data: The authors use academic papers and industry reports, which are quite novel sources. One could also consider the outlooks published by World Bank or IMF economists to take a global perspective.