JunsolKim opened 2 years ago
It is an inspiring paper that uses a bunch of advanced deep learning techniques to deal with unstructured data in a philanthropic online market. The paper explains the rationale for selecting models in detail; however, it is unclear to me how the authors link the description data to their conclusion that "forming group loans has a significant treatment effect on funding time." I also wonder whether there are any linguistic patterns or social games that could be analyzed further from the description data (such as the borrower's gender, age, or a detailed versus abstract plan for the loan).
In this paper the authors chose to use the Wikipedia pre-trained GloVe vectors rather than creating the embeddings from the corpus. What are the pros and cons of this decision, and how should we decide when to use pre-trained vectors and when to train the vectors on the data in question?
In terms of the causal inference setting, the authors argue that unconfoundedness holds by stating: "If this endogeneity is large, then everyone would think forming group loans will speed up the funding process and everyone will tend to do so (following the practice in traditional microfinance); we know that this is not the case here. Hence, we can assume that unconfoundedness holds." I am still confused by this statement. For one thing, the lack of evidence that everyone forms group loans is not sufficient to show that endogeneity is small. For another, it is very hard to control for every x that may correlate with both the potential outcomes and the treatment. Therefore, maybe using propensity score matching would be a better idea?
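To make the propensity score matching suggestion concrete, here is a minimal sketch on synthetic data (not the paper's data or method): estimate P(treatment | x) with a hand-rolled logistic regression, then match each treated unit to the control with the nearest score. All names and the data-generating process are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity(X, T, lr=0.1, epochs=500):
    """Estimate the propensity score P(T=1 | x) with plain logistic regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, ti in zip(X, T):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = ti - p  # gradient of the log-likelihood
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return lambda x: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def match_att(X, T, Y, score):
    """ATT estimate: match each treated unit to the nearest-score control."""
    treated = [i for i, t in enumerate(T) if t == 1]
    control = [i for i, t in enumerate(T) if t == 0]
    diffs = []
    for i in treated:
        j = min(control, key=lambda c: abs(score(X[c]) - score(X[i])))
        diffs.append(Y[i] - Y[j])
    return sum(diffs) / len(diffs)

random.seed(0)
# Synthetic confounded data: treatment is more likely for high-x units, and
# the outcome (think "funding time") depends on both x and the treatment.
X = [[random.gauss(0, 1)] for _ in range(200)]
T = [1 if random.random() < sigmoid(2 * x[0]) else 0 for x in X]
Y = [3 * x[0] - 2 * t + random.gauss(0, 0.1) for x, t in zip(X, T)]

score = fit_propensity(X, T)
est = match_att(X, T, Y, score)  # should recover a negative treatment effect
```

A naive treated-vs-control mean difference on this data would be badly biased by the confounder x; matching on the estimated score removes most of that bias.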
Thanks!
While reading the paper, I wondered why the authors did not extract additional information from the description (e.g., text length, sentiment, and the borrowers' personal background), which could possibly serve as covariates. Also, I would appreciate clarification on the feature selection and causal inference setup of the paper; it seems a bit arbitrary to me.
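The description-based covariates suggested above are cheap to compute. A minimal sketch, where the sentiment word lists are invented stand-ins (not from the paper) for illustration only:

```python
# Hypothetical word lists; a real analysis would use a sentiment lexicon.
POSITIVE = {"hope", "grow", "improve", "support", "success"}
NEGATIVE = {"debt", "struggle", "poor", "risk", "fail"}

def description_covariates(text):
    """Return simple text covariates: length, word count, crude sentiment."""
    words = text.lower().split()
    pos = sum(w.strip(".,!?") in POSITIVE for w in words)
    neg = sum(w.strip(".,!?") in NEGATIVE for w in words)
    return {
        "char_len": len(text),
        "word_count": len(words),
        "sentiment": (pos - neg) / max(len(words), 1),
    }

cov = description_covariates("We hope to grow our small shop despite debt.")
```

Each loan description would then contribute a small vector of controls alongside the learned text representation.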
> In this paper the authors chose to use the Wikipedia pre-trained GloVe vectors rather than creating the embeddings from the corpus. What are the pros and cons of this decision, and how should we decide when to use pre-trained vectors and when to train the vectors on the data in question?
I think the pros of using a pre-trained model are:
- The vectors were trained on a much larger corpus (Wikipedia), so they capture general word semantics that a small, domain-specific corpus cannot learn reliably.
- There is no embedding-training cost, and the vectors are fixed and reproducible.
The cons of a pre-trained model are:
- Domain-specific terms in the loan descriptions may be out of vocabulary, or may carry different meanings than they do in Wikipedia text.
- The embeddings cannot adapt to the particular usage patterns of this corpus.
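The out-of-vocabulary con above can be sketched directly. Here the tiny two-dimensional table is a made-up stand-in for real pre-trained GloVe vectors (which are 50-300 dimensional and downloaded separately); the point is only how average-pooling behaves when a domain term is missing:

```python
# Hypothetical stand-in for a Wikipedia-trained GloVe vocabulary.
PRETRAINED = {
    "loan": [0.2, 0.1],
    "group": [0.4, -0.3],
    "farm": [0.0, 0.5],
}

def embed(tokens, table):
    """Average-pool word vectors, tracking out-of-vocabulary tokens."""
    hits = [table[t] for t in tokens if t in table]
    oov = [t for t in tokens if t not in table]
    if not hits:
        return None, oov
    dim = len(hits[0])
    return [sum(v[d] for v in hits) / len(hits) for d in range(dim)], oov

# A platform-specific term like "kiva" falls outside the general-purpose
# vocabulary, so it is silently dropped from the pooled representation.
vec, oov = embed(["kiva", "group", "loan"], PRETRAINED)
```

Training embeddings on the corpus itself would cover such terms, but only if the corpus is large enough to learn good vectors, which is the core trade-off behind the question.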
Post questions here for this week's exemplary readings: 2. Pham, Thai T. and Yuanyuan Shen. 2017. "A Deep Causal Inference Approach to Measuring the Effects of Forming Group Loans in Online Non-profit Microfinance Platform." arXiv preprint arXiv:1706.02795.