We provide a depression-labeled dataset of Twitter posts in multiple languages (Korean, English, and Japanese). The data were collected using the Twitter APIs with a community-based random sampling approach and consist of 921k tweets from Korean users, 10M tweets from English users, and 15M tweets from Japanese users. Each language's dataset was labeled as depression (1) or non-depression (0) using a depression lexicon compiled from prior studies on detecting depression on social media. In addition, we applied our model to a specific group (e.g., a university community) so that it detects depression not only in general social media posts but also in posts from specific groups.
Figure: From data collection to classification
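The lexicon-based labeling step described above can be sketched roughly as follows. This is a minimal illustration, not the exact procedure used to build the datasets: the lexicon entries, the function name `label_post`, and the simple word-level matching are all assumptions made for the example. In particular, regex word tokenization is only reasonable for English; Korean and Japanese text would likely need a language-specific tokenizer or substring matching.

```python
import re

# Hypothetical lexicon entries for illustration only; the actual depression
# lexicon was compiled from prior studies and is not reproduced here.
EXAMPLE_LEXICON = {"depressed", "hopeless", "worthless"}


def label_post(text, lexicon=EXAMPLE_LEXICON):
    """Return 1 (depression) if any lexicon term appears in the post, else 0."""
    # Lowercase and split into word tokens (an assumption suited to English).
    tokens = set(re.findall(r"\w+", text.lower()))
    return 1 if tokens & set(lexicon) else 0


posts = [
    "I feel hopeless and tired",
    "Great weather for a walk today",
]
labels = [label_post(p) for p in posts]
print(labels)
```

A set intersection keeps the lookup simple and fast for single-word terms; multi-word lexicon phrases would need a different matching strategy.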
Our sampled Twitter and Everytime datasets are in the data folder, as GitHub limits the size of files allowed in repositories. We are only allowed to distribute the data for research purposes; if you want to obtain the full datasets, please complete the request form (to be updated later).
We employed a deep learning framework to classify depression posts on general social media (Twitter) and in a university community (Everytime). Within the models folder, we uploaded binary classification models for each language. We achieved F1-scores ranging from 99.39% to 99.66% across languages in detecting depression posts on general social media, and from 72.80% to 99.12% for the university community in South Korea. In addition, to assess the generality of the Twitter dataset, we trained all employed models on the Twitter dataset and then tested them on the Everytime dataset. The BERT-based classification model reported the highest F1-score, 64.51%.
Within the model folder, we uploaded classification models to detect depression posts.
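For readers who want to compare their own predictions against the reported numbers, the F1-score for a binary depression classifier can be computed as in the sketch below. This is an illustrative helper, not the evaluation script used for the results above; the function name `f1_score_binary` and the toy labels are assumptions, and treating depression (1) as the positive class is also an assumption, since the averaging scheme behind the reported scores is not stated here.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Compute the F1-score for the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (1 = depression, 0 = non-depression).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
f1 = f1_score_binary(y_true, y_pred)
print(f"F1-score: {f1:.4f}")
```

For the cross-domain setting, the same function would be applied to predictions from a model trained on Twitter data and evaluated on Everytime posts.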