hosseinfani opened 2 years ago
@M-MoeedKhalid could you fill out the missing stats?
@rezaBarzgar
1. When the code for the stats is done, please push it and link it to this issue.
2. Also, create a readme.md file in the ./data folder and put a similar table there, including the links, etc. Later we will refer to the readme for anything related to the dataset.
When done, let me know so we can safely close this issue.
@hosseinfani Sure, will do that tomorrow and let you know.
@hosseinfani
- I completed the `get_stats()` function in main.py.
- Also added the readme and data-sandbox in ./data. Thank you.
- I believe the data has many bugs. For example, there are conversations that have `tagged_conv == 1` & `tagged_msg == 1` but no predator in them; i.e., some conversations have incorrect labels. In addition, there is only one author labeled as a predator in each of the train and test sets. That doesn't sound correct to me.
Are you sure? This is for the toy set.
@hosseinfani Hi, there was a bug in loading the predators' ids into a dataframe. I fixed it, and now the data is extracted without any problem. I will push the correct code and update the stats.
@rezaBarzgar thanks. I left an inline comment here; please have a look.
Hi! Recently, we discovered some issues with reading the data and fixed them. Since the extracted dataset has changed, I want to update the stats retrieved from the newly processed dataset. I got the stats for the train and test sets separately, and here we go:
NOTE: the 'n' in some of the rows refers to the number of authors.

| Stat | Train | Test |
|---|---|---|
| number of chatters | 97689 | 218702 |
| number of predator chatters | 142 | 254 |
| number of conversations | 66927 | 155128 |
| number of messages | 903607 | 2058781 |
| number of predatory conversations | 2016 | 3737 |
| average conversations per predator | 14.197 | 14.712 |
| average number of messages per conversation, n authors == 1 | 2.357 | 2.395 |
| average number of messages per conversation, n authors == 2 | 11.633 | 11.187 |
| average number of messages per conversation, n authors > 2 | 40.578 | 40.786 |
| average number of messages per predatory conversation, n authors == 1 | 1.744 | 1.944 |
| average number of messages per predatory conversation, n authors == 2 | 71.061 | 64.708 |
| average number of messages per predatory conversation, n authors > 2 | 81.200 | NaN |
| number of conversations, n authors == 1 | 12773 | 29561 |
| number of conversations, n authors == 2 | 45741 | 105862 |
| number of conversations, n authors > 2 | 8413 | 19705 |
| number of predatory conversations, n authors == 1 | 923 | 1850 |
| number of predatory conversations, n authors == 2 | 1088 | 1887 |
| number of predatory conversations, n authors > 2 | 5 | 0 |
Please let me know if anything needs explanation. These stats were generated using the code in this commit.
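For reference, a minimal sketch of how stats like the per-author-count rows above could be computed with a pandas groupby. The column names (`conv_id`, `author_id`) and the toy data are assumptions, not the actual schema of the project's dataframe:

```python
import pandas as pd

# Hypothetical message-level frame: one row per message.
df = pd.DataFrame({
    'conv_id':   ['c1', 'c1', 'c2', 'c2', 'c2', 'c3'],
    'author_id': ['a1', 'a2', 'a3', 'a3', 'a4', 'a5'],
})

# Number of distinct authors and messages per conversation.
per_conv = df.groupby('conv_id').agg(
    n_authors=('author_id', 'nunique'),
    n_msgs=('author_id', 'size'),
)

# Bucket conversations by author count: 1, 2, or >2.
bucket = per_conv['n_authors'].map(lambda n: '1' if n == 1 else ('2' if n == 2 else '>2'))

# Conversation counts and average messages per bucket,
# mirroring the "n authors == 1 / == 2 / > 2" rows of the table.
stats = per_conv.groupby(bucket).agg(
    n_convs=('n_msgs', 'size'),
    avg_msgs=('n_msgs', 'mean'),
)
print(stats)
```

The same grouping, restricted to rows with `tagged_conv == 1`, would give the predatory-conversation rows.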
I analyzed the stats and the dataset for some insights (this will probably be updated soon).
I set a threshold and checked how many unique predator authors we have for conversations of different sizes. I also checked whether some predators only had conversations with a few messages and never appeared in long conversations; the answer was yes, there are some. These are their author ids (author_id column):
'35d61fe88c3572f11a577e7a04be2140' '492bece78953e94ea30ac194609a16d6' '53c62668407d0f5a068a42903fd98984' '5904488cf6bfcd01beaf225ac00efd99' '7b38314806035fbc0d66afbf5018d975' 'ab2fc95662942aa8d03c1da3e7374fd2' 'd2f9cb5682214911bc17888ca80521ee' 'd50f114dde2edb12b72ecea83ebf63ce'
These author ids are from both the training and test sets. We cannot be sure how these conversations were generated, but my best guess is that some predatory conversations were sliced due to their size or the delay between messages, and then the author id of the predator (and maybe of all chatters) was changed in these conversations between the same authors.
The PAN12 dataset is extremely imbalanced. According to the table in the first comment of this issue, the ratios of predatory to all labelled conversations in the train, test, and combined train-test sets are 3.012%, 2.408%, and 2.590%, respectively.
Grouping the data by the number of authors in a conversation reveals significant insights. The ratio of single-author conversations to all conversations is around 20% for all three datasets. Additionally, conversations with 1 author average about 2.3 messages, versus about 11.3 for 2 authors. For predatory conversations these stats change a lot, and single-author conversations clearly matter much less; based on this, we can omit records with fewer than 2 chatters. Also, the number of predatory conversations with more than 2 authors is drastically smaller than in the other two categories, so we can rule out conversations with more than 2 authors as non-predatory. Therefore, we will only feed conversations between exactly two chatters to the model.
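The filter described above can be sketched in a few lines of pandas. The column names and toy data are assumptions for illustration:

```python
import pandas as pd

# Toy message-level frame; one row per message.
df = pd.DataFrame({
    'conv_id':   ['c1', 'c1', 'c2', 'c3', 'c3', 'c3'],
    'author_id': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
})

# Distinct-author count per conversation, broadcast back to each message row.
n_authors = df.groupby('conv_id')['author_id'].transform('nunique')

# Keep only conversations with exactly two chatters.
two_party = df[n_authors == 2]
print(two_party['conv_id'].unique())
```

Here `c2` (one author) and `c3` (three authors) are dropped, leaving only the two-chatter conversation `c1`.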
I accidentally noticed that the text of many messages contains some author id. I want to preprocess them, either by removing them or by changing them to a special token like `<author_id>`. I think the latter would result in less data loss since, as far as I know, these author ids mean the corresponding users are mentioned in the text.
I want to know your idea too.
@hamedwaezi01 agree. just change them to a special token like you said.
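A minimal sketch of that replacement. The ids listed earlier in this thread look like 32-character hex hashes, so the regex below assumes that format; the function name is hypothetical:

```python
import re

# Assumption: author ids are 32-char lowercase hex strings,
# as in the ids listed earlier in this issue.
AUTHOR_ID_RE = re.compile(r'\b[0-9a-f]{32}\b')

def mask_author_ids(text: str, token: str = '<author_id>') -> str:
    """Replace any embedded author id with a single special token."""
    return AUTHOR_ID_RE.sub(token, text)

msg = 'hey 35d61fe88c3572f11a577e7a04be2140 are you there?'
print(mask_author_ids(msg))  # hey <author_id> are you there?
```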
Per our last conversation, I'd like to suggest that we drop some of the types from our feature vectors (garbage types and tokens). The main criterion would be token frequency: say, only the 10K most frequent types are considered in the feature vector. This approach leads to faster computation and a smaller model. One might say this filter causes information loss, but a non-frequent token (for example, one that occurred only twice) does not have much to say. Additionally, we can force the encoder to consider tokens that the train dataset suggests are important, for example the numbers 5 to 18, which are the ages of teenagers and kids.
@hamedwaezi01 I agree. Also, as long as the effect is the same across baselines (improves all of them), that's ok. However, if the effect is positive on your model but negative on others, that's not fair.
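The frequency cutoff plus forced tokens could be sketched like this; the function name, cutoff, and toy data are assumptions, not the project's actual code:

```python
from collections import Counter

def build_vocab(token_stream, max_size=10_000, forced=()):
    """Keep the most frequent types, plus tokens forced in regardless
    of frequency (e.g. the ages 5..18 mentioned above)."""
    counts = Counter(token_stream)
    vocab = set(forced)              # forced tokens are always in
    for tok, _ in counts.most_common():
        if len(vocab) >= max_size:   # stop once the budget is spent
            break
        vocab.add(tok)
    return vocab

tokens = ['hi', 'hi', 'asl', 'hi', 'asl', 'zzz']
forced_ages = [str(a) for a in range(5, 19)]  # '5'..'18', 14 tokens
vocab = build_vocab(tokens, max_size=16, forced=forced_ages)
print(sorted(vocab))
```

With `max_size=16` and 14 forced age tokens, only the two most frequent corpus types ('hi', 'asl') fit; the rare 'zzz' is dropped.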
I want to list a number of conversation-level features here, implemented or not, just to keep track of them (this will be updated over time, hopefully):
PAN: what does this stand for?
https://pan.webis.de/clef12/pan12-web/sexual-predator-identification.html
https://pan.webis.de/downloads/publications/papers/inches_2012.pdf
To download the PAN12 dataset for the sexual predator identification problem, use this link. You have to request access, and it might take a few days, so keep that in mind if you are in a hurry.
Labels for Test Set:
- ids of predators (one per line): pan12-sexual-predator-identification-groundtruth-problem1.txt
- suspicious (of a perverted behavior) messages, given as ids of conversations and the message line# in the conversation: pan12-sexual-predator-identification-groundtruth-problem2.txt
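Minimal parsers for those two groundtruth files. The filenames come from the list above; the exact line layout of the problem2 file (conversation id and line number separated by whitespace) is an assumption based on its description:

```python
def parse_predator_ids(lines):
    """problem1 file: one predator id per line."""
    return {line.strip() for line in lines if line.strip()}

def parse_suspicious_messages(lines):
    """problem2 file (assumed layout): '<conversation_id> <line#>' per line."""
    pairs = []
    for line in lines:
        if line.strip():
            conv_id, line_no = line.split()
            pairs.append((conv_id, int(line_no)))
    return pairs

# Toy inputs standing in for the real files.
preds = parse_predator_ids(['abc123\n', 'def456\n'])
msgs = parse_suspicious_messages(['conv1 3\n', 'conv2 7\n'])
print(preds, msgs)
```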
Labels for Train Set:
Stats:
NOTES: