fani-lab / Osprey

Online Predatory Conversation Detection

PAN Dataset Files Structure #3

Open hosseinfani opened 2 years ago

hosseinfani commented 2 years ago

PAN: what does this stand for?

https://pan.webis.de/clef12/pan12-web/sexual-predator-identification.html https://pan.webis.de/downloads/publications/papers/inches_2012.pdf

To download the PAN12 dataset for the sexual predator identification problem, use this link. You have to request access, and approval may take a few days, so keep that in mind if you are in a hurry.

Labels for Test Set:

Labels for Train Set:

Stats:

NOTES: in the table below, n is the number of authors in a conversation and m is the number of messages.

| Stat | Train | Test | Test ∪ Train |
|---|---:|---:|---:|
| average conversations per predator | 14.197 | 14.713 | 28.910 |
| average messages per conversation, n == 1 | 2.357 | 2.395 | 4.752 |
| average messages per conversation, n == 2 | 11.633 | 11.187 | 22.820 |
| average messages per conversation, n > 2 | 40.579 | 40.786 | 81.365 |
| average messages per predatory conversation, n == 1 | 1.744 | 1.945 | 3.689 |
| average messages per predatory conversation, n == 2 | 71.062 | 64.708 | 135.770 |
| average messages per predatory conversation, n > 2 | 81.200 | NaN | NaN |
| number of chatters | 97689 | 218702 | 316391 |
| number of conversations | 66927 | 155128 | 222055 |
| number of conversations, n == 1 | 12773 | 29561 | 42334 |
| number of conversations, n == 2 | 45741 | 105862 | 151603 |
| number of conversations, n > 2 | 8413 | 19705 | 28118 |
| number of conversations, m <= 1 | 7289 | 16712 | 24001 |
| number of conversations, m == 2 | 2644 | 6080 | 8724 |
| number of conversations, m == 3 | 13133 | 30711 | 43844 |
| number of conversations, m == 4 | 18302 | 42501 | 60803 |
| number of conversations, m >= 5 | 25559 | 59124 | 84683 |
| number of messages | 903607 | 2058781 | 2962388 |
| number of predatory chatters | 142 | 254 | 396 |
| number of predatory conversations | 2016 | 3737 | 5753 |
| number of predatory conversations, n == 1 | 923 | 1850 | 2773 |
| number of predatory conversations, n == 2 | 1088 | 1887 | 2975 |
| number of predatory conversations, n > 2 | 5 | 0 | 5 |
| number of predatory conversations, m <= 1 | 592 | 1132 | 1724 |
| number of predatory conversations, m == 2 | 221 | 400 | 621 |
| number of predatory conversations, m == 3 | 110 | 208 | 318 |
| number of predatory conversations, m == 4 | 46 | 106 | 152 |
| number of predatory conversations, m >= 5 | 1047 | 1891 | 2938 |
hosseinfani commented 2 years ago

@M-MoeedKhalid could you fill out the missing stats?

hosseinfani commented 1 year ago

@rezaBarzgar 1- When the code for the stats is done, please push it and link it with this issue. 2- Also, create a readme.md file in the ./data folder and put a similar table there, including the links, etc. Later we will refer to the readme for anything related to the dataset.

When done, let me know so we can safely close this issue.

rezaBarzgar commented 1 year ago

@hosseinfani Sure, will do that tomorrow and let you know.

rezaBarzgar commented 1 year ago

@hosseinfani

hosseinfani commented 1 year ago

@rezaBarzgar Thank you.

> @hosseinfani
>
> Thank you.
>
> • I believe the data has many bugs. For example, there are conversations with `tagged_conv == 1 & tagged_msg == 1` but no predator in them; in other words, some conversations have incorrect labels. In addition, there is only one author labeled as a predator in each of the train and test sets, which doesn't sound correct to me.

Are you sure? Is this for the toy set?

rezaBarzgar commented 1 year ago

@hosseinfani Hi, there was a bug in loading the predators' ids into a dataframe. I fixed it, and now the data is extracted without any problems. I will push the corrected code and update the stats.
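For reference, a minimal sketch of loading the predator ids into a dataframe; the filename and the one-id-per-line format are assumptions based on the PAN12 ground-truth files, not necessarily the repo's actual code:

```python
import pandas as pd

def load_predator_ids(path='pan12-sexual-predator-identification-groundtruth-problem1.txt'):
    with open(path) as f:
        # strip whitespace and drop empty lines so no id is lost or duplicated
        ids = [line.strip() for line in f if line.strip()]
    return pd.Series(ids, name='author_id')

# usage: flag each message whose author is a known predator
# df['is_predator'] = df['author_id'].isin(load_predator_ids())
```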

hosseinfani commented 1 year ago

@rezaBarzgar Thanks. I left an inline comment here; please have a look.

hamedwaezi01 commented 1 year ago

Hi! Recently we discovered some issues with reading the data and fixed them. Since the extracted dataset has changed, I want to update the stats retrieved from the newly processed dataset. I computed the stats for the train and test sets separately; here we go:

NOTES: the 'n' in some of the rows refers to the number of authors in a conversation.

| Stat | Train | Test |
|---|---:|---:|
| number of chatters | 97689 | 218702 |
| number of predator chatters | 142 | 254 |
| number of conversations | 66927 | 155128 |
| number of messages | 903607 | 2058781 |
| number of predatory conversations | 2016 | 3737 |
| average conversations per predator | 14.197 | 14.712 |
| average messages per conversation, n == 1 | 2.357 | 2.395 |
| average messages per conversation, n == 2 | 11.633 | 11.187 |
| average messages per conversation, n > 2 | 40.578 | 40.786 |
| average messages per predatory conversation, n == 1 | 1.744 | 1.944 |
| average messages per predatory conversation, n == 2 | 71.061 | 64.708 |
| average messages per predatory conversation, n > 2 | 81.200 | NaN |
| number of conversations, n == 1 | 12773 | 29561 |
| number of conversations, n == 2 | 45741 | 105862 |
| number of conversations, n > 2 | 8413 | 19705 |
| number of predatory conversations, n == 1 | 923 | 1850 |
| number of predatory conversations, n == 2 | 1088 | 1887 |
| number of predatory conversations, n > 2 | 5 | 0 |

Please let me know if anything needs explanation. These stats were generated using the code in this commit.
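For readers of this thread, a minimal sketch of how stats like these can be computed with pandas; the dataframe layout and the column names (`conv_id`, `author_id`, `msg_id`, `predatory`) are assumptions, not the repo's actual schema:

```python
import pandas as pd

def conversation_stats(df: pd.DataFrame) -> pd.Series:
    # one row per conversation: distinct authors, message count, conversation label
    per_conv = df.groupby('conv_id').agg(
        n_authors=('author_id', 'nunique'),
        n_msgs=('msg_id', 'count'),
        predatory=('predatory', 'any'))
    stats = {
        'number of chatters': df['author_id'].nunique(),
        'number of conversations': len(per_conv),
        'number of messages': len(df),
        'number of predatory conversations': int(per_conv['predatory'].sum()),
    }
    for label, mask in [('n == 1', per_conv['n_authors'] == 1),
                        ('n == 2', per_conv['n_authors'] == 2),
                        ('n > 2', per_conv['n_authors'] > 2)]:
        stats[f'number of conversations, {label}'] = int(mask.sum())
        stats[f'average messages per conversation, {label}'] = per_conv.loc[mask, 'n_msgs'].mean()
    return pd.Series(stats)
```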

hamedwaezi01 commented 1 year ago

I analyzed the stats and the dataset for some insights (this will probably be updated over time).

1. Predators with insignificant interactions

I set a threshold and checked how many unique predator authors we have for conversations of different sizes. I also checked whether some predators only had conversations with few messages and never appeared in long conversations; the answer is yes, there are some. These are their author ids (the author_id column):

'35d61fe88c3572f11a577e7a04be2140' '492bece78953e94ea30ac194609a16d6' '53c62668407d0f5a068a42903fd98984' '5904488cf6bfcd01beaf225ac00efd99' '7b38314806035fbc0d66afbf5018d975' 'ab2fc95662942aa8d03c1da3e7374fd2' 'd2f9cb5682214911bc17888ca80521ee' 'd50f114dde2edb12b72ecea83ebf63ce'

These author ids come from both the training and test sets. We cannot be sure how these conversations were generated, but my best guess is that some predatory conversations were sliced due to their size or the delay between messages, and that the author id of the predator (and maybe of all chatters) was then changed in these slices of conversations between the same authors. A sketch of the check that surfaced these ids is below.
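This sketch reuses the assumed message-level dataframe from the stats sketch above; the threshold value is illustrative:

```python
def short_only_predators(df, predator_ids, threshold=5):
    # total number of messages in each conversation
    conv_len = df.groupby('conv_id')['msg_id'].count()
    pred_msgs = df[df['author_id'].isin(predator_ids)]
    # for each predator, the length of the longest conversation they appear in
    longest = (pred_msgs.assign(conv_len=pred_msgs['conv_id'].map(conv_len))
                        .groupby('author_id')['conv_len'].max())
    # predators who never appear in a conversation at or above the threshold
    return longest[longest < threshold].index.tolist()
```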

2. Unbalanced distribution of entities

The PAN12 dataset is extremely unbalanced. According to the table in the first comment of this issue, the ratios of labelled predatory conversations to all conversations in the train, test, and combined train-test sets are 3.012%, 2.408%, and 2.590%, respectively.

3. Insignificant conversations

Grouping the data by the number of authors per conversation yields significant insights. The ratio of single-author conversations to all conversations is around 20% for all three datasets, and, on average, conversations with 1 author have about 2.3 messages while those with 2 authors have 11.3. For predatory conversations these stats change a lot, and single-author conversations clearly matter less. Based on this insight, we can omit records with fewer than 2 chatters. The number of predatory conversations with more than 2 authors is drastically smaller than the other two categories (5 in train, 0 in test), so we can also rule out conversations with more than 2 authors, as they can hardly be predatory. Therefore, we will only feed conversations between exactly two chatters to the model.
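A minimal sketch of the proposed filter, under the same assumed message-level layout as the earlier sketches:

```python
def keep_two_chatter_conversations(df):
    # per-message count of distinct authors in the message's conversation
    n_authors = df.groupby('conv_id')['author_id'].transform('nunique')
    # keep only messages belonging to conversations with exactly two chatters
    return df[n_authors == 2]
```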

hamedwaezi01 commented 1 year ago

I noticed by accident that the text of many messages contains author ids. I want to preprocess them, either by removing them or by replacing them with a special token like `<author_id>`. I think the latter would result in less data loss since, as far as I know, these author ids mean the corresponding users are mentioned in the text. I'd like to hear your thoughts too.

hosseinfani commented 1 year ago

@hamedwaezi01 Agree. Just replace them with a special token like you said.
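A sketch of this preprocessing step; the 32-character lowercase hex pattern is an assumption based on the author ids listed earlier in this thread:

```python
import re

# assumed shape of an author id, e.g. '35d61fe88c3572f11a577e7a04be2140'
AUTHOR_ID = re.compile(r'\b[0-9a-f]{32}\b')

def mask_author_ids(text: str) -> str:
    # replace every embedded author id with a single special token
    return AUTHOR_ID.sub('<author_id>', text)

# mask_author_ids("hey 35d61fe88c3572f11a577e7a04be2140 are you there?")
# -> 'hey <author_id> are you there?'
```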

hamedwaezi01 commented 1 year ago

Per our last conversation, I'd like to suggest that we drop some of the types from our feature vectors (garbage types and tokens). The main criterion would be token frequency: e.g., only the 10K most frequent types are kept in the feature vector. This will lead to faster computation and a smaller model. One might argue that this filter causes information loss, but a non-frequent token (one that occurs only twice, for example) does not have much to say. Additionally, we can force the encoder to keep tokens that the train set suggests are important, for example the numbers from 5 to 18, which are the ages of kids and teenagers.
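A minimal sketch of this pruning, keeping the top-10K types plus a forced list; the function and variable names are illustrative:

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size=10_000, forced=None):
    forced = set(forced or [])
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {tok for tok, _ in counts.most_common(max_size)}
    # force-keep tokens the train set suggests are important, even if rare
    return vocab | forced

# ages = [str(a) for a in range(5, 19)]   # the 5..18 age tokens
# vocab = build_vocab(train_docs, forced=ages)
```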

hosseinfani commented 1 year ago

@hamedwaezi01 I agree. Also, as long as the effect is the same across baselines (it makes all of them better), that's OK. However, if the effect is positive on your model but negative on the others, that's not fair.

hamedwaezi01 commented 10 months ago

I want to list a number of conversation-level features here, implemented or not, just to keep track of them (the list will hopefully be updated over time; a sketch computing these features follows the list):

  1. Number of authors
  2. Number of all messages
  3. Turn-taking
  4. Normalized timestamp of the message (in minutes) [0, 1]
  5. A flag whether the author of a message started the conversation or not
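This sketch computes the five features per message, again under the assumed message-level layout, with a hypothetical `time` column in minutes; turn-taking is approximated here as the number of speaker changes in a conversation:

```python
import pandas as pd

def conversation_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(['conv_id', 'time']).copy()
    g = df.groupby('conv_id')
    df['n_authors'] = g['author_id'].transform('nunique')            # 1. number of authors
    df['n_messages'] = g['author_id'].transform('size')              # 2. number of all messages
    df['turn_taking'] = g['author_id'].transform(
        lambda a: (a != a.shift()).sum() - 1)                        # 3. speaker changes
    t_min, t_max = g['time'].transform('min'), g['time'].transform('max')
    span = (t_max - t_min).replace(0, 1)                             # avoid div-by-zero
    df['norm_time'] = (df['time'] - t_min) / span                    # 4. normalized to [0, 1]
    starter = g['author_id'].transform('first')
    df['is_starter'] = (df['author_id'] == starter).astype(int)      # 5. conversation starter flag
    return df
```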