hosseinfani opened 2 years ago
@M-MoeedKhalid could you fill out the missing stats?
@rezaBarzgar
1. When the code for the stats is done, please push it and link it to this issue.
2. Also, create a readme.md file in the ./data folder and put a similar table there, including the links, etc. Later we will refer to the readme for anything related to the dataset.
When done, let me know so we can safely close this issue.
@hosseinfani Sure, will do that tomorrow and let you know.
@hosseinfani
- I completed the `get_stats()` function in main.py.
- Also added the readme and data-sandbox in ./data. Thank you.
- I believe the data has many bugs. For example, there are conversations that have `tagged_conv == 1` & `tagged_msg == 1` but no predator in them; i.e., some conversations have incorrect labels. In addition, there is only one author labeled as a predator in each of the train and test sets. That doesn't sound correct to me.
Are you sure? This is for the toy set.
@hosseinfani Hi, there was a bug in loading the predators' ids into a dataframe. I fixed it, and now the data is extracted without any problem. I will push the correct code and update the stats.
@rezaBarzgar thanks. I left an inline comment here; please have a look.
Hi! Recently, we discovered some issues with reading the data and fixed them. Since the extracted dataset has changed, I want to update the stats retrieved from the newly processed dataset. I got the stats for the train and test sets separately, and here we go:
NOTE: the 'n' in some of the rows refers to the number of authors.

| Stat | Train | Test |
|---|---|---|
| number of chatters | 97689 | 218702 |
| number of predator chatters | 142 | 254 |
| number of conversations | 66927 | 155128 |
| number of messages | 903607 | 2058781 |
| number of predatory conversations | 2016 | 3737 |
| average conversations per predator | 14.197 | 14.712 |
| average number of messages per conversation, n authors == 1 | 2.357 | 2.395 |
| average number of messages per conversation, n authors == 2 | 11.633 | 11.187 |
| average number of messages per conversation, n authors > 2 | 40.578 | 40.786 |
| average number of messages per predatory conversation, n authors == 1 | 1.744 | 1.944 |
| average number of messages per predatory conversation, n authors == 2 | 71.061 | 64.708 |
| average number of messages per predatory conversation, n authors > 2 | 81.200 | NaN |
| number of conversations, n authors == 1 | 12773 | 29561 |
| number of conversations, n authors == 2 | 45741 | 105862 |
| number of conversations, n authors > 2 | 8413 | 19705 |
| number of predatory conversations, n authors == 1 | 923 | 1850 |
| number of predatory conversations, n authors == 2 | 1088 | 1887 |
| number of predatory conversations, n authors > 2 | 5 | 0 |
Please let me know if anything needs explanation. These stats were generated using the code in this commit.
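For reference, a minimal sketch of how stats like the per-author-count rows above could be computed with a pandas groupby. The column names (`conv_id`, `author_id`) and the toy data are assumptions, not the actual schema of the project's dataframe:

```python
import pandas as pd

# Hypothetical message-level frame: one row per message.
df = pd.DataFrame({
    'conv_id':   ['c1', 'c1', 'c2', 'c2', 'c2', 'c3'],
    'author_id': ['a1', 'a2', 'a3', 'a3', 'a4', 'a5'],
})

# Number of distinct authors and messages per conversation.
per_conv = df.groupby('conv_id').agg(
    n_authors=('author_id', 'nunique'),
    n_msgs=('author_id', 'size'),
)

# Bucket conversations by author count: 1, 2, or >2.
bucket = per_conv['n_authors'].map(lambda n: '1' if n == 1 else ('2' if n == 2 else '>2'))

# Conversation counts and average messages per bucket,
# mirroring the "n authors == 1 / == 2 / > 2" rows of the table.
stats = per_conv.groupby(bucket).agg(
    n_convs=('n_msgs', 'size'),
    avg_msgs=('n_msgs', 'mean'),
)
print(stats)
```

The same grouping, restricted to rows with `tagged_conv == 1`, would give the predatory-conversation rows.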
I analyzed the stats and the dataset for some insights (this will probably be updated soon).
I set a threshold and checked how many unique predator authors we have for conversations of different sizes. I also checked whether some predators only had conversations with a few messages and never appeared in long conversations; the answer was yes, there are some. These are their author ids (author_id column):
'35d61fe88c3572f11a577e7a04be2140' '492bece78953e94ea30ac194609a16d6' '53c62668407d0f5a068a42903fd98984' '5904488cf6bfcd01beaf225ac00efd99' '7b38314806035fbc0d66afbf5018d975' 'ab2fc95662942aa8d03c1da3e7374fd2' 'd2f9cb5682214911bc17888ca80521ee' 'd50f114dde2edb12b72ecea83ebf63ce'
These author ids are from both the training and test sets. We cannot be sure how these conversations were generated, but my best guess is that some predatory conversations were sliced due to their size or the delay between messages, and then the author id of the predator (and maybe of all chatters) was changed in these conversations between the same authors.
The PAN12 dataset is extremely imbalanced. According to the table in the first comment of this issue, the ratios of predatory to all labelled conversations in the train, test, and combined train-test sets are 3.012%, 2.408%, and 2.590%, respectively.
Grouping the data by the number of authors in a conversation reveals significant insights. The ratio of single-author conversations to all conversations is around 20% for all three datasets. Additionally, conversations with 1 author average about 2.3 messages, versus about 11.3 for 2 authors. For predatory conversations these stats change a lot, and single-author conversations clearly matter much less; based on this, we can omit records with fewer than 2 chatters. Also, the number of predatory conversations with more than 2 authors is drastically smaller than in the other two categories, so we can rule out conversations with more than 2 authors as non-predatory. Therefore, we will only feed conversations between exactly two chatters to the model.
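The filter described above can be sketched in a few lines of pandas. The column names and toy data are assumptions for illustration:

```python
import pandas as pd

# Toy message-level frame; one row per message.
df = pd.DataFrame({
    'conv_id':   ['c1', 'c1', 'c2', 'c3', 'c3', 'c3'],
    'author_id': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
})

# Distinct-author count per conversation, broadcast back to each message row.
n_authors = df.groupby('conv_id')['author_id'].transform('nunique')

# Keep only conversations with exactly two chatters.
two_party = df[n_authors == 2]
print(two_party['conv_id'].unique())
```

Here `c2` (one author) and `c3` (three authors) are dropped, leaving only the two-chatter conversation `c1`.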
I accidentally noticed that the text of many messages contains some author id. I want to preprocess them, either by removing them or by changing them to a special token like `<author_id>`. I think the latter would result in less data loss since, as far as I know, these author ids mean the corresponding users are mentioned in the text.
I want to know your idea too.
@hamedwaezi01 agree. just change them to a special token like you said.
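A minimal sketch of that replacement. The ids listed earlier in this thread look like 32-character hex hashes, so the regex below assumes that format; the function name is hypothetical:

```python
import re

# Assumption: author ids are 32-char lowercase hex strings,
# as in the ids listed earlier in this issue.
AUTHOR_ID_RE = re.compile(r'\b[0-9a-f]{32}\b')

def mask_author_ids(text: str, token: str = '<author_id>') -> str:
    """Replace any embedded author id with a single special token."""
    return AUTHOR_ID_RE.sub(token, text)

msg = 'hey 35d61fe88c3572f11a577e7a04be2140 are you there?'
print(mask_author_ids(msg))  # hey <author_id> are you there?
```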
Per our last conversation, I'd like to suggest that we drop some of the types from our feature vectors (garbage types and tokens). The main criterion would be token frequency: say, only the 10K most frequent types are considered in the feature vector. This approach leads to faster computation and a smaller model. One might say this filter causes information loss, but a non-frequent token (for example, one that occurred only twice) does not have much to say. Additionally, we can force the encoder to consider tokens that the train dataset suggests are important, for example the numbers 5 to 18, which are the ages of teenagers and kids.
@hamedwaezi01 I agree. Also, as long as the effect is the same across baselines (improves all of them), that's ok. However, if the effect is positive on your model but negative on others, that's not fair.
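The frequency cutoff plus forced tokens could be sketched like this; the function name, cutoff, and toy data are assumptions, not the project's actual code:

```python
from collections import Counter

def build_vocab(token_stream, max_size=10_000, forced=()):
    """Keep the most frequent types, plus tokens forced in regardless
    of frequency (e.g. the ages 5..18 mentioned above)."""
    counts = Counter(token_stream)
    vocab = set(forced)              # forced tokens are always in
    for tok, _ in counts.most_common():
        if len(vocab) >= max_size:   # stop once the budget is spent
            break
        vocab.add(tok)
    return vocab

tokens = ['hi', 'hi', 'asl', 'hi', 'asl', 'zzz']
forced_ages = [str(a) for a in range(5, 19)]  # '5'..'18', 14 tokens
vocab = build_vocab(tokens, max_size=16, forced=forced_ages)
print(sorted(vocab))
```

With `max_size=16` and 14 forced age tokens, only the two most frequent corpus types ('hi', 'asl') fit; the rare 'zzz' is dropped.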
I want to list a number of conversation-level features here, implemented or not, just to keep track of them (this will be updated over time, hopefully):
PAN: what does this stand for?
https://pan.webis.de/clef12/pan12-web/sexual-predator-identification.html
https://pan.webis.de/downloads/publications/papers/inches_2012.pdf
To download the PAN12 dataset for the sexual predator identification problem, use this link. You have to request access, and it might take a few days, so keep that in mind if you are in a hurry.
Labels for Test Set:
- ids of predators (one per line): pan12-sexual-predator-identification-groundtruth-problem1.txt
- suspicious (of a perverted behavior) messages, given as ids of conversations and the message line# in the conversation: pan12-sexual-predator-identification-groundtruth-problem2.txt
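Minimal parsers for those two groundtruth files. The filenames come from the list above; the exact line layout of the problem2 file (conversation id and line number separated by whitespace) is an assumption based on its description:

```python
def parse_predator_ids(lines):
    """problem1 file: one predator id per line."""
    return {line.strip() for line in lines if line.strip()}

def parse_suspicious_messages(lines):
    """problem2 file (assumed layout): '<conversation_id> <line#>' per line."""
    pairs = []
    for line in lines:
        if line.strip():
            conv_id, line_no = line.split()
            pairs.append((conv_id, int(line_no)))
    return pairs

# Toy inputs standing in for the real files.
preds = parse_predator_ids(['abc123\n', 'def456\n'])
msgs = parse_suspicious_messages(['conv1 3\n', 'conv2 7\n'])
print(preds, msgs)
```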
Labels for Train Set:
Stats:
NOTES: