fani-lab / Osprey

Online Predatory Conversation Detection

adding predatory conversations to PAN2012 #20

Open rezaBarzgar opened 1 year ago

rezaBarzgar commented 1 year ago

The PAN dataset is from 2012. We know that the predatory conversations in PAN come from perverted-justice. However, here are some conversations that were added after 2012. We should check these conversations; if they are not in the dataset, I think it's a good idea to add them.
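A rough way to do the check could look like the sketch below. This is not repo code: the corpus file name and the conversations/conversation/message/text tag names are my recollection of the PAN12 layout and should be verified against the actual file.

```python
# Sketch only (not repo code): check whether a candidate chat log already
# appears in the PAN2012 corpus by matching normalized message texts.
# The file name and the <conversation>/<message>/<text> tag names are
# assumptions about the PAN12 layout; verify against the real corpus file.
import xml.etree.ElementTree as ET

def pan12_texts(xml_path="pan12-training-corpus.xml"):
    """Collect normalized message texts from every PAN12 conversation."""
    texts = set()
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "message":
            text = elem.findtext("text") or ""
            texts.add(" ".join(text.lower().split()))
            elem.clear()  # keep memory low; the corpus file is large
    return texts

def already_in_pan12(candidate_lines, pan12_text_set, threshold=0.8):
    """Flag a crawled conversation as a duplicate if most of its lines match."""
    norm = [" ".join(l.lower().split()) for l in candidate_lines if l.strip()]
    if not norm:
        return False
    hits = sum(1 for l in norm if l in pan12_text_set)
    return hits / len(norm) >= threshold
```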

hosseinfani commented 1 year ago

@rezaBarzgar Nice catch. Do you think this would be a task for @EhsanSl, that is, writing a crawler and creating a new dataset in the same XML format as the old one?

rezaBarzgar commented 1 year ago

@hosseinfani I think that's a good idea. These conversations can be useful.

rezaBarzgar commented 1 year ago

@EhsanSl Hi Ehsan, I assigned you to this issue.

The objective is to expand the dataset with conversations that are not in PAN2012. As Dr. Fani mentioned, you will need to write a crawler that extracts new data from perverted-justice in the same XML format.

We can have a meeting if you need any information.

Please submit your progress on this issue.
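For reference, here is a rough sketch of how the crawled conversations could be written out in a PAN2012-style XML file. The tag names (conversations/conversation/message/author/time/text) and the output file name are my assumptions about the PAN12 layout, not taken from the corpus itself, so they should be double-checked against the actual file.

```python
# Sketch only: serialize crawled conversations into a PAN2012-style XML file.
# Tag names and the output file name are assumptions; verify against the corpus.
import xml.etree.ElementTree as ET

def to_pan_xml(conversations, out_path="pj-extension.xml"):
    """conversations: list of dicts like
    {"id": "pj-0001",
     "messages": [{"author": "...", "time": "02:15", "text": "..."}, ...]}"""
    root = ET.Element("conversations")
    for conv in conversations:
        c = ET.SubElement(root, "conversation", id=conv["id"])
        for i, msg in enumerate(conv["messages"], start=1):
            m = ET.SubElement(c, "message", line=str(i))
            ET.SubElement(m, "author").text = msg.get("author", "")
            ET.SubElement(m, "time").text = msg.get("time", "")
            ET.SubElement(m, "text").text = msg.get("text", "")
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
```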

EhsanSl commented 1 year ago

nice try guys! 🥲 [attached image]

EhsanSl commented 1 year ago

Hi dear Reza, I hope everything is going well, brother. I haven't forgotten about the tasks, but frankly, I'm extremely swamped this week and have a few too many deadlines. I'll make up for it during reading week. x)

EhsanSl commented 1 year ago

Hi dear Dr. Hossein and dear Reza, I attached the crawler here: neuralcrawing_.zip. There are a few points worth mentioning:

- Since the chat logs do not separate conversations with a distinct box, there was no definite way to extract each one individually (to my understanding).
- The date is sometimes placed inside the conversation div and sometimes between divs, which makes it a bit challenging.
- The time formatting is not consistent across all the convicts.
- The crawler file is at: neuralcrawling/neuralcrawling/spiders/justice_spider.py
- To run the crawler, make sure the terminal path is neuralcrawling/neuralcrawling, then run the following command: scrapy crawl jspider -o output.json
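For anyone picking this up later, a minimal sketch of the kind of Scrapy spider assumed here is below. This is not the attached justice_spider.py: the spider name, start URL, and CSS selectors are placeholders, since, as the list above notes, the chat-log markup on perverted-justice.com is not consistent.

```python
# Minimal Scrapy sketch, not the attached justice_spider.py. The start URL,
# spider name, and selectors are placeholders; the real chat-log markup is
# inconsistent, so message/date/time splitting is left to a later cleaning step.
import scrapy

class JusticeSpiderSketch(scrapy.Spider):
    name = "jspider_sketch"
    start_urls = ["http://www.perverted-justice.com/"]  # placeholder: the real archive listing page differs

    def parse(self, response):
        # Follow links that look like individual chat-log pages (guessed pattern).
        for href in response.css("a::attr(href)").getall():
            if "archive" in href:
                yield response.follow(href, callback=self.parse_chatlog)

    def parse_chatlog(self, response):
        # Chat lines are not wrapped in per-message boxes, so just yield raw
        # text blocks and split them into messages/dates/times afterwards.
        for block in response.css("body *::text").getall():
            text = block.strip()
            if text:
                yield {"url": response.url, "raw": text}
```

It would be run the same way as the attached one, e.g. scrapy crawl jspider_sketch -o output.json from the project directory; the JSON output would then need a post-processing step to split the raw blocks and reformat them into the PAN-style XML.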