Open rezaBarzgar opened 1 year ago
@rezaBarzgar Nice catch. Do you think this would be a good task for @EhsanSl, i.e., writing a crawler and creating a new dataset in the same XML format as the old one?
@hosseinfani I think that's a good idea. These conversations can be useful.
@EhsanSl Hi Ehsan, I assigned you to this issue.
The objective is to expand the dataset with conversations that are not in PAN2012. As Dr. Fani mentioned, you need to write a crawler that extracts the new data from perverted-justice and stores it in the same XML format as PAN2012.
We can have a meeting if you need any information.
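As a rough starting point for the "same XML format" requirement, here is a minimal sketch of serializing crawled conversations to XML. The element and attribute names (`conversations`, `conversation id`, `message line`, `author`, `time`, `text`) are an assumption about the PAN2012 layout and should be checked against the actual corpus files; the input dictionary shape and the function name `build_conversations_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_conversations_xml(conversations):
    """Serialize conversations to an XML string.

    `conversations` is a hypothetical structure: a list of dicts, each with
    an "id" and a "messages" list of {"author", "time", "text"} dicts.
    The tag names below are an assumed PAN2012-style layout, not verified.
    """
    root = ET.Element("conversations")
    for conv in conversations:
        conv_el = ET.SubElement(root, "conversation", id=conv["id"])
        # Message lines are numbered from 1 within each conversation.
        for line_no, msg in enumerate(conv["messages"], start=1):
            msg_el = ET.SubElement(conv_el, "message", line=str(line_no))
            ET.SubElement(msg_el, "author").text = msg["author"]
            ET.SubElement(msg_el, "time").text = msg["time"]
            # ElementTree escapes special characters such as < and & itself.
            ET.SubElement(msg_el, "text").text = msg["text"]
    return ET.tostring(root, encoding="unicode")
```

Using `xml.etree.ElementTree` instead of string concatenation means chat text containing `<`, `>`, or `&` is escaped automatically.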
Please post your progress on this issue.
nice try guys! 🥲
Hi dear Reza, I hope everything is going well, brother. I haven't forgotten about the tasks, but frankly I'm extremely swamped this week and have a few too many deadlines. I'll make up for it starting reading week. x)
Hi dear Dr. Hossein and dear Reza,
I attached the crawler here: neuralcrawing_.zip. There are a few points worth mentioning:
- Since the conversations in the chat logs are not separated into distinct boxes, there was no definite way to extract each one individually (to my understanding).
- The date is sometimes placed inside the conversation div and sometimes between the divs, which makes extraction a bit challenging.
- The time formatting is not consistent across all the convicts.
- The crawler file is at `neuralcrawling/neuralcrawling/spiders/justice_spider.py`.
- To run the crawler, make sure the terminal's working directory is `neuralcrawling/neuralcrawling`, then run the following command: `scrapy crawl jspider -o output.json`
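For the inconsistent time formatting mentioned above, one possible approach is to try a list of candidate formats and normalize whatever matches to a single output format. The sketch below is not part of the attached crawler. The formats in `CANDIDATE_FORMATS` are guesses and would need to be adjusted to what the chat logs actually contain, and the `HH:MM` output is an assumption about what the target XML expects.

```python
from datetime import datetime

# Hypothetical list of formats; extend it after inspecting the real chat logs.
CANDIDATE_FORMATS = ("%I:%M:%S %p", "%I:%M %p", "%H:%M:%S", "%H:%M")


def normalize_time(raw):
    """Return the timestamp as 'HH:MM' (24-hour), or None if no format matches."""
    # Drop surrounding whitespace and brackets such as "(9:05 PM)" or "[21:05]".
    cleaned = raw.strip().strip("()[]").strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%H:%M")
        except ValueError:
            continue  # try the next candidate format
    return None  # caller can log the unparsed value for manual inspection
```

Returning `None` rather than raising keeps the crawl going and makes it easy to collect the unparsed values afterwards to see which formats are still missing.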
The PAN dataset is from 2012. We know that the predatory conversations in PAN come from perverted-justice. However, here are some conversations that were added after 2012. We should check these conversations. If they are not in the dataset, I think it's a good idea to add them.
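One way to check whether a conversation is already in the dataset is to compare fingerprints of the normalized message text. This is a minimal sketch: the input shape (a dict mapping a conversation id to its list of message strings) and the function names are hypothetical, and exact-match hashing will miss conversations that PAN2012 truncated, anonymized, or otherwise altered.

```python
import hashlib
import re


def fingerprint(messages):
    """Hash the lowercased, whitespace-normalized text of a conversation."""
    normalized = " ".join(
        re.sub(r"\s+", " ", message.lower()).strip() for message in messages
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_new_conversations(existing, candidates):
    """Return ids from `candidates` whose text is not found in `existing`.

    Both arguments map a conversation id to its list of message strings.
    """
    seen = {fingerprint(messages) for messages in existing.values()}
    return [
        conv_id
        for conv_id, messages in candidates.items()
        if fingerprint(messages) not in seen
    ]
```

If the exact match turns out to be too strict, a fuzzy comparison (for example, on the first few messages only) may be a better fit.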