DedSecInside / TorBot

Dark Web OSINT Tool

Collect data | Machine Learning Phase I #162

Closed KingAkeem closed 5 years ago

KingAkeem commented 5 years ago

Issue #161

Changes Proposed

Explanation of Changes

Save entries to a CSV file with the columns ID | TITLE | META TAGS | CONTENT
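
As a rough illustration of the change described above, here is a minimal sketch of writing entries with those four columns using Python's standard `csv` module. Only the column names come from this PR description; the `write_entries` helper, the comma delimiter, and the sample row are assumptions made for the example and are not taken from `collect_data.py`.

```python
import csv
import io

# Column names taken from the PR description; everything else is illustrative.
FIELDNAMES = ["ID", "TITLE", "META TAGS", "CONTENT"]


def write_entries(entries, out_file):
    """Write an iterable of dicts keyed by FIELDNAMES as CSV rows (hypothetical helper)."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)


# Hypothetical sample entry, not real collected data.
sample = [
    {
        "ID": 1,
        "TITLE": "Example Page",
        "META TAGS": "description=An example; keywords=demo",
        "CONTENT": "Hello, world",
    }
]

# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO(newline="")
write_entries(sample, buffer)
csv_text = buffer.getvalue()
```

Using `DictWriter` keeps each value tied to its column name, and the writer quotes fields such as page content that contain commas or newlines.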

KingAkeem commented 5 years ago

The title is being saved properly; I'm grabbing the text. https://github.com/DedSecInside/TorBot/blob/d18083f36be0f18e5255352f98d0f1c35ffb5ab7/modules/collect_data.py#L59

KingAkeem commented 5 years ago

This is ready to be re-reviewed

PSNAppz commented 5 years ago
  1. We just need to change the content part. Every website contains a <meta content="some description" name="description"> tag, and that information is all we need. If it is empty, we could just grab the contents of the <body> tag. This way all the noise is removed.

This is still not fixed?
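
The approach suggested in the quoted point (prefer the meta description, fall back to the body text when it is empty) could look roughly like the sketch below. It uses only the standard library's `html.parser` so that it runs on its own; the `DescriptionParser` class and `extract_content` function are hypothetical names for illustration, and the actual commit may well do this differently.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collects the meta description and the visible text inside <body>."""

    def __init__(self):
        super().__init__()
        self.description = ""
        self._in_body = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "description":
            # Valueless attributes arrive as None, hence the `or ""`.
            self.description = (attr_map.get("content") or "").strip()
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is inside <body> and not script/style.
        if self._in_body and not self._skip_depth and data.strip():
            self._body_parts.append(data.strip())

    @property
    def body_text(self):
        return " ".join(self._body_parts)


def extract_content(html):
    """Return the meta description, or the body text if the description is empty."""
    parser = DescriptionParser()
    parser.feed(html)
    parser.close()
    return parser.description or parser.body_text
```

For a page whose description tag is present but empty, `extract_content` returns the joined text of the `<body>` with script and style contents skipped.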

KingAkeem commented 5 years ago

That was done in this commit https://github.com/DedSecInside/TorBot/pull/162/commits/652007392752d48c29b38901662e06a08c07df3c

PSNAppz commented 5 years ago

Ready for review?

KingAkeem commented 5 years ago

Yep yep