DedSecInside / TorBot

Dark Web OSINT Tool

Replace bs4 with gotor when gathering data #272

Closed KingAkeem closed 11 months ago

KingAkeem commented 1 year ago

Describe the solution you'd like

HTML parsing that is currently done in Python should be moved to gotor.

https://github.com/DedSecInside/TorBot/blob/dev/torbot/modules/collect_data.py

{
  "id": "259113c9-2066-4b22-a7ed-36f47370602a", // a UUID
  "title": "Example", // the HTML title, check `soup.title` from BS4
  "metadata": ["example-metadata"], // list of content from meta tags
  "content": "this is an example entry" // content from body tag
}
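
To illustrate how the Python side might consume a response of this shape once parsing lives in gotor, here is a minimal sketch. The `PageRecord` class and `parse_page_record` function are hypothetical names, not part of TorBot or gotor, and the sample uses plain JSON without the explanatory `//` comments shown above, since `json.loads` does not accept comments.

```python
import json
from dataclasses import dataclass
from typing import List
from uuid import UUID


@dataclass
class PageRecord:
    """Hypothetical client-side view of the proposed gotor response."""
    id: UUID
    title: str
    metadata: List[str]
    content: str


def parse_page_record(raw: str) -> PageRecord:
    """Decode one JSON object shaped like the example in this issue."""
    data = json.loads(raw)
    return PageRecord(
        id=UUID(data["id"]),  # raises ValueError if the id is not a valid UUID
        title=data["title"],
        metadata=list(data.get("metadata", [])),  # assume missing means empty
        content=data.get("content", ""),
    )


sample = """{
  "id": "259113c9-2066-4b22-a7ed-36f47370602a",
  "title": "Example",
  "metadata": ["example-metadata"],
  "content": "this is an example entry"
}"""

record = parse_page_record(sample)
print(record.title, record.metadata)
```

Validating the `id` as a UUID on the client is optional; it is included here only because the issue describes the field as a UUID.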

Describe alternatives you've considered

Create endpoints that return individual fields, e.g. an endpoint to parse the HTML title, an endpoint to parse metadata, etc.

I think this would result in many requests without adding much value, when the data could instead be consolidated into a single response. A single response also allows assigning a UUID server-side, which could be useful if the entry gets inserted into a database.
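
The consolidated approach can be sketched as one pass over the HTML that fills every field and assigns the UUID in the same place. This is only an illustration of the idea, written in Python with the standard library's `html.parser` as a stand-in; gotor itself is written in Go and would do the equivalent there. `PageFields` and `build_record` are hypothetical names.

```python
import uuid
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collects the title, meta tag contents, and body text in one pass."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metadata = []
        self.body_text = []
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta":
            # Keep only meta tags that carry a non-empty content attribute.
            content = dict(attrs).get("content")
            if content:
                self.metadata.append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body:
            # Note: this naive version also keeps script and style text.
            self.body_text.append(data)


def build_record(html):
    """Return one consolidated record instead of one response per field."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {
        "id": str(uuid.uuid4()),  # assigned by whoever builds the record
        "title": parser.title.strip(),
        "metadata": parser.metadata,
        # Collapse runs of whitespace so the content is a single line.
        "content": " ".join(" ".join(parser.body_text).split()),
    }


page = (
    "<html><head><title>Example</title>"
    '<meta name="description" content="example-metadata"></head>'
    "<body><p>this is an example entry</p></body></html>"
)
rec = build_record(page)
print(rec["title"], rec["metadata"], rec["content"])
```

With per-field endpoints, the same page would have to be fetched or parsed once per field, and there would be no single natural place to assign the identifier.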