Collect more detailed information

captainGeech42 commented 3 years ago

Right now, only the victim organization name is grabbed. We should grab the full text of the post on the leak site, along with a screenshot. The screenshot should be included in the notification message where possible, and a config toggle should be present for whether or not a screenshot should be sent.

biligonzales commented 3 years ago

Hello @captainGeech42 I can handle the screenshot part if you want :) (and/or the rest as well)

captainGeech42 commented 3 years ago

That'd be awesome! I've been busy and haven't had a chance to work on the backlog of issues here.

I think that storing screenshots as a b64 blob in the sqlite database is the easiest way to do this, but I'm not sure. What do you think? I'm open to suggestions on a good way to handle this.
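
To make the b64-in-sqlite idea concrete, here is a minimal sketch using only the standard library. The `screenshots` table, its columns, and both helper functions are hypothetical names for illustration, not anything that exists in the project.

```python
import base64
import sqlite3


def save_screenshot(conn, victim_name, png_bytes):
    """Store the screenshot as base64 text next to the victim name."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    conn.execute(
        "INSERT INTO screenshots (victim_name, image_b64) VALUES (?, ?)",
        (victim_name, encoded),
    )
    conn.commit()


def load_screenshot(conn, victim_name):
    """Return the most recently stored screenshot bytes, or None."""
    row = conn.execute(
        "SELECT image_b64 FROM screenshots "
        "WHERE victim_name = ? ORDER BY id DESC LIMIT 1",
        (victim_name,),
    ).fetchone()
    return base64.b64decode(row[0]) if row else None


# Hypothetical schema, created in memory for the example
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE screenshots ("
    "id INTEGER PRIMARY KEY, victim_name TEXT, image_b64 TEXT)"
)

# Placeholder bytes standing in for a real PNG capture
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
save_screenshot(conn, "example-victim", fake_png)
restored = load_screenshot(conn, "example-victim")
```

Base64 makes the stored data roughly a third larger than the raw image. SQLite also has a native BLOB type, so storing the raw bytes directly would be an alternative if database size becomes a concern.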

biligonzales commented 3 years ago

No problem, will do my best to help you a bit :)

Regarding the image storage, I was thinking of doing it the same way as you (I'm not sure it would make sense to store them as PNG files on the filesystem...). We just need to make sure that it won't bloat the smallish sqlite db in the long term.

Also, we need to think about small things:

what do we do on re-run?
- do we save all screenshots?
- do we save only the latest?
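
As a rough illustration of the "only the latest" option, the sketch below keeps one row per victim and lets a re-run overwrite it. The table name, columns, and function are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per victim, so a re-run replaces the old capture
conn.execute(
    "CREATE TABLE latest_screenshots ("
    "victim_name TEXT PRIMARY KEY, image_b64 TEXT, captured_at TEXT)"
)


def save_latest_only(conn, victim_name, image_b64, captured_at):
    """Keep a single screenshot per victim, overwriting any earlier one."""
    conn.execute(
        "INSERT OR REPLACE INTO latest_screenshots VALUES (?, ?, ?)",
        (victim_name, image_b64, captured_at),
    )
    conn.commit()


# Two runs for the same victim leave only the second capture behind
save_latest_only(conn, "example-victim", "Zmlyc3Q=", "2021-06-01T00:00:00Z")
save_latest_only(conn, "example-victim", "c2Vjb25k", "2021-06-02T00:00:00Z")
rows = conn.execute(
    "SELECT image_b64, captured_at FROM latest_screenshots"
).fetchall()
```

Saving all screenshots instead would mean dropping the primary key on `victim_name` and using a plain `INSERT`, at the cost of a database that grows with every run.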

ghost commented 3 years ago

Why not just send the screenshot to the channels (Discord, Slack, or Telegram) and then remove it from the server? That way there is no need to keep all the screenshots saved.

biligonzales commented 3 years ago

This could be an option too. It would only prevent the tool from providing them later (in the case of #2, for instance).

biligonzales commented 3 years ago

I started to play with the addition of details (I played with Conti, adding both description and website url to the database).

I have implementation questions:

- as description can be quite long (a few lines of text), and possible future screenshot storage might be heavy, shall we add columns to the existing table, or do we add a details table to the db?
- do we offer the possibility to enable or disable the gathering of these details in the yaml config?
- I'm not an sqlite expert, how are database schema changes handled? does it have to be migrated?

I'd be happy to have your feedback on those questions. Thanks in advance!

captainGeech42 commented 3 years ago

Based on conversations I've had with other users of this service, persistent storage of screenshots is a requirement.

Like I mentioned on #2, a re-architecture to include a formal database layer on something bigger than SQLite (e.g., MySQL) may need to be a prior task before this can be properly implemented. To allow for more flexible hosting, I feel like storing the screenshots in a database rather than on the filesystem, in S3, or somewhere similar would be easier.

Since the different shame sites provide different types of information about victims (Conti being a great example of one that provides a lot), I am leaning towards a generic details column in the Victims model. However, having structured columns could also be beneficial. Maybe a description column and a generic, unstructured metadata column is the best approach, and then each site crawler can add whatever data the actors provide, and I can just keep an eye on the implementations to make sure there is some reasonable consistency. Does that make sense? Do you agree?
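
A minimal sketch of the "description plus generic metadata" layout, written against plain sqlite3 rather than the project's ORM so it stays self-contained. The table layout, the `add_victim` helper, and the sample values are assumptions for illustration. If this ends up as a SQLAlchemy declarative model, note that `metadata` is a reserved attribute name there, so the Python attribute would need a different name than the column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE victims (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        site TEXT NOT NULL,
        description TEXT,  -- structured: many sites publish one
        metadata TEXT      -- unstructured: whatever else the actor provides
    )
    """
)


def add_victim(conn, name, site, description=None, metadata=None):
    """Insert a victim; crawlers fill in only the details their site offers."""
    conn.execute(
        "INSERT INTO victims (name, site, description, metadata) "
        "VALUES (?, ?, ?, ?)",
        (name, site, description, metadata),
    )
    conn.commit()


# A site with rich details and a site that only lists a name
add_victim(
    conn,
    "Example Corp",
    "conti",
    description="Text of the leak post",
    metadata="Victim URL: https://victim.xyz",
)
add_victim(conn, "Other Org", "some-other-site")
stored = conn.execute(
    "SELECT name, description, metadata FROM victims ORDER BY id"
).fetchall()
```

Both new columns are nullable, so crawlers for sites that publish nothing beyond a name do not need to change.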

For now, I am not concerned about data storage being an issue, but new config options should definitely be introduced so that users in a more resource-constrained environment can control what types of data get collected.
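
One way the config options could look once the YAML file has been parsed into a dict. The `details` section and every key name below are hypothetical; the project's real config layout may differ.

```python
from dataclasses import dataclass


@dataclass
class CollectionOptions:
    # Hypothetical option names; the real config keys may differ
    collect_description: bool = True
    collect_screenshot: bool = False
    send_screenshot: bool = False


def options_from_config(config: dict) -> CollectionOptions:
    """Read the optional 'details' section, falling back to defaults."""
    section = config.get("details") or {}
    return CollectionOptions(
        collect_description=bool(section.get("description", True)),
        collect_screenshot=bool(section.get("screenshot", False)),
        send_screenshot=bool(section.get("send_screenshot", False)),
    )


# An existing config without the new section keeps working with defaults
defaults = options_from_config({})
# A user who wants screenshots stored but not pushed to notifications
custom = options_from_config({"details": {"screenshot": True}})
```

Defaulting the heavier options to off means an existing config file behaves the same after an upgrade.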

SQLAlchemy handles database migrations in the ORM layer, so as long as the models are properly updated, it shouldn't have issues. That being said, careful testing of any field renaming should be done, but adding new fields/models shouldn't break anything.
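
One thing worth double-checking here: as far as I know, SQLAlchemy's `create_all()` only creates tables that do not exist yet and does not add columns to a table that is already there, so an existing SQLite file may need an explicit step (or a migration tool such as Alembic). Below is a minimal sketch of such a step with plain sqlite3. The `add_column_if_missing` helper and the `victims` table layout are assumptions for illustration.

```python
import sqlite3


def add_column_if_missing(conn, table, column, column_type):
    """Add a column to an existing table unless it is already present.

    Table and column names are interpolated into the SQL, so they must
    come from trusted code, never from user input.
    """
    # PRAGMA table_info returns one row per column; index 1 is the name
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {column_type}")
    conn.commit()
    return True


# Simulate a database created before the new columns were introduced
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE victims (id INTEGER PRIMARY KEY, name TEXT)")

added_first = add_column_if_missing(conn, "victims", "description", "TEXT")
added_again = add_column_if_missing(conn, "victims", "description", "TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(victims)")]
```

Running the helper at startup is safe to repeat, since the second call finds the column and does nothing.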

biligonzales commented 3 years ago

I agree with you.

Let's add a description column then, since a description is present on many shame sites, and keep a metadata column to store anything else (in my first example for Conti, I'll dump the victim's website url in it). May I suggest that we fill that metadata column with json? Or do you have a better idea?

I'm going to try to implement it for all the currently supported site crawlers.

I'll do the screenshot part right after.

captainGeech42 commented 3 years ago

I would rather leave the metadata field as unstructured text, so that it can be easily consumed. Otherwise, a schema would need to be defined, and that feels a little more structured than I think this needs to be.

For example, with Conti:

Victim URL: https://victim.xyz

I would prefer that to

{
    "victim_url": "https://victim.xyz"
}

Is that what you were envisioning, or am I misinterpreting?
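
For comparison, a small sketch that produces both forms from the same data. The `format_metadata` helper is a hypothetical name, and the field labels are taken from the Conti example above.

```python
import json


def format_metadata(fields: dict) -> str:
    """Render fields as plain 'Label: value' lines, one per field."""
    return "\n".join(f"{label}: {value}" for label, value in fields.items())


# Plain-text form: readable as-is in a notification message
as_text = format_metadata({"Victim URL": "https://victim.xyz"})

# JSON form: needs agreed key names and a parser on the consuming side
as_json = json.dumps({"victim_url": "https://victim.xyz"}, indent=4)
```

The plain-text form can be dropped straight into a notification, while the JSON form only pays off if something downstream needs to parse individual fields.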

captainGeech42 / ransomwatch

Collect more detailed information #40