Get risks as a proper reference dataset

aaronkaplan commented 8 years ago

Currently on the server we have a table risks in the postgresql DB. Please dump it as CSV and put it to the reference data repo so that we have a clearly defined mapping risk_id -> risk name

Thx

rufuspollock commented 8 years ago

@chorsley i think it would probably be better to port the google spreadsheet we already have since that has additional info.

What do people think?

chorsley commented 8 years ago

I've added it to S3 @ /dev/clean/risk_ids.json for the moment, based on the Postgres table. (Note - I don't have rights to close the issue, so please feel free to close it if you're satisfied with this.)

@rgrp which spreadsheet are you referring to? If it's the "CERT Feeds Data Inventory", the numeric risk IDs that actually appear in our current ETLv1 data don't appear to be listed here.

rufuspollock commented 8 years ago

@chorsley i mean the risk and places inventory from old version in May: https://docs.google.com/a/atomatic.net/spreadsheets/d/1PosJEjJ-exlPER8ycVwWtDdQOuCsj-41Ctiuts7qGVY/edit?usp=drive_web (minus the summary statistics)

chorsley commented 8 years ago

The IDs in the data files to date are based on IDs in Postgres, which don't appear to be in the spreadsheet you linked. I think it's best to stick with the risk_ids.json for now, at least until ETLv2. We'll have a smoother way of dealing with this then, likely self-descriptive string IDs in the file.

aaronkaplan commented 8 years ago

On 09 Sep 2016, at 09:50, Rufus Pollock notifications@github.com wrote:

@chorsley i mean the risk and places inventory from old version in May: https://docs.google.com/a/atomatic.net/spreadsheets/d/1PosJEjJ-exlPER8ycVwWtDdQOuCsj-41Ctiuts7qGVY/edit?usp=drive_web (minus the summary statistics)

okay.. adding this + explanations of the risk into a JSON/CSV file might be really useful. However, the old list did not contain id numbers (ints). Our ETL process builds on top of ints for risk_ids.

Best, a.

rufuspollock commented 8 years ago

@aaronkaplan are we using ints because of size efficiency considerations?

I guess it will help but make the data a bit less "readable". That said space savings esp when we save to CSV may make this worth it.

Either way let's add the integer ids to the google doc / CSV and export it.

rufuspollock commented 8 years ago

@aaronkaplan @chorsley can we agree on key columns we want here? Here's a first stab:

id - name identifier e.g. openntp - suitable for use in urls etc
id_int - integer id for use in databases etc
title - full title
description
category - not so sure about this one but @aaronkaplan has mentioned that we want to group risks at some point
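
For illustration only, here is a minimal sketch of what a CSV export with these columns could look like and how it might be loaded back by integer id. Only the column names and the slug openntp come from this thread; the integer value, title and description in the example row are made-up placeholders, and the helper names (risks_to_csv, load_risks) are hypothetical.

```python
import csv
import io

# Column names taken from the proposal above.
COLUMNS = ["id", "id_int", "title", "description", "category"]

# Placeholder row: only the slug "openntp" appears in the thread.
# The integer id, title and description are invented for illustration.
EXAMPLE_ROWS = [
    {
        "id": "openntp",
        "id_int": 1,
        "title": "Open NTP",
        "description": "placeholder description",
        "category": "",  # left empty until categories are agreed
    },
]


def risks_to_csv(rows):
    """Serialise risk rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def load_risks(text):
    """Parse CSV text into a dict keyed by the integer id."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["id_int"]): row for row in reader}


if __name__ == "__main__":
    csv_text = risks_to_csv(EXAMPLE_ROWS)
    print(csv_text)
    print(load_risks(csv_text)[1]["id"])
```

Keeping both id and id_int in one file means either key can be used downstream without a second lookup table.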

aaronkaplan commented 8 years ago

On 09 Sep 2016, at 13:19, Rufus Pollock notifications@github.com wrote:

@aaronkaplan i'd recommend against using ints if you can here unless there are big size considerations. It will make the data that much more usable. That said space savings esp when we save to CSV may make this worth it. Either way let's add the integer ids to the google doc / CSV and export it.

If it's compressed CSVs, the space savings argument is less valid IMHO. However, right now we have ints for risk_ids. That's just how it is now. We can of course change it to always export the risk name in any large CSV dump of disaggregated data. Let us know your preference. I prefer IDs right now and an extra small table risk_id -> risk name

a.
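
As a rough sketch of the "ints plus a small lookup table" option: the disaggregated rows keep the integer risk_id, and the name is attached only when a readable export is needed. The mapping values, the row fields and the function name add_risk_names are all hypothetical; the real ids live in the Postgres risks table.

```python
# Hypothetical lookup table; the real risk_id -> risk name mapping
# would come from the reference data, not from this example.
RISK_NAMES = {1: "openntp", 2: "example-risk"}


def add_risk_names(rows, names=RISK_NAMES):
    """Yield copies of rows with a readable 'risk' field added.

    Rows keep their integer risk_id; ids missing from the lookup
    table are labelled 'unknown' rather than raising an error.
    """
    for row in rows:
        yield {**row, "risk": names.get(row["risk_id"], "unknown")}


if __name__ == "__main__":
    # Assumed row shape, for illustration only.
    data = [
        {"ip": "192.0.2.1", "risk_id": 1},
        {"ip": "192.0.2.2", "risk_id": 99},
    ]
    for row in add_risk_names(data):
        print(row)
```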

kxyne commented 8 years ago

IMO Sequential IDs are OK because we can 2^x them to do one-hot encoding. This lets us collate IPs according to how many/which feeds they have and we can measure crossover (something I want to do post deadline).
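
A minimal sketch of the 2^x idea, assuming small sequential integer ids and (ip, risk_id) observation pairs; the function names and sample data are hypothetical. Each risk gets a single bit (1 << risk_id), and OR-ing the bits per IP gives a combined bitmask, so strictly speaking the per-IP value is multi-hot rather than one-hot.

```python
from collections import defaultdict


def risk_bit(risk_id):
    """Return 2**risk_id, i.e. a single bit reserved for this risk."""
    return 1 << risk_id


def collate(observations):
    """Combine (ip, risk_id) pairs into one bitmask per IP."""
    masks = defaultdict(int)
    for ip, risk_id in observations:
        masks[ip] |= risk_bit(risk_id)
    return dict(masks)


def risk_count(mask):
    """Number of distinct risks set in a bitmask."""
    return bin(mask).count("1")


def crossover(masks, risk_a, risk_b):
    """Count IPs that appear under both risks."""
    both = risk_bit(risk_a) | risk_bit(risk_b)
    return sum(1 for mask in masks.values() if mask & both == both)


if __name__ == "__main__":
    # Sample observations using documentation-range addresses.
    obs = [("192.0.2.1", 1), ("192.0.2.1", 2), ("192.0.2.2", 1)]
    masks = collate(obs)
    print(masks)
    print(crossover(masks, 1, 2))
```

This only stays compact while the ids remain small and dense, which is one argument for keeping them sequential.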

rufuspollock commented 8 years ago

@kxyne is this now live in the reference data repo?

chorsley commented 8 years ago

First cut is now at https://github.com/cybergreen-net/reference_data/tree/master/risks. Currently has a col for category, but empty until we discuss what the categories are. Closing this issue - please reopen for further discussion, or feel free to edit directly.

aaronkaplan commented 8 years ago


On 30.09.2016, at 04:28, kxyne notifications@github.com wrote:

IMO Sequential IDs are OK because we can 2^x them to do one-hot encoding. This lets us collate IPs according to how many/which feeds they have and we can measure crossover (something I want to do post deadline).

Cool idea ;)


aaronkaplan commented 8 years ago


On 30.09.2016, at 12:37, Chris Horsley notifications@github.com wrote:

First cut is now at https://github.com/cybergreen-net/reference_data/tree/master/risks. Currently has a col for category, but empty until we discuss what the categories are.

Categories should follow the eCSIRT II taxonomy. Please google "Enisa eCSIRT II", then look for the high-level category.

In our case it is "vulnerable system" or so. For all 4 feeds.


cybergreen-net / pm

Get risks as a proper reference dataset #24