Closed aaronkaplan closed 8 years ago
@chorsley i think it would probably be better to port the google spreadsheet we already have since that has additional info.
What do people think?
I've added to S3 @ /dev/clean/risk_ids.json for the moment, based on the Postgres table. (Note - I don't have rights to close the issue, so please feel free if you're satisfied with this).
@rgrp which spreadsheet are you referring to? If it's the "CERT Feeds Data Inventory", the numeric risk IDs that actually appear in our current ETLv1 data don't appear to be listed here.
@chorsley i mean the risk and places inventory from old version in May: https://docs.google.com/a/atomatic.net/spreadsheets/d/1PosJEjJ-exlPER8ycVwWtDdQOuCsj-41Ctiuts7qGVY/edit?usp=drive_web (minus the summary statistics)
The IDs in the data files to date are based on IDs in Postgres, which don't appear to be in the spreadsheet you linked. I think it's best to stick with the risk_ids.json for now, at least until ETLv2. We'll have a smoother way of dealing with this then, likely self-descriptive string IDs in the file.
okay.. adding this + explanations of the risk into a JSON/CSV file might be really useful. However, the old list did not contain id numbers (ints). Our ETL process builds on top of ints for risk_ids.
Best, a.
@aaronkaplan are we using ints because of size efficiency considerations?
I guess it will help but make the data a bit less "readable". That said space savings esp when we save to CSV may make this worth it.
Either way let's add the integer ids to the google doc / CSV and export it.
@aaronkaplan @chorsley can we agree on key columns we want here? Here's a first stab:
On 09 Sep 2016, at 13:19, Rufus Pollock notifications@github.com wrote:
@aaronkaplan i'd recommend against using ints if you can here unless there are big size considerations. It will make the data that much more usable. That said space savings esp when we save to CSV may make this worth it. Either way let's add the integer ids to the google doc / CSV and export it.
If the CSVs are compressed, the space savings argument is less valid IMHO. However, right now we have ints for risk_ids; that's just how it is now. We can of course change it to always export the risk name in any large CSV dump of disaggregated data. Let us know your preference. I prefer IDs right now, plus an extra small table mapping risk_id -> risk name.
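A minimal sketch of the "small lookup table" option, in case it helps the discussion: the data keeps integer risk_ids, and the name is resolved only at export time. The column names (ip, risk_id, risk_name), the helper names, and the risk names in the lookup table are all assumptions made up for illustration, not the actual contents of the Postgres table.

```python
import csv
import io

# Hypothetical lookup table: risk_id -> risk name (values are illustrative only).
RISK_NAMES = {1: "open-dns", 2: "open-ntp"}


def add_risk_names(rows, risk_names):
    """Yield each row with a 'risk_name' field resolved from its integer risk_id."""
    for row in rows:
        risk_id = int(row["risk_id"])
        # Unknown IDs are kept and labelled "unknown" rather than silently dropped.
        yield {**row, "risk_name": risk_names.get(risk_id, "unknown")}


def export_csv(rows, risk_names):
    """Return CSV text with the risk name added as an extra column."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["ip", "risk_id", "risk_name"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(add_risk_names(rows, risk_names))
    return out.getvalue()


# Example rows as they might come out of csv.DictReader (all values are strings).
sample = [
    {"ip": "192.0.2.1", "risk_id": "1"},
    {"ip": "192.0.2.2", "risk_id": "2"},
]
csv_text = export_csv(sample, RISK_NAMES)
```

This keeps the stored data compact and leaves the choice of "IDs or names in the dump" to the export step.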
a.
IMO Sequential IDs are OK because we can 2^x them to do one-hot encoding. This lets us collate IPs according to how many/which feeds they have and we can measure crossover (something I want to do post deadline).
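A rough sketch of what the 2^x idea could look like, assuming risk IDs are small sequential ints starting at 1 (so ID n maps to bit n-1; that offset is an assumption, as are all function names here). Each risk gets a one-hot flag, and OR-ing the flags per IP gives a bitmask of the feeds that IP appears in.

```python
def risk_bit(risk_id):
    """Map a sequential risk ID (1, 2, 3, ...) to a single-bit flag: 2 ** (risk_id - 1)."""
    if risk_id < 1:
        raise ValueError("risk IDs are assumed to start at 1")
    return 1 << (risk_id - 1)


def collate(observations):
    """Combine (ip, risk_id) pairs into one bitmask per IP."""
    masks = {}
    for ip, risk_id in observations:
        masks[ip] = masks.get(ip, 0) | risk_bit(risk_id)
    return masks


def feed_count(mask):
    """Number of distinct feeds (set bits) in a mask."""
    return bin(mask).count("1")


def crossover(masks, risk_a, risk_b):
    """Count IPs that appear in both feeds."""
    both = risk_bit(risk_a) | risk_bit(risk_b)
    return sum(1 for mask in masks.values() if mask & both == both)


# Illustrative observations only (documentation IP range, made-up risk IDs).
observations = [
    ("192.0.2.1", 1), ("192.0.2.1", 3),
    ("192.0.2.2", 3),
    ("192.0.2.3", 1), ("192.0.2.3", 2), ("192.0.2.3", 3),
]
masks = collate(observations)
```

With the masks in hand, "how many feeds is this IP in" is feed_count(masks[ip]) and "how much do feeds 1 and 3 overlap" is crossover(masks, 1, 3).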
@kxyne is this now live in the reference data repo?
First cut is now at https://github.com/cybergreen-net/reference_data/tree/master/risks. Currently has a col for category, but empty until we discuss what the categories are. Closing this issue - please reopen for further discussion, or feel free to edit directly.
On 30.09.2016, at 04:28, kxyne notifications@github.com wrote:
IMO Sequential IDs are OK because we can 2^x them to do one-hot encoding. This lets us collate IPs according to how many/which feeds they have and we can measure crossover (something I want to do post deadline).
Cool idea ;)
On 30.09.2016, at 12:37, Chris Horsley notifications@github.com wrote:
First cut is now at https://github.com/cybergreen-net/reference_data/tree/master/risks. Currently has a col for category, but empty until we discuss what the categories are.
Categories should follow the eCSIRT II taxonomy. Please google "ENISA eCSIRT II", then look for the high-level category.
In our case it is "vulnerable system" or similar, for all 4 feeds.
Currently on the server we have a table called risks in the PostgreSQL DB. Please dump it as CSV and put it in the reference data repo so that we have a clearly defined mapping risk_id -> risk name.
Thx
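One possible way to do the dump, sketched against the generic Python DB-API so it runs without the server: for the real table you would pass a PostgreSQL connection (e.g. from psycopg2) instead. The column names id and name, the function name, and the sample rows are assumptions; the actual schema of the risks table may differ. An in-memory SQLite database stands in for Postgres here.

```python
import csv
import io
import sqlite3


def dump_risks_csv(conn, out):
    """Write the risks table to `out` as CSV, header row first, ordered by ID."""
    cur = conn.cursor()
    # Assumed column names; adjust to the real schema.
    cur.execute("SELECT id, name FROM risks ORDER BY id")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur.fetchall())


# Stand-in database with made-up rows, only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE risks (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO risks VALUES (?, ?)", [(2, "open-ntp"), (1, "open-dns")]
)

buf = io.StringIO()
dump_risks_csv(conn, buf)
risks_csv = buf.getvalue()
```

Writing to a file instead of a buffer (open("risks.csv", "w", newline="")) gives something that can be committed to the reference data repo as-is.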