cybergreen-net / pm

Tech project management repo (issue tracker only)

First sample of clean disaggregated data in bitstore #8

Closed rufuspollock closed 8 years ago

rufuspollock commented 8 years ago

Planned due date: Tues 30th 6pm GMT

@chorsley @kxyne creating this issue out of our call earlier today.

rufuspollock commented 8 years ago

@chorsley @kxyne did you see this?

chorsley commented 8 years ago

@rgrp It's probably not going to be today as we've been looking at the date issues as discussed, but it's on the schedule for tomorrow.

rufuspollock commented 8 years ago

@chorsley any update?

chorsley commented 8 years ago

The data is now available in private-bits-cybergreen-net under /dev/clean/dns-scan. This is 1:10 sampled, enriched data for 3 weeks in August; the format is:

timestamp, risk id, IP address, ASN, country code.
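
For concreteness, a minimal sketch of reading one of these sampled files from the bucket - the object name under dev/clean/dns-scan/ is a placeholder, and the column names simply follow the format above:

```python
# Minimal sketch, not the actual ETL code: read one of the sampled, enriched
# files straight from the bucket. The object name under dev/clean/dns-scan/ is
# a placeholder, and the column names follow the format listed above
# (these first samples have no header line, so names are supplied here).
import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="private-bits-cybergreen-net",
    Key="dev/clean/dns-scan/sample.csv",  # placeholder object name
)
df = pd.read_csv(
    obj["Body"],
    header=None,
    names=["timestamp", "risk_id", "ip", "asn", "country_code"],
)
print(df.head())
```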

chorsley commented 8 years ago

There is now a second risk type, ntp, under /dev/clean/ntp-scan. Additionally, we now have /dev/clean/risk_ids.json which explains how the numeric risk IDs map to string identifiers.

Please also note - it's highly likely we'll review these path names, so please treat these as "yet to be confirmed".
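
As an illustration, a minimal sketch of applying that mapping - it assumes risk_ids.json is a flat JSON object keyed by numeric ID, which may not match the actual structure:

```python
# Minimal sketch, assuming risk_ids.json is a flat JSON object mapping numeric
# IDs (as strings) to string identifiers - adjust if the real structure differs.
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="private-bits-cybergreen-net",
    Key="dev/clean/risk_ids.json",
)
risk_names = json.load(obj["Body"])

def risk_name(risk_id):
    """Resolve a numeric risk id from the scan CSVs to its string identifier."""
    return risk_names.get(str(risk_id), "unknown")
```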

rufuspollock commented 8 years ago

@chorsley a couple of quick comments based on use, quoted and answered in the follow-up below:

chorsley commented 8 years ago

@rgrp follow-ups:

> Can we get a header line in the CSV files by default, i.e. the first line is the column headings?

We will make sure ETLv2 produces this. I'll do that if/when we do another run with ETLv1.
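
For illustration, a minimal sketch of what the header-line request amounts to in the export step (column names as per the format above; the data row is made up):

```python
# Minimal sketch: write the column headings as the first row of the export,
# then the data rows. The single data row below is purely illustrative.
import csv

rows = [
    ("2016-08-05 02:00:06.0+00", 2, "192.0.2.1", 64496, "US"),
]

with open("dns-scan-sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "risk_id", "ip", "asn", "country_code"])  # header line first
    writer.writerows(rows)
```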

> Can the date field just be a date, or do we need the time, e.g. 2016-08-05 vs 2016-08-05 02:00:06.0+00?

There's likely some interesting analysis we can do here, like working out the distribution of a scan over time, or calculating biases towards a particular time of day. So, I'd prefer to keep this in.
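
For example, a minimal sketch of the kind of time-of-day analysis the full timestamps allow (the records are invented purely for illustration):

```python
# Minimal sketch: count records per hour of day to look for time-of-day bias.
# The records below are made up to show the shape of the analysis.
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2016-08-05 02:00:06", "2016-08-05 02:15:41", "2016-08-05 14:03:12"],
            utc=True,
        ),
        "risk_id": [2, 2, 1],
    }
)
by_hour = df["timestamp"].dt.hour.value_counts().sort_index()
print(by_hour)  # number of records observed in each hour of day (UTC)
```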

> If we compress, can we put the compression extension as the last part of the file name, e.g. xyz.csv.gz rather than xyz.gz.csv?

Yes, this was a quick hack on my part - I just tacked the .csv extension onto the end of the original file name, and you'll see it's not actually compressed. ETLv2 won't have this problem.
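
A minimal sketch of the intended naming - compress the finished CSV so the gzip suffix comes last (file names are placeholders):

```python
# Minimal sketch: gzip a finished export so the compression suffix comes last
# (dns-scan-sample.csv -> dns-scan-sample.csv.gz).
import gzip
import shutil

with open("dns-scan-sample.csv", "rb") as src:
    with gzip.open("dns-scan-sample.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
```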

> Generally you don't need to sort the data export by time - though it's fine if you do. It may be easier for you to do dumps without worrying about order :-)

The scan data we receive for open NTP and DNS is already roughly ordered by date (I'm assuming it's the order in which the results are received - you get the odd occasion where lines are out of order). In the enriched files you have, we haven't performed any extra sorting.

rufuspollock commented 8 years ago

To complete what remains:

chorsley commented 8 years ago

Just to update: it seems the CSV format is acceptable, so I'm now working on getting all the CSV sets up on S3 using ETLv2.
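
A minimal sketch of that upload step, assuming boto3 and the still-to-be-confirmed dev/clean/ prefix (file and object names are placeholders):

```python
# Minimal sketch: upload one finished, compressed CSV set to the bucket under
# the dev/clean/ prefix. File name and key are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "ntp-scan-sample.csv.gz",
    "private-bits-cybergreen-net",
    "dev/clean/ntp-scan/ntp-scan-sample.csv.gz",
)
```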

rufuspollock commented 8 years ago

@chorsley great! When do you think we'll have complete data in there?

chorsley commented 8 years ago

The first 4 sets are there. There are some glitches with the NTP data sizes, but it's complete for the most part. I'll raise a new issue for the NTP data problem.