generate statistics based on input pcap

fmadio commented 5 years ago

Used to generate more accurate traffic flows by profiling a PCAP and approximately replicating its traffic pattern.

nanji-fmad commented 5 years ago

Hi, I checked the genflow code and also ran it on my PC. For example: ./pcap_genflow --pktcnt 100 --flowcnt 10 --pktsize 512 > test.pcap

So it generates 100 packets and distributes them across 10 different flows (round robin). The 5-tuples are random.
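A minimal sketch of the round-robin assignment described above. This is not the actual pcap_genflow code; the function names are hypothetical and only illustrate how a packet index maps to a flow and how many packets each flow ends up with.

```c
#include <assert.h>
#include <stdint.h>

/* Flow index for the i-th generated packet under round-robin assignment. */
static uint32_t RoundRobinFlow(uint32_t pkt_index, uint32_t flow_count)
{
    assert(flow_count > 0);
    return pkt_index % flow_count;
}

/* Number of packets a given flow receives out of pkt_count in total.
 * The first (pkt_count % flow_count) flows each get one extra packet. */
static uint32_t RoundRobinFlowPkts(uint32_t pkt_count, uint32_t flow_count,
                                   uint32_t flow)
{
    assert(flow_count > 0 && flow < flow_count);
    return pkt_count / flow_count +
           ((flow < pkt_count % flow_count) ? 1u : 0u);
}
```

With --pktcnt 100 and --flowcnt 10, every flow would receive 10 packets under this scheme.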

Since the customer cannot send the actual pcap but we want similar traffic, we will model the actual pcap: create histograms (these may hold flow information only, without the actual data), then run this tool on the histograms so it can replicate traffic flows similar to the customer's original pcap.

Step 1: ./pcap_genflow --gen_histograms ==> This will generate plain text histograms (or maybe a JSON array)

Step 2: ./pcap_genflow --gen_pcap ==> This will generate a pcap file, similar to the current implementation, but the flows and packet lengths will follow the histograms.

Questions: 1) Can we use "pcap2json" or a similar tool of yours to analyze the traffic first and then generate the histograms from that? We might need to generate histograms only for specific traffic, say for the top N flows or for a given MAC address, and that kind of logic is already in pcap2json.

2) Do we need to list all the packets with exact header details from the original pcap? Say we have 20 packets in one flow, and we have the max and min packet length for that flow. Do we also need to capture the size of each packet in that flow, or can we just take the average and generate from that?

3) Do we need to hide the actual IP address information and replace it with random IPs/ports, or can we save that info in the histogram? What exactly do we need to save (or what are we allowed to save) in the histogram?

nanji-fmad commented 5 years ago

Reply from Aaron:

Answers inline

Step 1: ./pcap_genflow --gen_histograms ==> This will generate plain text histograms (or maybe a JSON array)

JSON would be nicer, but it's a PITA to process in C. What do you suggest?

Step 2: ./pcap_genflow --gen_pcap ==> This will generate a pcap file, similar to the current implementation, but the flows and packet lengths will follow the histograms.

Questions: 1) Can we use "pcap2json" or a similar tool of yours to analyze the traffic first and then generate the histograms from that? We might need to generate histograms only for specific traffic, say for the top N flows or for a given MAC address, and that kind of logic is already in pcap2json.

Do you have any suggestions? We could copy the code or instrument pcap2json instead. I'm a bit worried that modifying pcap2json itself would add too much bulk to the program; it's already getting on the heavy side.

I was thinking of creating flow records, with each flow having various histograms, e.g. packet size and something about the timing of packets. This would end up in a TopN list where only the top N flows get exported.

2) Do we need to list all the packets with exact header details from the original pcap? Say we have 20 packets in one flow, and we have the max and min packet length for that flow. Do we also need to capture the size of each packet in that flow, or can we just take the average and generate from that?

It specifically cannot include exact details: IP, MAC, port etc. cannot be included in the output file, as this is the customer's private information. We need to use enumeration.

Then a histogram on how often each flow is seen in the PCAP. e.g.

Flow 0 | Histogram
Flow 1 | Histogram
...

See here for packet size histograms: https://github.com/fmadio/pcap2json/blob/master/flow.c#L2099 It's easy to do; it just consumes a bit of memory. Min / Max / Mean do not characterize the traffic pattern very well.

3) Do we need to hide the actual IP address information and replace it with random IPs/ports, or can we save that info in the histogram? What exactly do we need to save (or what are we allowed to save) in the histogram?

Everything needs to be enumerated: Flow 0, Flow 1, Host 0, Host 1, TCP flow 0, UDP flow 0, VLAN 0, VLAN 1, MPLS 0, MPLS 1, etc. No actual information from the PCAP can be used.
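A minimal sketch of the enumeration idea, assuming IPv4 hosts and a small fixed-size table. The names HostTable_t, HostEnumerate and HostLabel are hypothetical and not taken from pcap2json; the point is only that the real address stays in memory while the output carries an index assigned in order of first appearance.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_HOSTS 1024  /* hypothetical table size for this sketch */

/* Maps real IPv4 addresses to anonymous indices in order of first appearance. */
typedef struct {
    uint32_t addr[MAX_HOSTS];  /* real addresses, kept in memory only */
    int      count;            /* number of distinct hosts seen so far */
} HostTable_t;

/* Returns the enumerated index for ip, adding it on first sight.
 * Returns -1 if the table is full. A linear scan keeps the sketch short;
 * a hash table would be the obvious choice for large captures. */
static int HostEnumerate(HostTable_t *t, uint32_t ip)
{
    assert(t != NULL);
    for (int i = 0; i < t->count; i++) {
        if (t->addr[i] == ip) return i;
    }
    if (t->count >= MAX_HOSTS) return -1;
    t->addr[t->count] = ip;
    return t->count++;
}

/* Writes only the enumerated label, e.g. "Host 3", never the address. */
static int HostLabel(char *buf, size_t len, int index)
{
    return snprintf(buf, len, "Host %d", index);
}
```

The same pattern would apply to flows, VLANs and MPLS labels: one table per object type, with only the index written to the output file.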

nanji-fmad commented 5 years ago

For now I plan to put the histogram generation part in "pcap2json" (otherwise a lot of code would have to be copied). Output will be plain text, not JSON. The pcap generation from histograms will be added to pcap_genflow.

I know pcap2json is getting bigger, so later I can take on the task of splitting out the common or reusable part of pcap2json, either as a linkable library or some other way, so that "pcap_genflow" and similar tools can use that functionality. That way we don't need to maintain multiple copies of the code, the total number of lines of code stays small, and maintenance is much easier. Let me know what you think.

fmadio commented 5 years ago

If that's the simplest way, then no problem. Please share your thoughts on the format of the plain text output.

nanji-fmad commented 5 years ago

Flow <number> | <Flow property> | <Histogram>
Flow <number> | <Flow property> | <Histogram>
...
...
...
Where:
Flow property: <ETH/VLAN/MPLS> | <TCP/UDP> | <Pkt count>
Histogram: [<pkt-1 property> <pkt-2 property> ...]
Pkt-N property: <timestamp> | <pkt size>

If required, "Pkt property" can also include protocol-specific information such as ACK count.

fmadio commented 5 years ago

Understood. I don't think TCP flags need to be included at this point.

I guess we could configure pcap2json to have, say, a 1-hour snapshot time, then output the stats for that snapshot.

The packet size histogram probably needs to be CLI configurable, from 1 B to MTU sized.
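A minimal sketch of a packet size histogram with a configurable bin width, assuming "1 B to MTU sized" refers to the width of each bin. The names and the HIST_MAX_BINS cap are hypothetical, and this is not the code at the flow.c link above. Sizes 1..bin_width fall into bin 0, and anything above the MTU is clamped into the last bin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIST_MAX_BINS 2048  /* hypothetical cap on the number of bins */

typedef struct {
    uint32_t bin_width;           /* bytes per bin, 1 .. MTU */
    uint32_t bin_count;           /* number of bins in use */
    uint64_t bin[HIST_MAX_BINS];  /* packet count per bin */
} SizeHist_t;

/* Sets up a histogram covering sizes 1..mtu. Returns 0 on success,
 * -1 if the width is invalid or would need too many bins. */
static int SizeHistInit(SizeHist_t *h, uint32_t bin_width, uint32_t mtu)
{
    assert(h != NULL);
    if (bin_width == 0 || mtu == 0 || bin_width > mtu) return -1;
    uint32_t bins = (mtu + bin_width - 1) / bin_width;  /* round up */
    if (bins > HIST_MAX_BINS) return -1;
    memset(h, 0, sizeof(*h));
    h->bin_width = bin_width;
    h->bin_count = bins;
    return 0;
}

/* Counts one packet. Oversized packets are clamped into the last bin. */
static void SizeHistAdd(SizeHist_t *h, uint32_t pkt_size)
{
    uint32_t idx = (pkt_size == 0) ? 0 : (pkt_size - 1) / h->bin_width;
    if (idx >= h->bin_count) idx = h->bin_count - 1;
    h->bin[idx]++;
}
```

With a 64-byte bin width and a 1500-byte MTU this uses 24 bins per flow, which gives a rough idea of the memory trade-off: a 1-byte width would need 1500 counters per flow.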

For the packet arrival time histogram, I'm not sure of the best approach. The ideal approach is a 2-pass algorithm, but that is not really possible given the data size. It probably needs to be CLI configurable as well.

Any ideas about the flow frequency histogram? E.g. in pcap_genflow, how do we decide which flow to output next?

nanji-fmad commented 5 years ago

Probably need packet size histogram to be CLI configurable. 1B to MTU sized -> So we don't need to store the size and arrival timestamp of each packet in the flow?

Any ideas about the flow frequency histogram. e.g. when in pcap_genflow how to decide which flow to output next? -> Based on the timestamp of the first packet of the flow?

nanji-fmad commented 5 years ago

I had planned to generate each packet independently based on its arrival timestamp and packet size, so I intended to store that info for each packet.

This way we can replicate the exact user flow.

fmadio commented 5 years ago

Ahh, I was thinking of a more statistical approach, as the amount of data would be massive. Let me ask the client whether logging the timestamp delta and packet size of every packet would violate their privacy policy.

fmadio commented 5 years ago

Customer is OK with it, so let's try logging each packet in the TopN flows with dTime and Size, and see how much memory/storage it ends up taking.
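A minimal sketch of how per-packet dTime/Size records could drive the replay, and one possible answer to the earlier "which flow to output next" question: always emit from the flow whose pending packet has the earliest timestamp. All names here are hypothetical, and the sketch assumes dTime is stored in nanoseconds relative to the previous packet of the same flow (the first record being relative to the start of the capture).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One logged packet of a flow. */
typedef struct {
    uint64_t dtime_ns;  /* time since the previous packet of this flow */
    uint16_t size;      /* packet size in bytes */
} PktRec_t;

/* Replay cursor over one flow's records. */
typedef struct {
    const PktRec_t *rec;
    size_t          rec_count;
    size_t          next;     /* index of the next record to emit */
    uint64_t        next_ts;  /* absolute timestamp of that record */
} FlowReplay_t;

static void FlowReplayInit(FlowReplay_t *f, const PktRec_t *rec, size_t n)
{
    f->rec = rec;
    f->rec_count = n;
    f->next = 0;
    f->next_ts = (n > 0) ? rec[0].dtime_ns : 0;
}

/* Emits the earliest pending packet across all flows. Returns the flow
 * index it came from, or -1 once every flow is drained. Ties go to the
 * lowest flow index. */
static int ReplayNext(FlowReplay_t *flow, int flow_count,
                      uint64_t *ts, uint16_t *size)
{
    assert(flow != NULL && ts != NULL && size != NULL);
    int best = -1;
    for (int i = 0; i < flow_count; i++) {
        if (flow[i].next >= flow[i].rec_count) continue;  /* drained */
        if (best < 0 || flow[i].next_ts < flow[best].next_ts) best = i;
    }
    if (best < 0) return -1;

    FlowReplay_t *f = &flow[best];
    *ts = f->next_ts;
    *size = f->rec[f->next].size;
    f->next++;
    if (f->next < f->rec_count) f->next_ts += f->rec[f->next].dtime_ns;
    return best;
}
```

On typical 64-bit ABIs this record pads out to 16 bytes, so the in-memory cost is on the order of 16 bytes per logged packet; a packed or text encoding on disk could be smaller. The linear scan is fine for a small N; a min-heap keyed on next_ts would scale better for a large TopN list.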

Exciting!

fmadio / pcap_genflow

generate statistics based on input pcap #1