OpenMined / PySyft

Perform data science on data that remains in someone else's server
https://www.openmined.org/
Apache License 2.0

Monitoring network usage #2256

Closed LaRiffle closed 5 years ago

LaRiffle commented 5 years ago

This is part of #2235

Description: We need a tool to monitor how much data we send over the network, since SMPC is often said to consume a lot of bandwidth. This is essential for benchmarking our implementation against similar existing work.

DanyEle commented 5 years ago

One manual approach to monitor traffic sent over websockets is via Wireshark. But having it integrated in PySyft would indeed be cool!

LaRiffle commented 5 years ago

Oh, I didn't know we already had some solutions! It would be great to have something for VirtualWorkers too, since they are very practical for debugging, and, at the worker scale, something like a graph of bandwidth usage between workers.

DanyEle commented 5 years ago

Also it would be very useful to have a breakdown of the data transmitted as data points or models.

robert-wagner commented 5 years ago

A hacky way to do this would be to add something to the send function which increments a counter based on the size of the message being sent
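A minimal sketch of that idea, under stated assumptions: `send` here is a hypothetical transport function (not PySyft's actual send), and the `kind` label is an illustrative way to get the data-versus-model breakdown mentioned earlier. The decorator adds the payload size to a counter keyed by sender, recipient, and kind each time the function is called.

```python
from collections import Counter
import functools

# (sender, recipient, kind) -> total bytes passed to send()
traffic = Counter()


def count_bytes(send_fn):
    """Wrap a send function so every call adds its payload size to `traffic`."""
    @functools.wraps(send_fn)
    def wrapper(sender, recipient, payload, kind="data"):
        traffic[(sender, recipient, kind)] += len(payload)
        return send_fn(sender, recipient, payload, kind)
    return wrapper


@count_bytes
def send(sender, recipient, payload, kind="data"):
    """Hypothetical transport: pretends to deliver the payload and returns it."""
    return payload


send("alice", "bob", b"\x00" * 100, kind="data")
send("alice", "bob", b"\x00" * 40, kind="model")
send("bob", "alice", b"\x00" * 8)

for (sender, recipient, kind), size in sorted(traffic.items()):
    print(f"{sender} -> {recipient} [{kind}]: {size} bytes")
```

The keys of `traffic` are the edges of a worker-to-worker bandwidth graph, so the same counter could feed a plot later.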

kakirastern commented 5 years ago

Hi, I am interested in this issue, but might need some help/pointers to start...

kakirastern commented 5 years ago

So I will approach the problem via Wireshark first...

LaRiffle commented 5 years ago

Excellent idea! If you want to do this manually, you can also do it in the code. In that case you want each worker to store the amount of data sent and received, for example as two attributes of the worker: each time data is sent or received, you evaluate the size of the serialized data and add it to the corresponding attribute. To catch the event "data is sent or received", you will need to inspect how the serde module (ser-ialize / de-serialize) works. It's more hacky, but it helps you understand the code.
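The two-attribute approach could look roughly like the sketch below. Everything in it is an assumption made for illustration: `CountingWorker` is a toy class rather than a PySyft worker, and `pickle` stands in for the serde module so the example runs on its own.

```python
import pickle


class CountingWorker:
    """Toy worker that tallies the size of every serialized message."""

    def __init__(self, name):
        self.name = name
        self.bytes_sent = 0       # total serialized bytes this worker sent
        self.bytes_received = 0   # total serialized bytes this worker received
        self.inbox = []

    def send(self, obj, recipient):
        bin_message = pickle.dumps(obj)       # serialize (stand-in for serde)
        self.bytes_sent += len(bin_message)   # count right after serialization
        recipient.receive(bin_message)
        return len(bin_message)

    def receive(self, bin_message):
        self.bytes_received += len(bin_message)       # count before deserializing
        self.inbox.append(pickle.loads(bin_message))  # deserialize


alice = CountingWorker("alice")
bob = CountingWorker("bob")
n = alice.send([1.0, 2.0, 3.0], bob)
print(alice.bytes_sent, bob.bytes_received)
```

Counting at the serialization boundary measures payload bytes only; protocol headers added further down the stack are not included.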

kakirastern commented 5 years ago

Thanks for the tip! This hacky way sounds really interesting and promising... Am looking into it. Hopefully a PR will follow soon.

kakirastern commented 5 years ago

I am attempting a solution on my branch here: https://github.com/kakirastern/PySyft/tree/monitor-network-usage. I reckon it may take a few weeks to get to a working solution, as I am new to this repo and its code base.

Ankit-Dhankhar commented 5 years ago

I find this issue interesting and would love to help. @LaRiffle @kakirastern For the size of an object sent, we can directly accumulate the size of bin_message (for data sent) and bin_response (for data received). Should we instead consider the actual size of the data sent as measured with Wireshark, which includes IP headers and other protocol information? Capturing that would add its own computational overhead, but it would give an accurate measure of the data sent or received. On the other hand, we could take the compressed size of the object and assume the actual size is proportional to it, up to a constant. Which way should we proceed?
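To get a feel for how far the payload size is from the on-wire size, here is a rough estimate under simplifying assumptions: Ethernet II framing (14-byte header), IPv4 and TCP headers without options (20 bytes each), and a 1460-byte maximum segment size. It deliberately ignores WebSocket framing, TCP options, ACKs, retransmissions, and any TLS overhead, so it is a lower bound, and the function name is hypothetical.

```python
import math

ETH_HEADER = 14    # Ethernet II header, bytes (no VLAN tag, no FCS)
IPV4_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20    # TCP header without options
MSS = 1460         # typical TCP payload per segment on a 1500-byte MTU link


def estimate_wire_bytes(payload_len, mss=MSS):
    """Lower-bound estimate of frame bytes needed to carry one payload."""
    if payload_len <= 0:
        return 0
    segments = math.ceil(payload_len / mss)
    return payload_len + segments * (ETH_HEADER + IPV4_HEADER + TCP_HEADER)


payload = 10_000
wire = estimate_wire_bytes(payload)
print(f"payload={payload} B, estimated wire={wire} B, ratio={wire / payload:.3f}")
```

For large messages the ratio stays close to 1, which suggests the "proportionality constant" approach is reasonable for bulk transfers and least accurate for many small messages.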

kakirastern commented 5 years ago

@Ankit-Dhankhar Any help would be much appreciated. I agree I should use two variables bin_message (for data sent) and bin_response (for data received) instead of just one data_size variable which would seem like a misnomer.

kakirastern commented 5 years ago

And yeah, I agree that using Wireshark to monitor traffic sent over WebSockets might provide more detail about the data sent or received, which could be useful for getting a breakdown of the data transmitted, at the cost of some computational overhead.

kakirastern commented 5 years ago

I am thinking about adding something along the lines of the following in "websocket_client.py" and "websocket_server.py". Would it work well?

...
import pyshark
...

@staticmethod
def get_packet_size():
    """
    Returns the size of the serialized data using Wireshark.

    Args: TODO

    Returns: Size of the packet sent over WebSockets in a given event.
    """
    capture = pyshark.LiveCapture(interface='eth0')
    capture.sniff(timeout=60)

    for packet in capture:
        try:
            packet_size = packet.data.data
        except:
            packet_size = None
            raise Exception("Cannot determine packet size.")

    return packet_size

If not, I would really appreciate any feedback given so that I can learn from the experience.

DanyEle commented 5 years ago

Well, that only seems to work for the Ethernet interface:

capture = pyshark.LiveCapture(interface='eth0')

But if the user is connected to a WiFi network, this wouldn't work.

It would be nice to automatically detect the network interface the user is connected through, or to let them choose which network interface Wireshark should listen on (maybe take the network interface as a parameter?), and then sniff packets from there.
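One way to take the interface as a parameter while still offering a fallback is sketched below using only the standard library. The function names are hypothetical, and "first interface whose name does not start with `lo`" is just a heuristic for skipping loopback; it does not prove that the chosen interface is the one carrying the traffic.

```python
import socket


def list_interfaces():
    """Names of the network interfaces known to the OS."""
    return [name for _index, name in socket.if_nameindex()]


def resolve_interface(requested=None, available=None):
    """Return the interface to sniff on.

    If `requested` is given, check that it exists; otherwise fall back to
    the first non-loopback interface (a heuristic, see the note above).
    """
    if available is None:
        available = list_interfaces()
    if requested is None:
        candidates = [name for name in available if not name.startswith("lo")]
        if not candidates:
            raise ValueError("No non-loopback interface found.")
        return candidates[0]
    if requested not in available:
        raise ValueError(
            f"Unknown interface {requested!r}; available: {', '.join(available)}"
        )
    return requested


print(resolve_interface("wlan0", available=["lo", "eth0", "wlan0"]))
print(resolve_interface(None, available=["lo", "eth0", "wlan0"]))
```

Failing early with the list of available names gives the user a clearer message than an error raised later by the capture tool.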

kakirastern commented 5 years ago

Yes, thanks for pointing that out! According to the official pyshark docs, the argument interface can be set to None so that it would automatically detect the first available network interface the user is connected to, if I have not misinterpreted the original wording.

In the official docs it states that

param interface: Name of the interface to sniff on. If not given, takes the first available.

So I would change my code in websocket_client.py to:

...
import pyshark
...

@staticmethod
def get_packet_size(interface=None):
    """
    Returns the size of the serialized data using Wireshark.

    Args: TODO

    Returns: Size of the packet sent over WebSockets in a given event.
    """
    capture = pyshark.LiveCapture(interface=interface)
    capture.sniff(timeout=60)

    for packet in capture:
        try:
            packet_size = packet.tcp.data
        except:
            packet_size = None
            raise Exception("Cannot determine packet size.")

    return packet_size

However, I then have another issue: suppose I only sniff TCP packets, since WebSocket uses TCP as its transport protocol. Should I then use the data attribute to get the data info, or should I use something like the pretty_print() method to get the packet details? Or is there a third way to do this? The official docs are not really clear on this, hence my concern.

DanyEle commented 5 years ago

Getting the first network interface is indeed a good idea. I just tried launching Wireshark and it detected my wlan interface as my network interface.

About your second point, I can't really say, since I don't know those methods or attributes. The way I checked the traffic in Wireshark was by setting a filter on alice's port (assuming it is 8777):

tcp.port == 8777

After the data was transmitted, I would click on Statistics --> Capture File Properties and look under "Captured" for a count of the data transmitted over that port. That count included ACKs and retransmissions too, though.
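The same measurement could be approximated from Python, sketched here under assumptions: pyshark's `LiveCapture` accepts a `bpf_filter`, packets can be iterated after `sniff()`, and each packet exposes its frame length as the string attribute `length`. Behaviour may vary across pyshark versions, and a live capture needs tshark plus capture privileges, so the byte-summing helper is kept separate and checked with stand-in packets. Both function names are hypothetical.

```python
from types import SimpleNamespace


def total_frame_bytes(packets):
    """Sum the frame length of each captured packet, in bytes."""
    return sum(int(packet.length) for packet in packets)


def capture_port_bytes(interface, port, timeout=10):
    """Sniff `interface` for `timeout` seconds and total the bytes on a TCP port.

    Like the Wireshark count, this includes ACKs and retransmissions.
    """
    import pyshark  # imported lazily so the helper above works without it

    capture = pyshark.LiveCapture(interface=interface, bpf_filter=f"tcp port {port}")
    capture.sniff(timeout=timeout)
    try:
        return total_frame_bytes(capture)
    finally:
        capture.close()


# Offline check with stand-in packets that mimic only the `length` attribute.
fake_packets = [SimpleNamespace(length="66"), SimpleNamespace(length="1514")]
total = total_frame_bytes(fake_packets)
print(total)
```

A call such as `capture_port_bytes("en0", 8777)` would then play the role of the `tcp.port == 8777` filter followed by the "Captured" statistic.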

kakirastern commented 5 years ago

Thanks for the tip! I did some experimenting on my own laptop and found that I needed to specify the interface for my setup to work; otherwise the error "dumpcap: There is no interface with that adapter index" would be thrown. For my interface setting I used en0, my WiFi connection. So I will modify the code as follows:

@staticmethod
def get_packet_info(interface=None):
    """
    Returns the size of the serialized data using Wireshark.

    Args:
        interface: A string. Name of the interface to sniff on.

    Returns: Size of the packet sent over WebSockets in a given event.
    """
    if interface is None:
        raise ValueError("Please provide the interface used.")

    capture = pyshark.LiveCapture(interface=interface)
    capture.sniff(timeout=60)
    packet_info = None  # stays None if nothing was captured
    for packet in capture:
        try:
            packet_info = packet.pretty_print()
        except Exception as err:
            raise RuntimeError("Cannot determine packet info.") from err
    return packet_info

I also found out that I cannot do something like packet.tcp.data as the data attribute for tcp does not exist, or at least not anymore. I can do something like packet.tcp.pretty_print. If I stick to the former, i.e. packet.pretty_print(), I would get info for all three layers including eth, ip, tcp. If I go with the latter then only the info regarding the tcp layer would be generated for output. Is there a preference as to whether only the tcp packets are outputted? I could change the code again if this latter approach is preferred.
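If the goal is a byte count rather than printed packet details, a possible alternative is to read numeric fields instead of calling pretty_print(). The sketch below assumes that each pyshark packet exposes its frame length as `packet.length` and that TCP packets expose the Wireshark field `tcp.len` (payload bytes) as `packet.tcp.len`; both are strings. The helper name and the stand-in packets are hypothetical and only mimic those two attributes, so the logic can be checked without a live capture.

```python
from types import SimpleNamespace


def summarize_sizes(packets):
    """Return (frame_bytes, tcp_payload_bytes) over an iterable of packets."""
    frame_bytes = 0
    tcp_payload_bytes = 0
    for packet in packets:
        frame_bytes += int(packet.length)          # all layers: eth + ip + tcp + data
        tcp_layer = getattr(packet, "tcp", None)   # None for non-TCP packets
        if tcp_layer is not None:
            tcp_payload_bytes += int(tcp_layer.len)  # data carried by TCP only
    return frame_bytes, tcp_payload_bytes


packets = [
    SimpleNamespace(length="1514", tcp=SimpleNamespace(len="1460")),  # full segment
    SimpleNamespace(length="54", tcp=SimpleNamespace(len="0")),       # bare ACK
    SimpleNamespace(length="42"),                                     # non-TCP frame
]
frames, payload = summarize_sizes(packets)
print(frames, payload)
```

Reporting both numbers keeps the whole-frame view and the TCP-only view side by side, so the choice between them can be deferred.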

kakirastern commented 5 years ago

And I am guessing if I would like pyshark to detect the network interface I would need to pass the tshark_path argument to the LiveCapture method to specify the path of the tshark binary as a basic requirement.

Moreover, I think I should definitely add the following arguments to the get_data_info function if suitable for the desired purposes:

* bpf_filter: BPF filter to use on packets.
* display_filter: Display (wireshark) filter to use.
* only_summaries: Only produce packet summaries, much faster but includes very little information
* disable_protocol: Disable detection of a protocol (tshark > version 2)
* decryption_key: Key used to encrypt and decrypt captured traffic.
* encryption_type: Standard of encryption used in captured traffic (must be either 'WEP', 'WPA-PWD', or 'WPA-PWK'. Defaults to WPA-PWK).
* tshark_path: Path of the tshark binary
* output_file: Additionally save captured packets to this file.
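A small sketch of how these options might be threaded through to the capture call. The option names are copied from the list above; whether a given pyshark version accepts each of them is not checked here, and `build_capture_kwargs` is a hypothetical helper. Options left as `None` are dropped so the library's own defaults apply.

```python
# Option names copied from the list above.
CAPTURE_OPTIONS = (
    "bpf_filter", "display_filter", "only_summaries", "disable_protocol",
    "decryption_key", "encryption_type", "tshark_path", "output_file",
)


def build_capture_kwargs(interface, **options):
    """Validate option names and drop the ones left as None."""
    unknown = sorted(set(options) - set(CAPTURE_OPTIONS))
    if unknown:
        raise TypeError(f"Unsupported capture option(s): {', '.join(unknown)}")
    kwargs = {"interface": interface}
    kwargs.update({key: value for key, value in options.items() if value is not None})
    return kwargs


kwargs = build_capture_kwargs("en0", bpf_filter="tcp port 8777", output_file=None)
print(kwargs)
# A caller could then pass these on, e.g. pyshark.LiveCapture(**kwargs)
```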

Will follow up on PR #2360 from this point onwards.