ahlashkari / ALFlowLyzer

GNU General Public License v3.0

Application Layer Flow Analyzer (ALFlowLyzer)

As part of the Understanding Cybersecurity Series (UCS), ALFlowLyzer is an open-source Python project that extracts application layer features from network traffic for Anomaly Profiling (AP), the third component of NetFlowLyzer.

ALFlowLyzer generates bidirectional flows from the application layer of network traffic. The first packet of a flow determines the forward (source to destination) and backward (destination to source) directions, so the statistical time-related features can be calculated separately for each direction. Additional functionality includes selecting features from the list of existing features, adding new features, and controlling the duration of the flow timeout. The first version supports the DNS protocol; other protocols will be supported in later versions. For more information regarding the DNS flow definition, please refer to the corresponding paper in the Copyright section.
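The direction rule described above can be illustrated with a minimal Python sketch. The `Packet` and `Flow` classes and the `build_flows` helper are hypothetical names introduced for illustration only; they are not ALFlowLyzer's own classes, and the sketch ignores flow timeouts entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Packet:
    # Hypothetical packet record, not ALFlowLyzer's internal representation.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    timestamp: float
    length: int


@dataclass
class Flow:
    # The endpoints are taken from the first packet seen for the flow.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    forward: List[Packet] = field(default_factory=list)
    backward: List[Packet] = field(default_factory=list)

    def add(self, pkt: Packet) -> None:
        # Same direction as the first packet means "forward".
        if (pkt.src_ip, pkt.src_port) == (self.src_ip, self.src_port):
            self.forward.append(pkt)
        else:
            self.backward.append(pkt)


def flow_key(pkt: Packet) -> Tuple:
    # Sort the two endpoints so both directions map to the same key.
    return tuple(sorted([(pkt.src_ip, pkt.src_port), (pkt.dst_ip, pkt.dst_port)]))


def build_flows(packets: List[Packet]) -> Dict[Tuple, Flow]:
    flows: Dict[Tuple, Flow] = {}
    for pkt in packets:
        key = flow_key(pkt)
        if key not in flows:
            # The first packet fixes the forward direction.
            flows[key] = Flow(pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port)
        flows[key].add(pkt)
    return flows


packets = [
    Packet("10.0.0.1", 5000, "10.0.0.2", 53, 0.0, 60),
    Packet("10.0.0.2", 53, "10.0.0.1", 5000, 0.1, 120),
    Packet("10.0.0.1", 5000, "10.0.0.2", 53, 0.2, 60),
]
flows = build_flows(packets)
```

Keeping separate forward and backward lists is what allows each statistic to be computed per direction.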

Installation

Before installing or running the ALFlowLyzer package, set up the required dependencies on your system. First make sure that both Python and pip are installed and working (for example, run the pip3 --version command). Then, execute the following command:

pip3 install -r requirements.txt

You are now ready to install ALFlowLyzer. Execute the following command in the package's root directory (where the setup.py file is located) to install the ALFlowLyzer package on your system:

On Linux:

python3 setup.py install

On Windows:

pip3 install .

After successfully installing the package, confirm the installation by running the following command:

alflowlyzer -h
usage: ALFlowLyzer [-h] [-c CONFIG_FILE] [-o] 
options:
 -h, --help            show this help message and exit
 -c CONFIG_FILE, --config-file CONFIG_FILE
                       JSON config file address.
 -o, --online-capturing
                       Capturing mode. The default mode is offline capturing. 

Execution

The core aspect of running ALFlowLyzer involves preparing the configuration file. This file is designed to facilitate users in customizing the program's behavior with minimal complexity and cost, thus enhancing program scalability. Below, we outline how to prepare the configuration file and subsequently demonstrate how to execute ALFlowLyzer using it.

Configuration File

The configuration file is formatted in JSON, comprising key-value pairs that enable customization of the package. While some keys are mandatory, others are optional. Below, each key is explained along with its corresponding value:

An example configuration file looks like this:

{
    "pcap_file_address": "/mnt/c/dataset/my_pcap_file.pcap",
    "output_file_address": "./output-of-my_pcap_file.csv",
    "label": "Benign",
    "number_of_threads": 4,
    "feature_extractor_min_flows": 2500,
    "writer_min_rows": 1000,
    "read_packets_count_value_log_info": 1000000,
    "check_flows_ending_min_flows": 20000,
    "capturer_updating_flows_min_value": 5000,
    "dns_activity_timeout": 30,
    "max_flow_duration": 120000,
    "floating_point_unit": ".4f",
    "max_rows_number": 800000,
    "features_ignore_list": [
        "dns_whois_domain_name",
        "dns_domain_email",
        "dns_domain_registrar",
        "dns_domain_creation_date",
        "dns_domain_expiration_date",
        "dns_domain_age",
        "dns_domain_country",
        "dns_domain_dnssec",
        "dns_domain_address",
        "dns_domain_city",
        "dns_domain_state",
        "dns_domain_zipcode",
        "dns_domain_name_servers",
        "dns_domain_updated_date"
    ]
}

In general, we recommend adjusting the values of the following options: number_of_threads, feature_extractor_min_flows, writer_min_rows, check_flows_ending_min_flows, and capturer_updating_flows_min_value, based on your system configuration. This is particularly important if your PCAP file is large (usually more than 4 GB with over 1 million TCP packets), to optimize program efficiency.
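As a rough illustration of adapting these options to a machine, the following sketch builds a configuration dictionary and writes it as JSON. The `make_config` and `write_config` helpers are hypothetical, the rule of leaving one CPU core free is an assumption rather than a recommendation from the project, and all other values are copied from the example above.

```python
import json
import os


def make_config(pcap_path, output_path, label="Benign", ignore=None):
    # Assumption: use all cores but one; tune this for your own system.
    threads = max(1, (os.cpu_count() or 1) - 1)
    return {
        "pcap_file_address": pcap_path,
        "output_file_address": output_path,
        "label": label,
        "number_of_threads": threads,
        # The remaining values are copied from the example configuration.
        "feature_extractor_min_flows": 2500,
        "writer_min_rows": 1000,
        "read_packets_count_value_log_info": 1000000,
        "check_flows_ending_min_flows": 20000,
        "capturer_updating_flows_min_value": 5000,
        "dns_activity_timeout": 30,
        "max_flow_duration": 120000,
        "floating_point_unit": ".4f",
        "max_rows_number": 800000,
        "features_ignore_list": list(ignore or []),
    }


def write_config(config, path):
    # Write the configuration as indented JSON.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)


config = make_config("capture.pcap", "capture.csv")
```

Calling `write_config(config, "config.json")` would then produce a file that can be passed to the program with the -c option.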

Argument Parser

You can use -h to see different options of the program.

To execute ALFlowLyzer, simply run the following command:

alflowlyzer -c YOUR_CONFIG_FILE

Replace YOUR_CONFIG_FILE with the path to your configuration file.
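If you want to launch ALFlowLyzer from a Python script, for instance to process several configuration files in a row, a small wrapper along the following lines may help. The `build_command` and `run_alflowlyzer` helpers are hypothetical; only the -c and -o options shown in the help output above are used.

```python
import shutil
import subprocess


def build_command(config_path, online=False):
    # Only the documented options are used: -c CONFIG_FILE and -o.
    command = ["alflowlyzer", "-c", str(config_path)]
    if online:
        command.append("-o")
    return command


def run_alflowlyzer(config_path, online=False):
    # Fail early with a clear message if the package is not installed.
    if shutil.which("alflowlyzer") is None:
        raise FileNotFoundError("alflowlyzer was not found on PATH")
    return subprocess.run(build_command(config_path, online), check=True)
```

For example, `run_alflowlyzer("config.json")` is equivalent to running alflowlyzer -c config.json in a shell.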

This project has been successfully tested on Ubuntu 20.04, Ubuntu 22.04, Windows 10, and Windows 11. It should also work on other versions of Ubuntu (or even Debian), as long as your system has the necessary Python 3 packages (listed in the requirements.txt file).

Architecture


Extracted Features

ALFlowLyzer currently extracts 130 features, listed below:

  1. Duration
  2. Packets Numbers
  3. Receiving Packets Numbers
  4. Sending Packets Numbers
  5. Successful packet numbers (HTTP packets only)
  6. Successful packet rate (HTTP packets only)
  7. Delta Start
  8. Handshake Duration
  9. Total Bytes
  10. Receiving Bytes
  11. Sending Bytes
  12. Packets Rate
  13. Receiving Packets Rate
  14. Sending Packets Rate
  15. Packets Len Rate
  16. Receiving Len Packets Rate
  17. Sending Len Packets Rate
  18. Packets Len Min
  19. Packets Len Max
  20. Packets Len Mean
  21. Packets Len Median
  22. Packets Len Mode
  23. Packets Len Standard Deviation
  24. Packets Len Variance
  25. Packets Len Coefficient of Variation
  26. Packets Len Skewness
  27. Receiving Packets Len Min
  28. Receiving Packets Len Max
  29. Receiving Packets Len Mean
  30. Receiving Packets Len Median
  31. Receiving Packets Len Mode
  32. Receiving Packets Len Standard Deviation
  33. Receiving Packets Len Variance
  34. Receiving Packets Len Coefficient of Variation
  35. Receiving Packets Len Skewness
  36. Sending Packets Len Min
  37. Sending Packets Len Max
  38. Sending Packets Len Mean
  39. Sending Packets Len Median
  40. Sending Packets Len Mode
  41. Sending Packets Len Standard Deviation
  42. Sending Packets Len Variance
  43. Sending Packets Len Coefficient of Variation
  44. Sending Packets Len Skewness
  45. Receiving Packets Delta Len Min
  46. Receiving Packets Delta Len Max
  47. Receiving Packets Delta Len Mean
  48. Receiving Packets Delta Len Median
  49. Receiving Packets Delta Len Standard Deviation
  50. Receiving Packets Delta Len Variance
  51. Receiving Packets Delta Len Mode
  52. Receiving Packets Delta Len Coefficient of Variation
  53. Receiving Packets Delta Len Skewness
  54. Sending Packets Delta Len Min
  55. Sending Packets Delta Len Max
  56. Sending Packets Delta Len Mean
  57. Sending Packets Delta Len Median
  58. Sending Packets Delta Len Standard Deviation
  59. Sending Packets Delta Len Variance
  60. Sending Packets Delta Len Mode
  61. Sending Packets Delta Len Coefficient of Variation
  62. Sending Packets Delta Len Skewness
  63. Receiving Packets Delta Time Max
  64. Receiving Packets Delta Time Mean
  65. Receiving Packets Delta Time Median
  66. Receiving Packets Delta Time Standard Deviation
  67. Receiving Packets Delta Time Variance
  68. Receiving Packets Delta Time Mode
  69. Receiving Packets Delta Time Coefficient of Variation
  70. Receiving Packets Delta Time Skewness
  71. Sending Packets Delta Time Min
  72. Sending Packets Delta Time Max
  73. Sending Packets Delta Time Mean
  74. Sending Packets Delta Time Median
  75. Sending Packets Delta Time Standard Deviation
  76. Sending Packets Delta Time Variance
  77. Sending Packets Delta Time Mode
  78. Sending Packets Delta Time Coefficient of Variation
  79. Sending Packets Delta Time Skewness

Note: Delta features are the differences (in time, length, or any other quantity) between two consecutive packets.
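A delta series can be sketched in a few lines of Python. The `deltas` helper below is a hypothetical name used for illustration, not a function from the ALFlowLyzer code base; it assumes the difference is taken as "next minus previous", which is consistent with the negative delta length values that appear in the sample output.

```python
def deltas(values):
    # Difference between each packet and the one immediately before it.
    return [later - earlier for earlier, later in zip(values, values[1:])]


packet_lengths = [66, 1514, 66, 66]
packet_times = [0.0, 0.5, 2.0, 2.25]

length_deltas = deltas(packet_lengths)
time_deltas = deltas(packet_times)
```

A flow with n packets in one direction yields n - 1 deltas, so a single-packet direction has no delta values to summarize.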

DNS Related

  1. Domain Name
  2. WhoisDomainName
  3. Top Level Domain
  4. Second Level Domain
  5. Domain Name Length
  6. Sub Domain Name Length
  7. Domain Name 1-Gram
  8. Domain Name 2-Gram
  9. Domain Name 3-Gram
  10. Numerical Percentage
  11. Character Distribution
  12. Character Entropy
  13. DomainEmail
  14. DomainRegistrar
  15. DomainCreationDate
  16. DomainExpirationDate
  17. DomainAge
  18. DomainCountry
  19. DomainDNSSEC
  20. DomainOrganization
  21. DomainAddress
  22. DomainCity
  23. DomainState
  24. DomainZipcode
  25. DomainNameServers
  26. DomainUpdatedDate
  27. Continuous Numeric Max Len
  28. Continuous Alphabet Max Len
  29. Continuous Consonant Max Len
  30. Continuous Same Alphabet Max Len
  31. Vowel Consonant Ratio
  32. Conv Freq Vowel Consonant
  33. Distinct TTL Values
  34. TTL Values Min
  35. TTL Values Max
  36. TTL Values Mean
  37. TTL Values Mode
  38. TTL Values Variance
  39. TTL Values Standard Deviation
  40. TTL Values Median
  41. TTL Values Skewness
  42. TTL Values Coefficient of Variation
  43. Distinct A Resource Records
  44. Distinct NS Resource Records
  45. Average Authority Resource Records
  46. Average Additional Resource Records
  47. Average Answer Resource Records
  48. Query Resource Record Type
  49. Answer Resource Record Type
  50. Query Resource Record Class
  51. Answer Resource Record Class

Statistical Information Calculation

We use different libraries to calculate the statistical features. Below you can see the libraries and a brief definition of each, based on their documentation:

Nine mathematical functions are used to extract different features. The way ALFlowLyzer calculates each of them is described below:

  1. Min

    You know what it means :). The 'min' function (Python built-in) calculates the minimum value in a given list.

  2. Max

    Same as min. The 'max' function (Python built-in) calculates the maximum value in a given list.

  3. Mean

    The 'mean' function from 'statistics' library (Python built-in) calculates the mean value of a given list. According to the library documentation:

    The arithmetic mean is the sum of the data divided by the number of data points. It is commonly called “the average”, although it is only one of many different mathematical averages. It is a measure of the central location of the data.

    The same documentation describes the related fmean() function, which runs faster than mean() and always returns a float. The data may be a sequence or iterable. If the input dataset is empty, a StatisticsError is raised.

  4. Median

    The 'median' function from 'statistics' library (Python built-in) calculates the median value of a given list. According to the library documentation:

    Return the median (middle value) of numeric data, using the common “mean of middle two” method. If data is empty, StatisticsError is raised. data can be a sequence or iterable.

    The median is a robust measure of central location and is less affected by the presence of outliers. When the number of data points is odd, the middle data point is returned. When the number of data points is even, the median is interpolated by taking the average of the two middle values.

  5. Variance

    The 'pvariance' function from 'statistics' library (Python built-in) calculates the population variance of a given list. According to the library documentation:

    Return the population variance of data, a non-empty sequence or iterable of real-valued numbers. Variance, or second moment about the mean, is a measure of the variability (spread or dispersion) of data. A large variance indicates that the data is spread out; a small variance indicates it is clustered closely around the mean.

    Raises StatisticsError if data is empty.

  6. Standard Deviation

    The 'pstdev' function from 'statistics' library (Python built-in) calculates the population standard deviation of a given list. According to the library documentation:

    Return the population standard deviation (the square root of the population variance). See pvariance() for arguments and other details.

  7. Mode

    The 'mode' function from 'scipy.stats' library calculates the mode value of a given list. According to the library documentation, this function:

    Return an array of the modal (most common) value in the passed array.

    If there is more than one such value, only the smallest is returned. The bin-count for the modal bins is also returned.

  8. Coefficient of Variation

    The 'variation' function from 'scipy.stats' library calculates the coefficient of variation of a given list. According to the library documentation:

    The coefficient of variation is the standard deviation divided by the mean.

    There are several edge cases that are handled without generating a warning:

    • If both the mean and the standard deviation are zero, nan is returned.

    • If the mean is zero and the standard deviation is nonzero, inf is returned.

    • If the input has length zero (either because the array has zero length, or all the input values are nan and nan_policy is 'omit'), nan is returned.

    • If the input contains inf, nan is returned.

  9. Skewness

    The 'skew' function from 'scipy.stats' library calculates the skewness of a given list. According to the library documentation:

    For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.

    The sample skewness is computed as the Fisher-Pearson coefficient of skewness, i.e.

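    $$g_1 = \frac{m_3}{m_2^{3/2}}$$

    where

    $$m_i = \frac{1}{N} \sum_{n=1}^{N} (x[n] - \bar{x})^i$$

    is the biased sample ith central moment, and $\bar{x}$ is the sample mean. If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.

    $$G_1 = \frac{k_3}{k_2^{3/2}} = \frac{\sqrt{N(N-1)}}{N-2} \cdot \frac{m_3}{m_2^{3/2}}$$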
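To make the nine statistics concrete, here is a standard-library-only sketch that mirrors the definitions quoted above. It is not ALFlowLyzer's code: the mode, coefficient of variation, and skewness are re-implemented by hand instead of calling scipy.stats, all helper names are hypothetical, and returning 0.0 for the skewness of zero-variance data is an assumption chosen to match the zeros in the sample output.

```python
import statistics
from collections import Counter


def smallest_mode(data):
    # Most common value; ties are resolved by taking the smallest one.
    counts = Counter(data)
    top = max(counts.values())
    return min(value for value, count in counts.items() if count == top)


def coefficient_of_variation(data):
    # Population standard deviation divided by the mean.
    mean = statistics.mean(data)
    std = statistics.pstdev(data)
    if mean == 0:
        return float("nan") if std == 0 else float("inf")
    return std / mean


def skewness(data):
    # Biased Fisher-Pearson coefficient: m3 / m2 ** 1.5.
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    if m2 == 0:
        return 0.0  # assumption: treat zero-variance data as unskewed
    return m3 / m2 ** 1.5


def summarize(data):
    return {
        "min": min(data),
        "max": max(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "variance": statistics.pvariance(data),
        "standard_deviation": statistics.pstdev(data),
        "mode": smallest_mode(data),
        "coefficient_of_variation": coefficient_of_variation(data),
        "skewness": skewness(data),
    }


# Packet lengths of the two-packet DNS flow in the sample output.
summary = summarize([102, 118])
```

For the packet lengths 102 and 118 this gives a mean of 110, a variance of 64, a standard deviation of 8, and a coefficient of variation of about 0.0727, the same values as the DNS row in the Output section.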


Output

flow_id timestamp src_ip src_port dst_ip dst_port protocol duration packets_numbers receiving_packets_numbers sending_packets_numbers handshake_duration delta_start success_packets_numbers success_packets_rate total_bytes receiving_bytes sending_bytes packets_rate receiving_packets_rate sending_packets_rate packets_len_rate receiving_packets_len_rate sending_packets_len_rate min_packets_len max_packets_len mean_packets_len median_packets_len mode_packets_len standard_deviation_packets_len variance_packets_len coefficient_of_variation_packets_len skewness_packets_len min_receiving_packets_len max_receiving_packets_len mean_receiving_packets_len median_receiving_packets_len mode_receiving_packets_len standard_deviation_receiving_packets_len variance_receiving_packets_len coefficient_of_variation_receiving_packets_len skewness_receiving_packets_len min_sending_packets_len max_sending_packets_len mean_sending_packets_len median_sending_packets_len mode_sending_packets_len standard_deviation_sending_packets_len variance_sending_packets_len coefficient_of_variation_sending_packets_len skewness_sending_packets_len min_receiving_packets_delta_len max_receiving_packets_delta_len mean_receiving_packets_delta_len median_receiving_packets_delta_len standard_deviation_receiving_packets_delta_len variance_receiving_packets_delta_len mode_receiving_packets_delta_len coefficient_of_variation_receiving_packets_delta_len skewness_receiving_packets_delta_len min_sending_packets_delta_len max_sending_packets_delta_len mean_sending_packets_delta_len median_sending_packets_delta_len standard_deviation_sending_packets_delta_len variance_sending_packets_delta_len mode_sending_packets_delta_len coefficient_of_variation_sending_packets_delta_len skewness_sending_packets_delta_len max_receiving_packets_delta_time mean_receiving_packets_delta_time median_receiving_packets_delta_time standard_deviation_receiving_packets_delta_time variance_receiving_packets_delta_time 
mode_receiving_packets_delta_time coefficient_of_variation_receiving_packets_delta_time skewness_sreceiving_packets_delta_time min_sending_packets_delta_time max_sending_packets_delta_time mean_sending_packets_delta_time median_sending_packets_delta_time standard_deviation_sending_packets_delta_time variance_sending_packets_delta_time mode_sending_packets_delta_time coefficient_of_variation_sending_packets_delta_time skewness_sending_packets_delta_time domain_name top_level_domain second_level_domain domain_name_length subdomain_name_length uni_gram_domain_name bi_gram_domain_name tri_gram_domain_name numerical_percentage character_distribution character_entropy max_continuous_numeric_len max_continuous_aphabet_len max_continuous_consonants_len max_continuous_same_alphabet_len vowels_consonant_ratio conv_freq_vowels_consonants distinct_ttl_values ttl_values_min ttl_values_max ttl_values_mean ttl_values_mode ttl_values_variance ttl_values_standard_deviation ttl_values_median ttl_values_skewness ttl_values_coefficient_of_variation distinct_A_records distinct_NS_records average_authority_resource_records average_additional_resource_records average_answer_resource_records query_resource_record_type ans_resource_record_type query_resource_record_class ans_resource_record_class
2022-04-15 01:00:59_192.168.116.100_42206_109.206.255.42_443 4/15/2022 1:00 192.168.116.100 42206 109.206.255.42 443 HTTPS 187.146098 457 163 294 0.002181 0.000112 0 0 368700 15074 353626 2.441942444 0.870978112 1.570983276 1970.11855411487 80.5467733413584 1889.5936464734 66 1517 806.7833698 1090 1514 696.4427299 485032.476 0.863233869 -0.02915255 66 850 92.47852761 66 66 86.10055025 7413.304754 0.931032884 5.745042137 66 1517 1202.809524 1514 1514 556.879545 310114.8277 0.462982321 -1.372465268 -784 692 -0.049382716 0 115.3815358 13312.8988 0 -2336.476099 -0.574504437 -1283 1398 -0.027303754 0 366.8574496 134584.3883 0 -13436.15409 0.361252195 45.05993915 1.155221727 0.00019002 6.983534339 48.76975187 0.00011301 6.045189573 6.106573202 0 45.05982494 0.638716519 0.000112057 5.22443504 27.29472149 5.6982E-05 8.17958341 8.356739983 not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow
2022-04-15 01:05:39_192.168.116.100_58528_192.168.91.24_80 4/15/2022 1:05 192.168.116.100 58528 192.168.91.24 80 HTTP 7.875475 1050 281 769 0.001688 0.000106 13 1.650694085 1198900 36130 1162770 133.3252915 35.68119399 97.66550239 152232.087588367 4587.76348371369 147675.57374355 66 10202 1141.809524 1514 1514 810.3565446 656677.7294 0.709712546 1.938364541 66 1428 128.5765125 66 66 240.7908131 57980.21568 1.872743385 3.891267082 66 10202 1512.054616 1514 1514 602.6786735 363221.5835 0.398582609 6.298889431 -1362 1362 -0.028571429 0 340.8741063 116195.1563 0 -11930.59372 -0.066420757 -8688 8688 -0.010416667 0 701.4665366 492055.302 0 -67340.78751 -0.555146197 2.694911957 0.028126061 0.000115871 0.202381878 0.040958425 0.00011301 7.19552868 11.15603825 0 2.695833921 0.010252362 6.98566E-05 0.12326028 0.015193097 6.69956E-05 12.02262252 18.48471894 not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow
2022-04-15 01:00:11_192.168.116.100_56471_192.168.92.11_53 4/15/2022 1:00 192.168.116.100 56471 192.168.92.11 53 DNS 0.002526 2 1 1 not a tcp connection not a tcp connection 0 0 220 102 118 791.7656374 0 0 87094.2201108471 0 0 102 118 110 110 102 8 64 0.072727273 0 102 102 102 102 102 0 0 0 0 118 118 118 118 118 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 content-autofill.googleapis.com. .com .googleapis.com 32 16 ['c', 'o', 'n', 't', 'e', 'n', 't', '-', 'a', 'u', 't', 'o', 'f', 'i', 'l', 'l', '.', 'g', 'o', 'o', 'g', 'l', 'e', 'a', 'p', 'i', 's', '.', 'c', 'o', 'm', '.'] ['co', 'on', 'nt', 'te', 'en', 'nt', 't-', '-a', 'au', 'ut', 'to', 'of', 'fi', 'il', 'll', 'l.', '.g', 'go', 'oo', 'og', 'gl', 'le', 'ea', 'ap', 'pi', 'is', 's.', '.c', 'co', 'om', 'm.'] ['con', 'ont', 'nte', 'ten', 'ent', 'nt-', 't-a', '-au', 'aut', 'uto', 'tof', 'ofi', 'fil', 'ill', 'll.', 'l.g', '.go', 'goo', 'oog', 'ogl', 'gle', 'lea', 'eap', 'api', 'pis', 'is.', 's.c', '.co', 'com', 'om.'] 0 {'m': 1, 's': 1, 'p': 1, '.': 3, 'g': 2, 'l': 3, 'o': 5, '-': 1, 't': 3, 'i': 2, 'a': 2, 'f': 1, 'n': 2, 'c': 2, 'e': 2, 'u': 1} 3.81642803184602 0 10 2 2 0.75 0.53125 2 0 415 207.5 0 43056.25 207.5 207.5 0 1 1 0 0 0 0 [1, 1] [0, 1] [1, 1] [0, 1]
2022-04-15 01:01:40_192.168.116.100_43244_192.168.119.112_22 4/15/2022 1:01 192.168.116.100 43244 192.168.119.112 22 Others 6.917505 23283 7452 15831 0.00093 0.000222 0 0 24671240 501761 24169479 3365.808915 1077.302684 2288.839353 3566493.98880087 72537.3687561404 3494413.15581659 66 36266 1059.624619 1514 1514 765.659792 586234.917 0.722576447 6.898108476 66 1578 67.33239399 66 66 20.63231412 425.6923859 0.306424782 60.05231415 66 36266 1526.718401 1514 1514 424.6387625 180318.0786 0.278138236 60.85177157 -1512 1512 -0.001073681 0 29.21008551 853.2290956 0 -27205.54339 -0.001048411 -31856 34752 -0.00050537 0 567.7554589 322346.2611 0 -1123446.114 3.715976314 4.359194994 0.000928369 0.000169992 0.051381806 0.00264009 0.000170946 55.34633046 82.30143017 0 4.317461967 0.00043693 6.19888E-05 0.034980458 0.001223632 5.6982E-05 80.05959085 119.424366 not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow not a dns flow
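The output file can be post-processed with Python's csv module. The sketch below assumes a comma-separated file whose header row uses the column names shown above; the `read_flows` and `dns_flows` helpers are hypothetical, and the in-memory sample contains only a handful of the real columns.

```python
import csv
import io

# Placeholder strings that appear in the sample output for non-applicable cells.
NOT_APPLICABLE = {"not a dns flow", "not a tcp connection"}


def read_flows(handle):
    # Read every row as a dict and turn placeholder cells into None.
    rows = []
    for row in csv.DictReader(handle):
        rows.append({key: (None if value in NOT_APPLICABLE else value)
                     for key, value in row.items()})
    return rows


def dns_flows(rows):
    # Keep only the flows labelled as DNS in the protocol column.
    return [row for row in rows if row["protocol"] == "DNS"]


sample = io.StringIO(
    "flow_id,protocol,duration,handshake_duration,domain_name\n"
    "a,HTTPS,187.146098,0.002181,not a dns flow\n"
    "b,DNS,0.002526,not a tcp connection,content-autofill.googleapis.com.\n"
)
rows = read_flows(sample)
dns_only = dns_flows(rows)
```

With a real output file, pass an open file handle, for example `open("output-of-my_pcap_file.csv", newline="")`, instead of the in-memory sample.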

Copyright (c) 2024

To cite ALFlowLyzer in your work and to understand it fully, please refer to the published papers below:

Contributing

Any contribution is welcome in the form of pull requests.

Project Team members

Acknowledgement

This project has been made possible through funding from the Natural Sciences and Engineering Research Council of Canada (NSERC, #RGPIN-2020-04701) and the Canada Research Chair (Tier II, #CRC-2021-00340) to Arash Habibi Lashkari, and from the Mitacs Global Research Internship (MGRI) program for summer students.