NoahSewBSU / CS321-002-f24-team-02

This is for CS321 Data Structures. The objective is to create a program using BTrees to sort large amount of objects and to gain more experience working with a team of programmers.
0 stars 0 forks source link

CS 321 OpenSSH Server Log Project

Table of contents:


Learning Objectives

Starter Code

This repository contains:

:heavy_exclamation_mark: NOTE: Do NOT fork this repository, because the forked repository cannot have its own GitHub issues, which will be used as Scrum tasks.

:heavy_exclamation_mark: NOTE: Do NOT modify the package structure in the src/ folder, otherwise the project may not build correctly using gradle.

Ensure that we have the correct JDK version

Use the following command to check our installed JDK version:

$ javac -version

This project does not work with JDK 24 (to be released in March 2025).

It is recommended to use any of the following versions:

:book: See this wiki page for additional details regarding the supported Java versions and links to download the correct JDK versions.

One-Time Team Setup

One team member should:

$ git clone https://github.com/BoiseState/CS321_Cybersecurity.git
$ cd CS321_Cybersecurity
$ ./gradlew tasks

The last command should perform a one-time gradle setup, followed by listing all the available gradle tasks and their descriptions.

NOTE: On Windows, the ./gradlew command should be replaced with gradlew (which will call the gradlew.bat file)

The same team member should push the cloned repository to the new private repository. This can be done by changing the remote URL of the cloned repository to the new private repository's URL.

$ git remote set-url origin NEW_URL_OF_YOUR_NEW_PRIVATE_REPOSITORY
$ git remote -v
$ git push

The other team members should then clone the newly created student repository containing the starter code.

Compile and Run the Project from the Command Line

Gradle allows running unit tests and code from IDEs, or the command line, as described below.

Run all the JUnit tests and print a summary of the results:

$ ./gradlew test

Note that if you initially run the tests, they will fail as you have not yet finished the project!

Run the main method from SSHCreateBTree.java and pass the appropriate <arguments>:

$ ./gradlew createJarSSHCreateBTree
$ java -jar build/libs/SSHCreateBTree.jar <arguments>

Run the main method from SSHSearchBTree.java and pass the appropriate <arguments>:

$ ./gradlew createJarSSHSearchBTree
$ java -jar build/libs/SSHSearchBTree.jar <arguments>

Run the main method from SSHSearchDatabase.java and pass the appropriate <arguments>:

$ ./gradlew createJarSSHSearchDatabase
$ java -jar build/libs/SSHSearchDatabase.jar <arguments>

Run the project from an IDE: IntelliJ IDEA, VSCode or Eclipse

Note that the following wiki pages may refer for GeneBankCreateBTree --> simply replace that with SSHCreateBtree. In general, replace GeneBank with SSH.

Eclipse

This repository is an Eclipse project, and can be directly opened in Eclipse.

:book: See this wiki page for additional instructions to run this project in Eclipse.

IntelliJ IDEA

This project can be opened with IntelliJ IDEA.

:bulb: HINT: As a student, you can get IntelliJ IDEA for free (using an academic license) by signing up with your Boise State email.

:book: See this wiki page for additional instructions to run this project in IntelliJ IDEA.

VSCode

Alternatively, this project can be opened with VSCode.

:book: See this wiki page for detailed instructions to run this project in VSCode.

Notes for creating additional files and tests, while keeping the Gradle project structure

We can add as many classes as we want in src/main/java, and gradle should build them automatically. In other words, we should not have to make any changes to the build.gradle.

Also, we can add new test files with new tests cases in src/test/java and those will be run automatically by gradle or our IDE.


Scrum Process

The focus of this project is to learn about data structures, while working effectively in a group. In addition, given the small project scope, and the fixed set of requirements that are already defined (and will not need to be elicited with the use of a Product Owner), the team can customize the Scrum process learned in CS-HU 271 (or CS 208) and focus exclusively on:

Scrum Board

Creating the tasks upfront will allow dividing and assigning the work in order to provide transparency and accountability within the team.

Use the Projects tab (i.e., a simplified version of ZenHub) to configure our own team Scrum board, based on this project example (feel free to copy the contents of these tasks to your Scrum board).

Your Scrum board should contain the following columns (pipelines):

Column Name Description
Product Backlog All (unassigned) tasks that are going to be completed by the team throughout the duration of the project
Sprint Backlog Tasks proposed to be implemented in the current week (sprint), assigned to developers
In Progress Tasks currently being worked on
Review/QA Tasks ready to be reviewed by another team member
Closed Completed tasks, whose corresponding code is integrated in the master branch

Tasks should be assigned to the developer working on them. Each team member should add to the project log file, Project-Log.md, the tasks (e.g., https://github.com/StudentUserNameHostingRepo/CS321_Cybersecurity/issues/123) completed that week, as described in the progress reports.

Here is an example of a valid task written in engineering language that is assigned to a developer. This task should be referenced by a commit containing a message similar to Implements task #3.

As a warm-up exercise, each team member should create a task similar to task #2 and then edit the README-submission.md file with their information.

Here is a sample project log from a team from a previous semester: Project-Log-sample.md


Project Requirements

Table of contents:


1. Introduction

In this Cybersecurity assignment, we will analyze server log files to better understand patterns in the network traffic. For example, this can help us determine what locations do attacks most often originate from, what times are the most common for attacks, among other possible insights. We will look for and analyze patterns, activities, and operations within a server's log and then store the corresponding data using multiple BTrees (separate BTrees for different patterns).

The amount of data that we have can be very large and any data structure is not likely to fit in memory. Hence, BTrees are a good choice for the task at hand. We will then store our findings from the various BTrees into a single SQL database and analyze its results for patterns and specific activities.

2. Background

If you are not familiar with internet, IP addresses, servers and clients, please watch this short video[7m] to give you sufficient background for this project. If you are not familiar with using SSH (Secure Shell), watch this short video[2m] to see how it is useful.

A SSH client connects to a SSH server software running on a remote system to get access to that system. The SSH server tracks all access by writing the relevant information to a log file.

Log files are text files that store events, messages, and processes within a system and/or network. They log information from users and are used to monitor server environments. They can detect when a network has become compromised or is under attack. For example, security logs check successful and unsuccessful logins, application log failures, and other various activity within a server. Secure Shell (SSH) is one of the most common ways of accessing remote servers. Therefore, in this project we will use log files from a SSH server. Below is an example of a Secure Shell log file that is tracking activity within the server:

Dec 12 18:46:17 LabSZ sshd[31166]: Accepted password for suyuxin from 218.18.43.243 port 9480 ssh2
Dec 12 18:58:24 LabSZ sshd[31243]: Invalid user zouzhi from 115.71.16.143
Dec 12 18:58:26 LabSZ sshd[31243]: Failed password for invalid user zouzhi from 115.71.16.143 port 38790 ssh2
Dec 12 18:59:41 LabSZ sshd[31246]: Invalid user admin from 223.155.239.92
Dec 12 18:59:43 LabSZ sshd[31246]: Failed password for invalid user admin from 223.155.239.92 port 55156 ssh2
Dec 12 18:59:45 LabSZ sshd[31246]: Failed password for invalid user admin from 223.155.239.92 port 55156 ssh2
Dec 12 18:59:47 LabSZ sshd[31246]: Failed password for invalid user admin from 223.155.239.92 port 55156 ssh2
Dec 12 18:59:48 LabSZ sshd[31246]: Failed password for invalid user admin from 223.155.239.92 port 55156 ssh2
Dec 12 18:59:53 LabSZ sshd[31246]: Failed password for invalid user admin from 223.155.239.92 port 55156 ssh2
Dec 12 19:16:22 LabSZ sshd[31264]: Invalid user support from 103.79.141.173
Dec 12 19:20:39 LabSZ sshd[31269]: Accepted password for curi from 14.17.22.31 port 5154 ssh2
Dec 12 19:20:39 LabSZ sshd[31275]: Accepted password for curi from 14.17.22.31 port 5158 ssh2
Dec 12 19:23:51 LabSZ sshd[31474]: Invalid user xiawei from 115.71.16.143
Dec 12 19:23:53 LabSZ sshd[31474]: Failed password for invalid user xiawei from 115.71.16.143 port 54132 ssh2
Dec 12 19:31:09 LabSZ sshd[31479]: reverse mapping checking getaddrinfo for 190-174-14-217.speedy.com.ar 
                                   [190.174.14.217] failed - POSSIBLE BREAK-IN ATTEMPT!
Dec 12 21:02:41 LabSZ sshd[31596]: Address 123.16.30.186 maps to static.vnpt.vn, but this does not map back 
                                   to the address - POSSIBLE BREAK-IN ATTEMPT!

Each line shows a date and timestamp of an activity, the name of the server running the OpenSSH Daemon (sshd) (with its process id), and the type of action followed by the user's name, IP address, port number and SSH protocol version.

Note that the above example shows five types of actions denoted by the starting phrases: Accepted, Failed, Invalid, reverse, or Address. Note that the last two types of entries are the same for our analysis as they are both the same type of break-in attempts. The full log file has a few other types of entries but those are not useful for our analysis.

With a quick scan of the above sample, we can see that there are multiple occurrences of failed passwords, accepted passwords, and invalid users. Upon closer look we can see that the block of failed passwords happened within seconds, indicating that possibly a non-human entity (automated script or an AI agent) was hitting the server multiple times!

The above log file sample is part of a larger raw SSH log file that forms the basis for our SSH log file parsing and analysis. The raw file has 653781 lines, which barely covers one month of logging at one server. Imagine how much larger this data would be if we were looking at data for one year for a few hundred servers! You may download this file and poke around in it.

There are many types of log files but the one we will focus on for this project comes from Zenodo-Loghub dataset.

To get a feel for log file analysis, please watch the following helpful video that shows how to use Linux command line tools for analyzing log files for specific patterns and attacks: Analyzing Log Files For Attacks[15m] (yes, this is a time-warped advertisement for CS153 :-))

However, the approach shown in video will be slow if we had hundreds of large log files. Writing a custom program that not only processes the log files but converts them into a data structure (and database) allows us to do faster analysis.

3. Specifications

3.1. Input Files and Data Wrangling

In data science, the process of data wrangling, also known as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

The raw SSH log file (70MB) contains lots of important data pertaining to the details of an activity. Note that this file is hosted on Google Drive as it is larger than files allowed on GitHub. Since we only need a select few keywords to use within our BTree keys, it is easiest to strip the file to only the necessary items. Below is the stripped version of the snippet from the log file (shown in the previous section) that will be appropriate for creating our BTrees.

12/12 18:46:17 Accepted suyuxin 218.18.43.243
12/12 18:58:24 Invalid zouzhi 115.71.16.143
12/12 18:58:26 Failed zouzhi 115.71.16.143 
12/12 18:59:41 Invalid admin 223.155.239.92
12/12 18:59:43 Failed admin 223.155.239.92
12/12 18:59:45 Failed admin 223.155.239.92
12/12 18:59:47 Failed admin 223.155.239.92
12/12 18:59:48 Failed admin 223.155.239.92
12/12 18:59:53 Failed admin 223.155.239.92
12/12 19:16:22 Invalid support 103.79.141.173
12/12 19:20:39 Accepted curi 14.17.22.31
12/12 19:20:39 Accepted curi 14.17.22.31 
12/12 19:23:51 Invalid xiawei 115.71.16.143
12/12 19:23:53 Failed xiawei 115.71.16.143 
12/12 19:31:09 reverse 190.174.14.217
12/12 21:02:41 Address 123.16.30.186

Many keywords like LabSZ, for, Dec, password, and sshd[xxxx]: were removed, leaving only the necessary information for creating our BTrees.

Once the raw SSH text file has been wrangled, the file should have the following amounts of types.

Type of Activity Line count
Accepted 181
Address 497
Failed 160,572
Invalid 14,165
reverse 18,909
Total: 194,324

Note that we have already done this for you and provided the wrangled log file (data/SSH_Files/SSH_log.txt) for you to use. However, it is important for you to review the process we used to simplify the raw log file into the simplified version we use to build the BTrees for a few reasons.

3.1.1 Demo

Here is a demo that provides a walk-through of using a text editor to wrangle and filter through certain keywords, phrases, and terms.

The conversion process uses regular expressions, which are a powerful technique for pattern matching in text files. Regular expressions are widely available in most programming languages as well as in the Bash shell.

3.1.2 Data Wrangling Program (Extra Credit)

While using a text editor to wrangle the data is feasible for one file, it would be infeasible if we have to do that for hundreds or thousands of log files. So for extra credit for this project, please write a Java program named SSHDataWrangler.java to wrangle the raw SSH data file into a useful form as described above. We will use regular expressions built into Java to accomplish the task.

See this tutorial for how to use regular expressions in Java.

Note that we have included the final wrangled file (under the folder data/SSH_Files), so you can diff your output with the final form to verify the correctness.

Please note that this part is extra credit so you can skip it and simply use the simplified SSH log file that we have provided as part of the repository. Including the wrangled log file also allows team members to proceed with other parts of the project sooner even if you do want to attempt the extra credit!

3.2. The Main Problem

Now that we have data wrangling out of the way, the motivation for the main problem is to analyze the frequency of certain activities and patterns within the log files; whether that be most commonly accepted passwords at certain IPs, commonly failed passwords at specific times of the day, or what common user's IPs are.

For the given log file, we want to convert its different activity types into various BTrees to better analyze specific types of operations. We will make a total of nine BTrees, each with a certain category of extracted data from the log file.

The following are the types of BTrees that will be created (the key value will be a combination of the two fields that are used in each BTree):

See below for a visual of the nine types of BTrees.

CyberSec-BTrees.jpg

Basically, each BTree has the key values along with their frequencies (the number of duplicates). Once we have a BTree for each type of activity, we will then search the BTrees for the top frequencies within each category and display those key values along with the frequencies. We will also traverse each tree and put the results into a SQL database to make it easier for an analyst to examine the data. We will also add a feature to dump each BTree into a dumpfile for testing purposes.

4. Design Issues

4.1. Memory

We can represent the key from a SSH log line (from the stripped log file) as a string of 32 characters long (which would require 64 bytes). No value should go over 32 characters but if it does, then we simply truncate the sequence to just the first 32 characters. This would form the key value.

Note that we always want to write a fixed size array on disk so that size of each TreeObject is known ahead of time. In order to do that, we can keep the key values to be their natural size but pad them with null characters when we store to the disk (and then remove the padding when we read the key values back from the disk).

4.2. Key Values

Note that the key values will be the two variables of that BTree's type concatenated with a dash (-)

Examples:

4.3. Class Design

We will need a BTree class as well as a BTreeNode class. The BTreeNode class may be an inner class in the main BTree class. The objects that we store in the BTree will be similar to the objects we stored in the previous Hashtable assignment. You should use the provided class TreeObject to represent the objects using the Comparable interface.

4.4 Priority Queues

In order to find the top k keys (by frequency), we will need a priority queue in the search programs. This is because the BTree is sorted by the key value and not by their frequencies. We created our own priority queue in Project 2, but for this project we will use the Priority Queue class available from the standard library in Java.

The Priority Queue will have the frequency as its primary key and the BTree key as its secondary key (so the compareTo will use frequency first and then the BTree key next to ensure stable sorting). This allows us to compare output against the expected output.

5. Implementation

We will create three (or four, if doing the extra credit part) programs:

5.1. Program Arguments

The required arguments for the four programs are shown below:

java -jar build/libs/SSHDataWrangler.jar --rawSshFile=<raw-ssh-file> \
          --sshFile=<wrangled-ssh-file>

java -jar build/libs/SSHCreateBTree.jar --cache=<0/1> --degree=<btree-degree> \
          --sshFile=<ssh-File> --type=<tree-type> [--cache-size=<n>] \
          --database=<yes/no> [--debug=<0|1>]

java -jar build/libs/SSHSearchBTree.jar --cache=<0/1> --degree=<btree-degree> \
          --btree-file=<btree-filename> --query-file=<query-fileaname> \
          [--top-frequency=<10/25/50>] [--cache-size=<n>]  [--debug=<0|1>]

java -jar build/libs/SSHSearchDatabase.jar --type=<tree-type> \
          --database=<sqlite-database-path> --top-frequency=<10/25/50>

Note that the arguments can be provided in any order. The backslash represents that the command is displayed on multiple lines but we would type it in one line (without the backslash).

<SSH Key> <frequency>

The following shows a segment of the dumpfile dumps/dump-user-ip.0.txt (it has a total of 5294 lines).

huangqx-115.71.16.143 20
huangt-183.136.169.234 2
huangzq-119.7.221.129 10
huawei-119.7.221.129 6
huawei-125.65.165.70 2
huawei-202.107.207.123 2
hub-87.120.255.161 2
huf-183.136.178.26 2
...

The following displays what each main driver must output:

Class Output
SSHCreateBTree - SSH_log.txt.ssh.btree.<type>.<degree> file (in the current folder)
- appropriately named table in SSHLogDB.db (database is in current folder)
- dump-<type>.<degree>.txt (in the current folder) if <debug>=1
SSHSearchBTree - output for the queries to standard out
SSHSearchDatabase - top 10/25/50 <SSH Key> <frequency> to standard out

5.2. Example Demonstration of Project

5.2.1. (Optional: Extra Credit) Wrangle the raw data file

Please download the raw SSH log file (70MB) from the Google Drive. Then wrangle it with your program as follows:

java -jar build/libs/SSHDataWrangler.jar --rawSshFile=SSH_log_raw.txt \
          --sshFile=SSH_log.txt

This program assumes that the raw file and output file are both in the top level of the project folder. Then compare the output file with the wrangled file that we have provided, as follows:

diff SSH_log.txt data/SSH_Files/SSH_log.txt

If it matches, there is no output (good news is no news).

5.2.2. Create one BTree using SSHCreateBTree

Using the following command:

java -jar build/libs/SSHCreateBTree.jar --cache=1 --degree=0 --sshFile=SSH_log.txt \
          --type=accepted-time --database=yes --cache-size=2000 --debug=1

Assumes that the wrangled log file is in the data/SSH_Files folder.

Outputs:

See below for a snippet of the dump file dump-accepted-ip.0.txt:

Key Frequency
Accepted-137.189.206.152 1
Accepted-137.189.206.243 1
Accepted-137.189.207.18 1
Accepted-137.189.207.28 1
Accepted-137.189.240.159 1
Accepted-137.189.241.19 2
Accepted-137.189.241.248 11
Accepted-137.189.88.215 12
Accepted-137.189.90.232 10
Accepted-14.17.22.31 4
Accepted-183.11.36.159 2
Accepted-183.11.39.245 1
Accepted-183.11.69.89 2
Accepted-183.62.156.108 5
Accepted-203.218.70.64 2
Accepted-218.17.80.182 7
Accepted-218.18.42.17 5
Accepted-218.18.43.243 1
Accepted-222.125.40.98 5
Accepted-222.240.177.42 7
Accepted-61.187.54.9 12
Accepted-67.160.100.130 1
Accepted-73.239.249.49 1

5.2.3. Search BTree in SSHSearchBTree

With arguments of:

java -jar build/libs/SSHSearchBTree.jar --cache=0 --degree=0 \    
          --btree-file=SSH_log.txt.ssh.btree.accepted-time.0 \
          --query-file=QUERY-accepted-time.txt \
          --top-frequency=25 --cache-size=10000 --debug=0

Assumes that the query files are in data/queries folder.

Outputs:

Key Frequency
Accepted-111.222.107.90 25
Accepted-119.137.63.195 14
Accepted-137.189.88.215 12
Accepted-61.187.54.9 12
Accepted-137.189.241.248 11
Accepted-137.189.90.232 10
Accepted-119.137.62.123 9
Accepted-218.17.80.182 7
Accepted-222.240.177.42 7
Accepted-113.116.236.34 6
Accepted-123.255.103.142 5
Accepted-123.255.103.215 5
Accepted-183.62.156.108 5
Accepted-218.18.42.17 5
Accepted-222.125.40.98 5
Accepted-14.17.22.31 4
Accepted-112.96.173.55 3
Accepted-112.96.33.40 3
Accepted-137.189.204.253 3
Accepted-113.118.187.34 2
Accepted-113.99.127.215 2
Accepted-137.189.205.44 2
Accepted-137.189.241.19 2
Accepted-183.11.36.159 2
Accepted-183.11.69.89 2

5.2.4. Search BTree Database in SSHSearchDatabase

With arguments of:

java -jar build/libs/SSHSearchDatabase.jar --database=SSHLogDB.db \
          --type=accepted-time --top-frequency=25 

Outputs to standard output stream:

Key Frequency
Accepted-14:20 5
Accepted-18:46 4
Accepted-00:02 3
Accepted-15:45 3
Accepted-09:17 2
Accepted-09:21 2
Accepted-09:22 2
Accepted-10:39 2
Accepted-11:40 2
Accepted-12:53 2
Accepted-14:42 2
Accepted-15:16 2
Accepted-16:14 2
Accepted-17:00 2
Accepted-18:03 2
Accepted-19:11 2
Accepted-19:20 2
Accepted-19:22 2
Accepted-19:26 2
Accepted-20:15 2
Accepted-00:03 1
Accepted-00:11 1
Accepted-00:17 1
Accepted-00:22 1
Accepted-00:43 1

5.3. Additional Implementation Remarks

5.3.1. Your programs should always keep the root node in the memory

Write the root node to disk file only at the end of the program and read it in when the program starts up. In addition, our program can only hold a few nodes in memory. In other words, we can only use a few BTreeNode variables (including the root) in our program (e.g., root, parent, current, child, temporary node for split). However, if the cache is enabled, we can store <cache-size> BTreeNode objects in the cache.

5.3.2. Metadata storage

We need to store some metadata about the BTree on disk. For example, we can store the degree of the tree, the byte offset of the root node (so we can find it), the number of nodes, and other information. This information should be stored at the beginning of the BTree file. We read the metadata when we open the BTree file, and we write it back (as it may have changed) when we close the BTree file at the end of the program.

5.3.3. Layout of the BTree in the binary file

The BTree is stored as a binary file on disk. This is the most efficient and compact way to store the BTree data structure, so it persists beyond the program. While it is possible to store the BTree as a text file, it will lead to severe slowdown in the runtime.

The BTree data file will have an initial metadata section. The metadata section should contain at least the byte offset of the root node. It may also optionally contain the degree of the BTree and the number of nodes. After the metadata, the rest of the file consists of BTreeNodes laid out one after the other. A new node is added to the end of the file.

5.3.4 Reading and Writing Nodes

6. Using a Cache

We will incorporate the generic Cache class from Project 1 to improve the performance of our BTree implementation. The size of the cache should be a command line argument. An entry in the cache is a BTreeNode. With the cache enabled command line option, the <cache-size> needs to be specified as between 100 and 10000 (inclusive). Using the cache greatly speeds up the execution time especially when searching larger BTrees.

7. Using a Database

Design a simple database to store the results (key values and frequencies) from the BTree. We will perform an inorder tree traversal to get the information to store in the database with the <tree type> as the table's name without the - (to prevent SQL syntax errors). This would be done at the end of creating the SSH BTree. Afterwards, we will create a separate search program named SSHSearchDatabase that uses the database instead of the BTree and the top searched query list outputted by SSHSearchBTree to get the top frequencies of a certain BTree.

$ ./gradlew createJarSSHSearchDatabase
$ java -jar build/libs/SSHSearchDatabase.jar --type=<tree-type> --database=<SQLite-database-path> \
            --top-frequency=<10/25/50>

We will use the embedded SQLite database for this project. The SQLite database is fully contained in a jar file that gradle will automatically pull down for us. See the database example in the section below on how to use SQLite.

7.1 Analyzing the Database

When searching for the top results in the database, we want to analyze the top key frequencies for patterns. The top keys should be printed out with their frequencies as demonstrated by the following segment for top 10 keys in an accepted-time BTree.

Accepted-14:20: 5
Accepted-18:46: 4
Accepted-15:45: 3
Accepted-00:02: 3
Accepted-20:15: 2
Accepted-19:22: 2
Accepted-19:20: 2
Accepted-19:11: 2
Accepted-12:53: 2
Accepted-11:40: 2

We observe from the code segment above that the top two accepted entries occur between 14:20 and 18:46 with all other entries mostly distributed past 11:40. The majority of Accepted occurrences happened between 12:00 to 19:15.

Below represents the table for analyzing a select few BTrees for their top results in their corresponding database and the questions to answer in a file named BTree-Database-Analysis.md that you will add to your project.

Tree Type degree Question:
accepted-ip 100 What is the most common first three IP digits in the top 10 entries? (137 at three times)
invalid-time 100 What is the range (highest-lowest) of the top 25 entries? (55-42=13)
failed-ip 100 What is the first three digits of the top 2 entries of 50 total entries? (183)
reverseaddress-ip 100 Is the top entry reverse or Address type in the top 25 entries? (reverse)
reverseaddress-time 100 Which hour was the most predominant for a reverse break in attempt in the top 25 entries? (11:00)
user-ip 100 Who is the predominant user in the top 25 entries? (root)

8. Useful Examples

The following examples from the class examples repository will be useful for this project.

9. Test Scripts

The wrangled SSH Log file is provided in the folder: data/SSH_Files.

The expected dump files and query results are provided in the folders: results/dump-files, results/btree-search results/db-search

Six test scripts are provided at the top-level of the project (for integration testing). These compare your results to the results files mentioned above. You can use the test scripts to run and compare results using the four test scripts as follows.

./gradlew createJarSSHCreateBTree
./gradlew createJarSSHSearchBTree
./gradlew createJarSSHSearchDatabase

./create-btrees.sh 
./check-dump-files.sh 

./search-btrees.sh 
./check-btree-search.sh 

./search-db.sh
./check-db-search.sh

The create-btrees.sh script creates all nine BTree types from the SSH Log file. As a side effect, it also creates the dump files and the SQL database that contains tables for each of the nine types of key values. The script will move your BTrees, dump files to the appropriate subfolder in the folder output in the repo.

The check-dump-files.sh compares the dump files from our code to the reference dump files.

The search-btrees.sh script searches all nine BTrees from the BTree files for all the key values found in the nine BTrees as well as just the top 25 entries. The script will store your output to the subfolder btree-search under the output folder.

The check-btree-search.sh script compares the results of search queries from searching the BTree to the reference results.

The search-db.sh script searches the database for the top 25 entries for each of the key value types. The script will store your output to the subfolder db-search under the output folder.

The check-db-search.sh script compares the results from the database to the reference results.

The instructors will also use these test scripts for the final testing of your project. Start off by running the tests on your machine.

10. Testing in the Cloud

We will set up Amazon AWS accounts for each student so that you can run larger tests in the cloud. Running our tests on AWS is required, so we can all get some more experience using the cloud. :cloud: :smiley:

Please see the AWS notes for a step-by-step guide on running your project on the AWS cloud.

11. Progress Reports

Each team member will fill out a progress report (via a survey) each week. The link to the survey will be provided by the instructor.

In addition, each team member should log their project-related activities for the week, including the URL to the tasks (e.g., https://github.com/StudentUserNameHostingRepo/CS321_Cybersecurity/issues/123) completed that week, in a separate file named Project-Log.md.

Here is a sample log file: Project-log-sample.md

As a reminder, each commit should link (reference) in the commit message the completed task (e.g., Implements task #123), in order to automatically link the task to the commit, and make the code changes directly available from the task itself.

It is expected that each team should have at least one meeting every week.

Progress reports are confidential.

12. Submission

Before submission, make sure that:

Make sure the instructor and the teaching assistant(s) have access to the repository.