This project aims to create a high-performance multi-threaded network server capable of managing incoming connections, processing text data, and analysing patterns within the data.
I am using the Gutenberg Project (https://www.gutenberg.org) to obtain large text files for this project.
Books are downloaded in plain text format (UTF-8) and saved for testing, e.g. 'Great Expectations', 'The Adventures of Sherlock Holmes' and 'The Wonderful Wizard of Oz'.
To send these text files to the program, I am using the netcat tool (nc). To install the package on a Debian-based Linux distribution, run:
sudo apt-get install netcat
To transmit a text file to the server with netcat, use the following command:
nc localhost <port> -i <delay> -q 0 < <filename>.txt
To compile the source code, run:
gcc -O2 -Wall -pthread server.c -o <output file name>
To start the server, run:
./<output file name> -l <listening port> -p "<search pattern>"
The server is written in C. It listens for incoming connections on the port specified on the command line (the -l option).
See https://www.geeksforgeeks.org/socket-programming-cc/ for the socket implementation tutorial.
A new thread is created for each incoming client connection. This approach allows multiple clients to connect simultaneously.
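The thread-per-connection approach can be sketched roughly as follows. This is an illustration under assumptions, not the code from server.c: client_main, start_client_thread and accept_loop are hypothetical names, and the worker here uses a plain blocking read for brevity.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-client worker: drains the connection until EOF or error. */
void *client_main(void *arg)
{
    int fd = *(int *)arg;
    char buf[4096];

    free(arg);                    /* the thread owns its argument */
    while (read(fd, buf, sizeof buf) > 0)
        ;                         /* a real server would store lines here */
    close(fd);
    return NULL;
}

/* Starts one worker thread for fd; returns 0 on success, -1 on failure. */
int start_client_thread(int fd, pthread_t *tid)
{
    int *arg = malloc(sizeof *arg);

    if (arg == NULL)
        return -1;
    *arg = fd;
    if (pthread_create(tid, NULL, client_main, arg) != 0) {
        free(arg);
        return -1;
    }
    return 0;
}

/* Accept loop: each accepted connection gets its own detached thread. */
void accept_loop(int listen_fd)
{
    for (;;) {
        pthread_t tid;
        int fd = accept(listen_fd, NULL, NULL);

        if (fd < 0) {
            if (errno == EINTR)
                continue;
            break;                /* listening socket closed or fatal error */
        }
        if (start_client_thread(fd, &tid) != 0) {
            close(fd);
            continue;
        }
        pthread_detach(tid);      /* nobody joins client threads */
    }
}
```

Passing the descriptor through a heap-allocated int avoids a race in which the accept loop overwrites a stack variable before the new thread has read it.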
Each thread performs non-blocking reads on its socket and stores the received data in a global shared list.
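A non-blocking read on a descriptor might look like the sketch below. The helper names set_nonblocking and read_available, and the -2 return value for "no data yet", are assumptions made for illustration only.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Puts fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Attempts a single read. Returns the number of bytes read (> 0),
 * 0 at end of stream, -1 on a real error, or -2 when no data is
 * available yet (EAGAIN / EWOULDBLOCK).
 */
ssize_t read_available(int fd, char *buf, size_t cap)
{
    ssize_t n;

    do {
        n = read(fd, buf, cap);
    } while (n == -1 && errno == EINTR);   /* retry if interrupted */

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

A caller would typically wait with select or poll when -2 is returned, instead of spinning on the descriptor.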
The shared list stores and links every line read across all threads, preserving the order in which data arrived and was processed.
A pthread_mutex has been implemented to avoid race conditions across concurrent client threads when writing to the list.
A book_next pointer is added to each node on the shared list. It links the lines of each book in the correct order.
After adding each line, the server checks whether it contains the specified search pattern. If a match is found, the program updates the count of lines that contain the search pattern and updates the next_frequent_search pointer so that these lines can be traversed.
Access to the shared list is also guarded by a pthread_mutex, so that at any given time only one thread, whether an analysis thread reading or a client thread writing, works on the list.
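The shared list described above could be laid out roughly as in this sketch. Only book_next and next_frequent_search are names from this project; struct node, struct shared_list, struct book, list_init and list_append are hypothetical, as is the choice to keep a per-book match counter and to use strstr for matching.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct node {
    char *line;
    struct node *next;                 /* arrival order across all clients */
    struct node *book_next;            /* next line of the same book */
    struct node *next_frequent_search; /* next line containing the pattern */
};

struct shared_list {
    pthread_mutex_t lock;
    struct node *head, *tail;             /* every line */
    struct node *match_head, *match_tail; /* lines that matched */
};

struct book {                          /* per-connection state */
    struct node *head, *tail;
    size_t matches;                    /* lines of this book that matched */
};

void list_init(struct shared_list *l)
{
    memset(l, 0, sizeof *l);
    pthread_mutex_init(&l->lock, NULL);
}

/* Copies line, links it into all chains under the lock; 0 on success. */
int list_append(struct shared_list *l, struct book *b,
                const char *line, const char *pattern)
{
    size_t len = strlen(line) + 1;
    struct node *n = calloc(1, sizeof *n);

    if (n == NULL)
        return -1;
    n->line = malloc(len);
    if (n->line == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->line, line, len);

    pthread_mutex_lock(&l->lock);
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
    if (b->tail) b->tail->book_next = n; else b->head = n;
    b->tail = n;
    if (strstr(line, pattern) != NULL) {
        if (l->match_tail) l->match_tail->next_frequent_search = n;
        else l->match_head = n;
        l->match_tail = n;
        b->matches++;
    }
    pthread_mutex_unlock(&l->lock);
    return 0;
}
```

The line is copied before the lock is taken, so the critical section contains only pointer updates and the pattern check.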
The pattern frequency analysis is handled by multiple concurrent threads that output the analysis results at regular incremental intervals, i.e. every 2 seconds (first thread), 4 seconds (second thread), and so on.
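The staggered intervals could be derived from a thread index, as in the sketch below. The zero-based index, the struct analysis_ctx type, the rounds limit and the report callback are assumptions for illustration and may differ from the actual implementation.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical per-thread settings for one analysis thread. */
struct analysis_ctx {
    unsigned index;                 /* 0 for the first thread, 1 for the second, ... */
    unsigned rounds;                /* reports to produce; 0 means run forever */
    void (*report)(unsigned index); /* called once per interval */
};

/* First thread: 2 seconds, second thread: 4 seconds, and so on. */
unsigned analysis_interval(unsigned index)
{
    return 2u * (index + 1u);
}

/* Thread entry point: sleep for the thread's interval, then report. */
void *analysis_main(void *arg)
{
    struct analysis_ctx *ctx = arg;
    unsigned done;

    for (done = 0; ctx->rounds == 0 || done < ctx->rounds; done++) {
        sleep(analysis_interval(ctx->index));
        ctx->report(ctx->index);
    }
    return NULL;
}
```

Each analysis thread would be started with pthread_create, passing its own struct analysis_ctx as the argument.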
If several threads compete to print to the console, only the first analysis thread that started executing has printing rights.
This is established using the first_thread_printing condition variable.
The thread orders the books by pattern occurrence frequency, highest first, and prints to the console in the following format:
{rank} --> Book: {book_title}, Pattern: "{search_pattern}", Frequency: {frequency_count}
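The reporting step (claim printing rights, rank the books, print each line) might be sketched as follows. Only first_thread_printing and the output format come from this project; struct book_stat, print_lock, printing_in_progress and the function names are hypothetical, and the gate shown is just one plausible way to use a condition variable for this purpose.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct book_stat {
    const char *title;
    unsigned long frequency;   /* lines of this book containing the pattern */
};

pthread_mutex_t print_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t first_thread_printing = PTHREAD_COND_INITIALIZER;
int printing_in_progress = 0;  /* guarded by print_lock */

/* qsort comparator: higher frequency sorts first. */
int by_frequency_desc(const void *a, const void *b)
{
    const struct book_stat *x = a, *y = b;

    return (x->frequency < y->frequency) - (x->frequency > y->frequency);
}

/* Writes one report line in the format shown above. */
int format_rank_line(char *out, size_t cap, int rank,
                     const struct book_stat *b, const char *pattern)
{
    return snprintf(out, cap,
                    "%d --> Book: %s, Pattern: \"%s\", Frequency: %lu",
                    rank, b->title, pattern, b->frequency);
}

/* Sorts the caller's snapshot in place and prints it while holding the gate. */
void print_report(struct book_stat *books, size_t n, const char *pattern)
{
    char line[512];
    size_t i;

    pthread_mutex_lock(&print_lock);
    while (printing_in_progress)           /* another thread is printing */
        pthread_cond_wait(&first_thread_printing, &print_lock);
    printing_in_progress = 1;
    pthread_mutex_unlock(&print_lock);

    qsort(books, n, sizeof *books, by_frequency_desc);
    for (i = 0; i < n; i++) {
        format_rank_line(line, sizeof line, (int)i + 1, &books[i], pattern);
        puts(line);
    }

    pthread_mutex_lock(&print_lock);
    printing_in_progress = 0;
    pthread_cond_signal(&first_thread_printing);
    pthread_mutex_unlock(&print_lock);
}
```

In this sketch each analysis thread is expected to pass its own snapshot of the per-book counts, since the array is sorted in place.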