elminster-aom / homeworks

Entertainment exercise for a basic web monitor
The Unlicense

Homework - Web monitoring

Homework description

This application addresses a hypothetical requirement by implementing a monitor that:

  1. Monitors website availability over the network and collects:
    • HTTP response time
    • error code returned
    • pattern that is expected to be found on the page
  2. Produces corresponding metrics and passes these events through a Kafka instance into a PostgreSQL database
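For illustration only (the field names are an assumption, not taken from the project's code), a single monitoring event published to Kafka could look roughly like this:

```python
import json
from datetime import datetime, timezone

# Hypothetical shape of one monitoring event; the real schema used by
# web_monitor_agent.py may differ (all field names here are illustrative).
event = {
    "url": "https://example.com/",
    "checked_at": datetime.now(timezone.utc).isoformat(),  # time of the check
    "http_response_time_ms": 132.4,                        # HTTP response time
    "http_status_code": 200,                               # status/error code returned
    "pattern_found": True,                                  # expected pattern found on the page?
}

# Events are serialized (e.g. as JSON) before being published to the Kafka topic
payload = json.dumps(event).encode("utf-8")
```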

Scenario

How it works

There are 2 main programs: web_monitor_agent.py and sink_connector.py.

In addition, a 3rd program, initialize_infra.py, is responsible for initializing the environment.

web_monitor_agent.py

This component is designed so that several copies of it can run as processes on the same system or on several independent systems. Each process creates a set of threads which monitor the listed URLs (1 thread monitors 5 URLs). All threads publish to the same Kafka topic.

:information_source: The number of URLs monitored per thread (5) is tuned based on the HTTP GET request timeout (15 seconds): even if 4 of the URLs assigned to a thread suffer timeouts, the 5th one does not get its next monitoring check delayed (4 URLs * 15 s max. check time = 60 seconds == MONITORING_RETRY_SECS=60).

Note: Its main restriction is the maximum number of open sockets.
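A minimal sketch of one such monitoring thread, assuming the requests and kafka-python libraries; the function name, topic name, broker address and event fields are illustrative, not the project's actual identifiers:

```python
import json
import re
import time
from datetime import datetime, timezone

import requests
from kafka import KafkaProducer

HTTP_TIMEOUT_SECS = 15        # HTTP GET request timeout
MONITORING_RETRY_SECS = 60    # interval between two checks of the same URL

def monitor_urls(urls, pattern, producer, topic="web_monitoring"):
    """Run by each thread: check its assigned URLs (5 per thread) in a loop
    and publish one event per check to the shared Kafka topic."""
    regex = re.compile(pattern)
    while True:
        for url in urls:
            event = {"url": url, "checked_at": datetime.now(timezone.utc).isoformat()}
            try:
                start = time.monotonic()
                response = requests.get(url, timeout=HTTP_TIMEOUT_SECS)
                event["http_response_time_ms"] = (time.monotonic() - start) * 1000
                event["http_status_code"] = response.status_code
                event["pattern_found"] = bool(regex.search(response.text))
            except requests.RequestException as error:
                event["error"] = str(error)
            producer.send(topic, value=event)   # all threads share the same topic
        time.sleep(MONITORING_RETRY_SECS)

# KafkaProducer is thread-safe, so a single instance can be shared by all threads
producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",   # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
```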

sink_connector.py

This component is designed with performance in mind. Since its code is not thread-safe (see kafka-python -- Project description):

Thread safety

The KafkaProducer can be used across threads without issue, unlike the KafkaConsumer which cannot.

While it is possible to use the KafkaConsumer in a thread-local manner, multiprocessing is recommended.

it has been implemented as a single-threaded process, which trades higher memory usage for better performance.

Therefore, it consumes Kafka messages in windows bounded by either time or number of messages and stores each window as a single batch in a Postgres database, where transactional commit is disabled (we relax this setting since all our SQL operations are ACID).
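A rough sketch of this windowing logic, assuming kafka-python and psycopg2; the topic, table and column names are assumptions and authentication options are omitted for brevity:

```python
import json

import psycopg2
from psycopg2.extras import execute_values
from kafka import KafkaConsumer

WINDOW_SECS = 5            # flush at least every N seconds ...
WINDOW_MAX_MESSAGES = 500  # ... or as soon as this many messages are buffered

consumer = KafkaConsumer(
    "web_monitoring",                                   # placeholder topic name
    bootstrap_servers="kafka.example.com:9092",         # placeholder broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
connection = psycopg2.connect("dbname=monitoring user=monitor")  # placeholder DSN
connection.autocommit = True   # no explicit transaction handling per batch

while True:
    # poll() returns whatever arrived within the time window, capped by max_records
    window = consumer.poll(timeout_ms=WINDOW_SECS * 1000,
                           max_records=WINDOW_MAX_MESSAGES)
    rows = [
        (m.value["url"], m.value["checked_at"], m.value.get("http_status_code"),
         m.value.get("http_response_time_ms"), m.value.get("pattern_found"))
        for messages in window.values() for m in messages
    ]
    if rows:
        with connection.cursor() as cursor:
            # One multi-row INSERT per window keeps the number of DB round-trips low
            execute_values(
                cursor,
                "INSERT INTO web_metrics (url, checked_at, http_status_code,"
                " response_time_ms, pattern_found) VALUES %s",
                rows,
            )
```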

Additionally, performance could be further improved by letting the Kafka and Postgres components work independently on a continuous stream of data, for example using Store_manager.insert_metrics_copy() and/or shared memory (the mmap system call).
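For reference, a COPY-based bulk load with psycopg2 could look roughly like the sketch below; this is only an illustration of the idea, not the project's actual Store_manager.insert_metrics_copy() implementation:

```python
import io

def insert_metrics_copy(connection, rows):
    """Hypothetical COPY-based bulk load: stream tab-separated rows from memory."""
    buffer = io.StringIO()
    for row in rows:
        # Postgres expects \N for NULL values in COPY's text format
        buffer.write("\t".join(r"\N" if field is None else str(field) for field in row) + "\n")
    buffer.seek(0)
    with connection.cursor() as cursor:
        # COPY ... FROM STDIN is significantly faster than multi-row INSERTs
        cursor.copy_expert(
            "COPY web_metrics (url, checked_at, http_status_code,"
            " response_time_ms, pattern_found) FROM STDIN",
            buffer,
        )
```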

On the other hand, to ensure that our storage is optimized for metrics (time-series data) and can retain them for long periods of time, we took advantage of the TimescaleDB plug-in, e.g.:

Scalable

  • Transparent time/space partitioning for both scaling up (single node) and scaling out (forthcoming).
  • High data write rates (including batched commits, in-memory indexes, transactional support, support for data backfill).
  • Right-sized chunks (two-dimensional data partitions) on single nodes to ensure fast ingest even at large data sizes.
  • Parallelized operations across chunks and servers.

initialize_infra.py

It initializes the environment, creating the required resources on the Kafka and Postgres services.
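A minimal sketch of the kind of work this step performs, assuming kafka-python's admin client, psycopg2 and TimescaleDB's create_hypertable() function; the topic, table and column names are illustrative assumptions:

```python
import psycopg2
from kafka.admin import KafkaAdminClient, NewTopic

# 1. Create the Kafka topic that the monitoring agents publish to
admin = KafkaAdminClient(bootstrap_servers="kafka.example.com:9092")  # placeholder
admin.create_topics([NewTopic(name="web_monitoring",
                              num_partitions=3, replication_factor=2)])

# 2. Create the metrics table and convert it into a TimescaleDB hypertable,
#    so it is partitioned by time for efficient long-term storage
connection = psycopg2.connect("dbname=monitoring user=monitor")  # placeholder DSN
with connection, connection.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS web_metrics (
            url               TEXT        NOT NULL,
            checked_at        TIMESTAMPTZ NOT NULL,
            http_status_code  INTEGER,
            response_time_ms  DOUBLE PRECISION,
            pattern_found     BOOLEAN
        );
    """)
    cursor.execute(
        "SELECT create_hypertable('web_metrics', 'checked_at', if_not_exists => TRUE);"
    )
```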

How to install

  1. Clone or download a ZIP of this project, e.g.:
    $ git clone git@github.com:elminster-aom/homeworks.git
  2. Ensure that you have the right version of Python (v3.9, see below)
  3. Create and activate Python Virtual Environment and install required packages, e.g.:
    $ python3 -m venv homeworks \
    && source homeworks/bin/activate \
    && python3 -m pip install --requirement homeworks/requirements.txt
  4. Move into the new environment:
    $ cd homeworks

Further details on Installing packages using pip and virtual environments

How to set up and run

  1. Create (if they don't already exist) a Kafka and a PostgreSQL service ([aiven.io] is an interesting option)
  2. In the case of Kafka, you need to download files for the authentication process. Where to find them and where to set their paths is described below, in the .env section
  3. All available settings are read from an environment-variables file in the home of our application. You can create it from this template: env_example, e.g.:
    $ cp docs/env_example .env
    $ nano .env
    # For information about its parameters, see the .env section below
    $ chmod 0600 .env
  4. Run initialize_infra.py to initialize the infrastructure *
    $ ./initialize_infra.py
  5. Start collecting metrics using web_monitor_agent.py **
    $ ./web_monitor_agent.py
  6. Store the metrics using sink_connector.py **
    $ ./sink_connector.py

    * It only needs to be run once per environment, for initialization purposes

** They can run on the same server or different ones

Local validation tests

Once the How to install section above is completed, the tests can be run like this:

# Validate that sensitive data is protected
$ ./tests/security_test1.sh
$ ./tests/security_test2.sh

# Validate that infrastructure is properly created
$ python3 -m pytest tests/tests.py

# Validate that all parts work together: URL monitoring, Kafka communication and DB storing
$ ./tests/integration_test.sh

.env

Additional considerations

  1. Only Unix-like systems are supported
  2. The code has been tested only with these versions (other versions may work too, but we cannot guarantee it):
    • Kafka 2.7.0
    • PostgreSQL 13.2
    • Python 3.9.4
    • TimescaleDB 2.1
  3. For a detailed list of Python modules, check out requirements.txt

Areas of improvement

Review the list of TODOs

References

I would like to reference some useful information sources that have been crucial for the implementation of this solution, and to thank their creators for their work: