This application addresses a hypothetical request by implementing a monitor that:
There are 2 main programs:
In addition, a 3rd program is responsible for initializing the environment:
This component is designed so that several copies of it can run as processes on the same system or on several independent systems. Each process creates a number of threads which monitor the listed URLs (1 thread monitors 5 URLs). All threads publish to the same Kafka topic.
:information_source: The number of URLs monitored per thread (5) is tuned based on the HTTP GET request timeout (15 seconds). If 4 of the URLs assigned to a thread suffer timeouts, the 5th does not get its next monitoring check delayed (e.g.: 4 timed-out URLs * 15 seconds max. check time = 60 seconds == MONITORING_RETRY_SECS=60).
Note: Its main limitation is the maximum number of open sockets.
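The URL-to-thread assignment described above can be sketched as follows. This is an illustrative snippet, not the project's actual code; names like `URLS_PER_THREAD` and `partition_urls` are assumptions:

```python
# Sketch: partition the monitored URL list into per-thread groups.
# URLS_PER_THREAD and MONITORING_RETRY_SECS mirror the values discussed
# above (hypothetical identifiers, not the project's real ones).
URLS_PER_THREAD = 5          # tuned to the 15 s HTTP GET timeout
MONITORING_RETRY_SECS = 60   # 4 timed-out URLs * 15 s = 60 s

def partition_urls(urls, size=URLS_PER_THREAD):
    """Split the monitored URL list into groups of `size`, one per thread."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://example{n}.com" for n in range(12)]
groups = partition_urls(urls)
print(len(groups))     # 3 groups -> 3 monitoring threads
print(len(groups[0]))  # 5 URLs assigned to the first thread
```

Each group would then be handed to one monitoring thread, and every thread publishes its results to the same Kafka topic.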
This component is designed with performance in mind. Since the Kafka consumer code is not thread-safe (see kafka-python -- Project description):

> Thread safety
>
> The KafkaProducer can be used across threads without issue, unlike the KafkaConsumer which cannot.
>
> While it is possible to use the KafkaConsumer in a thread-local manner, multiprocessing is recommended.

it has been implemented as a single-threaded process, which trades more intensive memory usage for better performance.
Therefore, it consumes Kafka messages in windows of either time or number of messages and stores them as a batch in a Postgres database where transactional commit is disabled (a setting we can relax because all our SQL operations are ACID).
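The "window of time or number of messages" idea can be sketched as below. This is a hypothetical helper for illustration; the project's real class and method names differ:

```python
import time

# Sketch: buffer consumed Kafka messages and flush them as one batch
# when either a message-count or a time window closes (illustrative
# thresholds; not the project's actual configuration).
class MetricsBatcher:
    def __init__(self, max_messages=500, max_seconds=5.0):
        self.max_messages = max_messages
        self.max_seconds = max_seconds
        self.batch = []
        self.window_start = time.monotonic()

    def add(self, message):
        """Buffer one message; return the full batch when a window closes."""
        self.batch.append(message)
        window_closed = (
            len(self.batch) >= self.max_messages
            or time.monotonic() - self.window_start >= self.max_seconds
        )
        if window_closed:
            flushed, self.batch = self.batch, []
            self.window_start = time.monotonic()
            return flushed  # caller INSERTs this batch into Postgres
        return None

batcher = MetricsBatcher(max_messages=3)
results = [batcher.add(m) for m in ("m1", "m2", "m3", "m4")]
print(results[2])  # ['m1', 'm2', 'm3'] -> flushed on the 3rd message
```

Batching like this amortizes the per-transaction cost on the Postgres side, which is what makes the relaxed commit setting pay off.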
Additionally, performance could be further improved if the Kafka and Postgres components worked independently on a continuous stream of data, for example by using Store_manager.insert_metrics_copy() and/or by implementing shared memory (the mmap system call).
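A COPY-based bulk insert of the kind Store_manager.insert_metrics_copy() presumably performs could look like this. The table and column names here are assumptions for illustration, and `connection` is expected to be a psycopg2 connection:

```python
import io

# Sketch: serialize metric rows into a text buffer and bulk-load them
# with COPY, which is much faster than row-by-row INSERTs.
# Table/column names are hypothetical.
def metrics_to_copy_buffer(metrics):
    """Serialize (time, url, status, response_time) rows for COPY FROM STDIN."""
    buffer = io.StringIO()
    for when, url, status, resp_time in metrics:
        buffer.write(f"{when}\t{url}\t{status}\t{resp_time}\n")
    buffer.seek(0)
    return buffer

def insert_metrics_copy(connection, metrics):
    """Bulk-load a batch of metrics via COPY (psycopg2 connection assumed)."""
    with connection.cursor() as cur:
        cur.copy_expert(
            "COPY web_health_metrics (time, url, http_status, response_time) "
            "FROM STDIN WITH (FORMAT text)",
            metrics_to_copy_buffer(metrics),
        )
    connection.commit()

buf = metrics_to_copy_buffer(
    [("2021-05-01T00:00:00Z", "https://example.com", 200, 0.123)]
)
print(buf.read())
```

Because COPY streams rows through a single protocol message, it keeps Postgres ingesting continuously instead of pausing on every statement.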
On the other hand, to ensure that our storage is optimized for metrics (time-series data) and can retain them for long periods of time, we took advantage of the TimescaleDB plug-in, e.g.:
Scalable
- Transparent time/space partitioning for both scaling up (single node) and scaling out (forthcoming).
- High data write rates (including batched commits, in-memory indexes, transactional support, support for data backfill).
- Right-sized chunks (two-dimensional data partitions) on single nodes to ensure fast ingest even at large data sizes.
- Parallelized operations across chunks and servers.
It initializes the environment, creating the required resources on the Kafka and Postgres services.
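A rough sketch of the kind of Postgres resources such an initializer has to create is shown below (the DDL, table, and column names are illustrative assumptions; a real initializer would also create the Kafka topic, e.g. via kafka-python's KafkaAdminClient):

```python
# Sketch: DDL an initializer could run to prepare TimescaleDB storage.
# Column names are hypothetical; only the table name appears in this README.
TABLE_NAME = "web_health_metrics"

CREATE_TABLE_SQL = f"""
CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
    time TIMESTAMPTZ NOT NULL,
    url TEXT NOT NULL,
    http_status SMALLINT,
    response_time DOUBLE PRECISION
);
"""

# TimescaleDB: turn the plain table into a time-partitioned hypertable
CREATE_HYPERTABLE_SQL = (
    f"SELECT create_hypertable('{TABLE_NAME}', 'time', if_not_exists => TRUE);"
)

def init_statements():
    """Return the DDL to run, in order, against the Postgres service."""
    return [CREATE_TABLE_SQL, CREATE_HYPERTABLE_SQL]

print(len(init_statements()))  # 2
```

Running `create_hypertable` is what enables the transparent time/space partitioning and chunked ingestion listed in the TimescaleDB features above.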
$ git clone git@github.com:elminster-aom/homeworks.git
$ python3 -m venv homeworks \
&& source homeworks/bin/activate \
&& python3 -m pip install --requirement homeworks/requirements.txt
$ cd homeworks
Further details on Installing packages using pip and virtual environments
$ cp docs/env_example .env
$ nano .env
# For information of its parameters, see .env section below
$ chmod 0600 .env
initialize_infra.py
for initializing the infrastructure *
$ ./initialize_infra.py
web_monitor_agent.py
**
$ ./web_monitor_agent.py
sink_connector.py
**
$ ./sink_connector.py
* It only needs to be run once per environment, for initialization purposes
** They can run on the same server or different ones
Once the How to install section above is completed, tests can be run like:
# Validate that sensitive data is protected
$ ./tests/security_test1.sh
$ ./tests/security_test2.sh
# Validate that infrastructure is properly created
$ python3 -m pytest tests/tests.py
# Validate that all parts work together: URL monitoring, Kafka communication and DB storing
$ ./tests/integration_test.sh
The `.env` file defines the following settings (see `docs/env_example` for the exact parameter names):

- Workspace path, e.g. `/home/user1/homeworks` (referenced below as `${_WORKSPACE_PATH}`)
- Kafka access certificate (e.g. `${_WORKSPACE_PATH}/tests/service.cert`), available on your Aiven console: Services -> \<Your Kafka> -> Overview -> Access Certificate. IMPORTANT! Although it is encrypted, do not forget to set `service.cert` to read-only for the file owner (`chmod 0600 service.cert`) and exclude it from the git repository.
- Kafka access key (e.g. `${_WORKSPACE_PATH}/tests/service.key`), available on your Aiven console: Services -> \<Your Kafka> -> Overview -> Access Key. IMPORTANT! Do not forget to set `service.key` to read-only for the file owner (`chmod 0600 service.key`) and exclude it from the git repository.
- Kafka CA certificate (e.g. `${_WORKSPACE_PATH}/tests/ca.pem`), available on your Aiven console: Services -> \<Your Kafka> -> Overview -> CA Certificate
- Kafka host (e.g. `kafka.aivencloud.com`), available on your Aiven console: Services -> \<Your Kafka> -> Overview -> Host
- Kafka port (e.g. `2181`), available on your Aiven console: Services -> \<Your Kafka> -> Overview -> Port
- Kafka topic (e.g. `web_monitoring`)
- Monitoring retry interval in seconds (e.g. `60`)
- List of monitored web domains (e.g. `${_WORKSPACE_PATH}/tests/list_web_domains.txt`)
- Set to `True` for performance reasons
- Postgres host (e.g. `postgres.aivencloud.com`), available on your Aiven console: Services -> \<Your PostgreSQL> -> Overview -> Host
- Postgres user (e.g. `avnadmin`), available on your Aiven console: Services -> \<Your PostgreSQL> -> Overview -> User
- Postgres password (e.g. `p4ssW0rd1`), available on your Aiven console: Services -> \<Your PostgreSQL> -> Overview -> Password
- Postgres port (e.g. `5432`), available on your Aiven console: Services -> \<Your PostgreSQL> -> Overview -> Port
- Postgres SSL mode (e.g. `require`). For a full list of values, check the PostgreSQL documentation
- Metrics table name (e.g. `web_health_metrics`)

Review the list of TODOs
I would like to reference some useful information sources that were crucial for the implementation of this solution, and to thank their creators: