EvoTestOps / LogLead

LogLead stands for Log Loader, Enhancer, and Anomaly Detector.
MIT License

Put test data into repository #28

Closed bakhtos closed 1 month ago

bakhtos commented 4 months ago

Currently the actual tests cannot be run because the data is only on the authors' computers. Since there seems to be a special test batch, it could be checked into the repository.

mmantyla commented 4 months ago

There are a couple of reasons for this. 1) The data is roughly 500 GB, which I think is too much for GitHub. 2) We do not have a license or permission to share the data.

To implement this, we need a routine that first downloads the full data if it is not already available on disk. Only after that can the tests be run. Below is a copy-paste from the README that outlines the data sources.

  - 3: HDFS_v1, Hadoop, and BGL, thanks to the amazing LogHub team. For full data, see Zenodo.
  - 3: Spirit, Thunderbird, and Liberty can be found on the Usenix site.
  - 2: Nezha has data from two systems, TrainTicket and Google Cloud Webshop demo. It is the first dataset of microservice-based systems. Like other traditional log datasets it has log data, but additionally there are traces and metrics.
  - 2: ADFA and AWSCTD are two datasets designed for intrusion detection.

bakhtos commented 4 months ago

What I mean is: the test files seem to use some subset of the data https://github.com/EvoTestOps/LogLead/blob/c1c749829d04914d1ed5fd5d0415f927ea9cae2b/tests/anomaly_detectors.py#L14

Or is it still a huge dataset that we are not allowed to share?

mmantyla commented 4 months ago

The tests run through the following steps. They are linked via saved files whose naming convention indicates which phase created the data. The first step takes in the huge raw data that we have no permission to share; it also samples the data down and saves it for steps 2 and 3. A step 0 is needed for downloading, so that anyone can execute the full pipeline. Currently, step 0 has been done manually.

  1. https://github.com/EvoTestOps/LogLead/blob/c1c749829d04914d1ed5fd5d0415f927ea9cae2b/tests/loaders.py
  2. https://github.com/EvoTestOps/LogLead/blob/c1c749829d04914d1ed5fd5d0415f927ea9cae2b/tests/enhancers.py
  3. https://github.com/EvoTestOps/LogLead/blob/c1c749829d04914d1ed5fd5d0415f927ea9cae2b/tests/anomaly_detectors.py
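The chaining described above can be sketched roughly as follows. This is a hypothetical illustration only: the file names, suffixes, and formats are assumptions, not LogLead's actual conventions.

```python
from pathlib import Path

# Hypothetical sketch of chaining test phases via saved files whose
# names encode which phase produced them; LogLead's real naming
# convention and formats may differ.
DATA_DIR = Path("test_data")

def phase_output(dataset, phase_suffix):
    """Build the file name a phase writes, e.g. 'hdfs_lo.parquet'."""
    return DATA_DIR / f"{dataset}_{phase_suffix}.parquet"

def next_phase_ready(dataset, prev_suffix):
    """A phase only runs once its predecessor's output file exists."""
    return phase_output(dataset, prev_suffix).exists()
```

With this scheme, the enhancer would only run once the loader's output file exists, and the anomaly detectors only once the enhancer's output exists.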

mmantyla commented 3 months ago

The latest commit fixes this. Now the data is downloaded when one runs main.py in the tests folder. https://github.com/EvoTestOps/LogLead/commit/ddc85341cff53b9d53310c6467fe1d0ee44d59c3

The data download can also be run separately. It does not overwrite existing data: it checks whether the dataset folder exists and, if it does, skips the download. https://github.com/EvoTestOps/LogLead/blob/main/tests/download_data.py
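The skip-if-present behaviour could look like the minimal sketch below. The function name, folder layout, and archive name are assumptions for illustration; the real download_data.py may be structured differently.

```python
import urllib.request
from pathlib import Path

def download_dataset(name, url, root="test_data"):
    """Download a dataset only if its folder does not already exist.

    Sketch of the skip-if-present idea described above; not the actual
    LogLead implementation. `name` and `url` would come from the config.
    """
    target = Path(root) / name
    if target.exists():
        # Folder already present: do not overwrite, do not re-download.
        print(f"{name}: folder exists, skipping download")
        return target
    target.mkdir(parents=True)
    # Save the raw archive inside the freshly created dataset folder.
    urllib.request.urlretrieve(url, target / "raw_data")
    return target
```

Because existence is checked before any network access, re-running the script is cheap and idempotent.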

This config file controls what gets downloaded and tested. Commenting out rows disables both downloading and testing. https://github.com/EvoTestOps/LogLead/blob/main/tests/datasets.yml
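The general shape of such a config might look like the fragment below. The keys, dataset entries, and URLs here are placeholders, not the contents of the actual datasets.yml.

```yaml
# Hypothetical illustration of the datasets.yml idea; the real file's
# keys and URLs differ. Commenting out a row disables both the
# download and the tests for that dataset.
datasets:
  hdfs:
    url: https://example.org/hdfs_v1.zip   # placeholder URL
  bgl:
    url: https://example.org/bgl.zip       # placeholder URL
  # thunderbird:                           # commented out: skipped
  #   url: https://example.org/tbird.zip
```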

mmantyla commented 3 months ago

@jnyyssol you had found a couple of new datasets. Can you add them to the config so they also get downloaded? Please also add them to the tests.

jnyyssol commented 3 months ago

> @jnyyssol you had found a couple of new datasets. Can you add them to the config so they also get downloaded? Please also add them to the tests.

This is a bit tricky, because the ADFA and AWSCTD datasets already consist of event IDs. Therefore most of the enhancements don't make sense, and some even cause the enhancer to crash. I got the tests to run with ADFA and AWSCTD by doing the following:

  1. Download the data (needs the new py7zr package to unpack .7z archives)
  2. Load the data and save these two directly with the _eh suffix, which indicates they have been enhanced
  3. This will cause the enhancer to skip them
  4. Add a check in anomaly_detectors.py to ensure numeric columns exist (which they don't in these datasets, so many tests are skipped)
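The numeric-column guard in step 4 could be sketched as below. This is a generic illustration working on a plain column-name-to-dtype mapping; the actual check in anomaly_detectors.py operates on dataframes and may list dtypes differently.

```python
# Hypothetical sketch of the "do numeric columns exist" guard from
# step 4; not the actual anomaly_detectors.py code.
NUMERIC_DTYPES = {"int32", "int64", "float32", "float64"}

def numeric_columns(schema):
    """Return column names whose dtype name is numeric.

    `schema` maps column name -> dtype name, e.g. {"seq_len": "int64"}.
    """
    return [col for col, dtype in schema.items()
            if str(dtype).lower() in NUMERIC_DTYPES]

def should_run_numeric_tests(schema):
    """Skip numeric anomaly-detector tests when no numeric columns
    exist, as with the event-ID-only ADFA and AWSCTD datasets."""
    return len(numeric_columns(schema)) > 0
```

For ADFA and AWSCTD, where every column holds event IDs, the guard returns False and the numeric tests are skipped rather than crashing.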

@mmantyla do you think it makes sense to include these two in the tests given that they are so different? Before pushing I still need to check that my changes didn't break anything regarding the other datasets.

mmantyla commented 1 month ago

I am closing this. The full test data will never be in the LogLead repo. However, the tests folder already has a mechanism for downloading all supported datasets.