the result is not reproducible

mannnnntheng commented 2 years ago

Hi, I was trying out DeepLog on the HDFS1 dataset (using only the first 1M lines, parsed by Drain).

I ran it with the following parameter settings:

python main_run.py --folder=hdfs_1m/ --log_file=HDFS_1m.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process

This is the result: Precision: 86.964%, Recall: 53.931%, F1-measure: 66.576%, Specificity: 0.996 (I have tried different parameter settings, and there is not much of a difference).
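For reference, the reported F1 is consistent with the reported precision and recall, which shows that the low recall is what pulls the score down. A minimal check, assuming the standard harmonic-mean definition of F1 (the function name is just for illustration and is not part of the repository):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above, converted from percentages to fractions.
reported_precision = 0.86964
reported_recall = 0.53931

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3%}")
```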

Could you please give me some advice? Thanks in advance!

vanhoanglepsa commented 2 years ago

Hi,

We haven't included every part of our work in this repository yet, so it might be quite difficult to use. We will add the remaining parts in the near future.

For your problem, I think you should run on the full HDFS dataset, because for this dataset we need to group logs by session. If you use only the first 1M log lines, there may be many incomplete sessions. These truncated sessions might not contain any anomalies, because the anomalies can only occur at the end of a session. As a result, the model could identify many abnormal sequences as normal (low recall).
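To see how many sessions a cutoff truncates, something like the following could be used. This is only a rough sketch and not code from this repository: it assumes that sessions are identified by the block ID (`blk_...`) appearing in each HDFS log line, and the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: an HDFS session is identified by its block ID, e.g. blk_-1608999687919862906.
BLOCK_ID = re.compile(r"blk_-?\d+")


def lines_per_session(lines):
    """Count how many log lines mention each block ID."""
    counts = Counter()
    for line in lines:
        # A line may mention the same block more than once; count it once.
        for block_id in set(BLOCK_ID.findall(line)):
            counts[block_id] += 1
    return counts


def truncated_sessions(all_lines, n_first):
    """Return block IDs whose sessions are cut short by keeping only the first n_first lines."""
    full = lines_per_session(all_lines)
    head = lines_per_session(all_lines[:n_first])
    return {block_id for block_id, n in head.items() if n < full[block_id]}


# Hypothetical, simplified log lines (not taken from the real dataset).
sample = [
    "Receiving block blk_1 src: /10.0.0.1 dest: /10.0.0.2",
    "Receiving block blk_2 src: /10.0.0.3 dest: /10.0.0.4",
    "PacketResponder 0 for block blk_1 terminating",
    "Exception in receiveBlock for block blk_2",
]

# Keeping only the first 3 lines drops the final event of blk_2.
print(truncated_sessions(sample, 3))
```

In this toy example the last event of `blk_2` falls after the cutoff, so the truncated session no longer contains the event that would mark it as abnormal.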

Thanks

mannnnntheng commented 2 years ago

I see, thank you!

alishan2040 commented 2 years ago

@mannnnntheng did you make any changes to the HDFS_1m logs? For example, did you remove the header line in the CSV file? How did you find the embeddings.json file, which is mandatory for running the code? Could you please describe the changes you made to reproduce the results? Thanks!

LogIntelligence / LogADEmpirical

the result is not reproducible #3