d0ng1ee / logdeep

log anomaly detection toolkit including DeepLog
MIT License

Question about feature extraction on bgl dataset #10

Open cherishwsx opened 4 years ago

cherishwsx commented 4 years ago

Hi, it's me again. :)

I'm trying to perform the Deeplog model on bgl dataset. So far, I was able to understand the logic and generate the event sequences from structured bgl log dataset using this sample_bgl.py that you provided (many thanks!!).

It basically slides a 30-min window with a 12-min step size over the structured bgl log. As a result, we end up with event sequences that contain either a huge number of events (e.g. I found an event sequence with 12514 events in it...) or only one or zero events (since no events happened in that time window).
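The time-based sliding window described above can be sketched roughly as follows. This is a minimal illustration, not the actual `sample_bgl.py` code; the function name and the `(timestamp, event_id)` input format are my own assumptions:

```python
from typing import List, Tuple

def time_sliding_windows(events: List[Tuple[float, int]],
                         window_size: float = 30 * 60,
                         step_size: float = 12 * 60) -> List[List[int]]:
    """Group (timestamp, event_id) pairs, sorted by timestamp, into
    overlapping time windows. Sizes are in seconds (30 min / 12 min)."""
    if not events:
        return []
    start, end = events[0][0], events[-1][0]
    windows = []
    t = start
    while t <= end:
        # all events whose timestamp falls inside [t, t + window_size)
        seq = [eid for ts, eid in events if t <= ts < t + window_size]
        windows.append(seq)
        t += step_size
    return windows
```

With overlapping windows (step < window) the same event can appear in several sequences, and windows covering quiet periods come out empty or with a single event, which matches what you observed.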

After generating the event sequences, I deleted the empty ones and ended up with a file of 65 non-empty event sequences. I randomly picked 60 of them as training sequences, and the remaining 5 will be validation data.

And this is when my questions kick in.

  1. When generating the sequential features for the training dataset, should I do the same thing as for the hdfs dataset, i.e. slide a window of size 10 (or some other size) over each event sequence, with the event right after the current window serving as that window's label? But in that case, how should I deal with an event sequence that contains only 1 event?

  2. I also remember you mentioned in another post that for the bgl dataset one can directly use the event sequences as sequential vectors, since they are already generated with a sliding window. In my understanding, each event sequence (minus its last event) would then directly be a sequential vector, and the label for that vector would be the last event of the sequence. Is that right? And again, what about an event sequence with only 1 event?
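To make question 1 concrete, here is a minimal sketch of the hdfs-style window/label extraction I have in mind (next-event prediction; the function name and default window size of 10 are my assumptions, not from this repo's code):

```python
def make_sequential_samples(seq, window=10):
    """Slide a fixed-size window over one event sequence.

    Each sample is (window of `window` events, the event that
    immediately follows it), i.e. next-event prediction pairs
    in the DeepLog style.
    """
    samples = []
    for i in range(len(seq) - window):
        samples.append((seq[i:i + window], seq[i + window]))
    return samples
```

Note that a sequence with fewer than `window + 1` events yields no samples at all, which is exactly the edge case I am asking about.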

Look forward to your valueble feedback!! And thank you for answering all of my questions!!!

d0ng1ee commented 4 years ago

If you use machine learning methods, you can directly extract features from the sequences obtained via the sliding window. I have tried the bgl dataset with https://github.com/logpai/loglizer ; setting window_size=1h and step_size=0.5h gave a better result. I have not continued experimenting with the bgl dataset on the lstm model :(

But the lstm method requires the input length to be consistent, so for unsupervised learning you need to apply a fixed-size window, as with hdfs.

I think event sequences with only 1 event can be ignored as noise during training (do not include them in training), and can simply be padded during testing. Of course, this is just my rough understanding...
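The padding idea could look like the sketch below; the choice of a dedicated pad key `0` and left-padding are my own assumptions, not something fixed by the repo:

```python
def pad_sequence(seq, window=10, pad_id=0):
    """Left-pad a too-short event sequence with a dummy key `pad_id`
    so it reaches the fixed window length expected by the lstm.

    `pad_id` should be a key that never occurs as a real log event.
    """
    if len(seq) >= window:
        return seq
    return [pad_id] * (window - len(seq)) + seq
```

At test time, a 1-event sequence padded this way gives the model a full-length input whose label is the single real event; during training such sequences are simply dropped.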

cherishwsx commented 4 years ago

Thank you for the reply!

I think I will try tuning window_size and step_size when generating the event sequences for the bgl data, to get a more evenly distributed sequence length (ideally not varying from 1, 2, 3 all the way to 12514...). Then I can apply a good fixed window over the event sequences, just like we do on hdfs, to generate sequential vectors and fit the lstm model. :)