amulog / amulog-semantics

semantic analysis extension of amulog
BSD 3-Clause "New" or "Revised" License

Question about your paper #1

Closed: trungkien210493 closed this issue 2 years ago

trungkien210493 commented 2 years ago

Hi @cpflat , I read your paper carefully ("Latent Semantics Approach for Network Log Analysis: Modeling and its application"). It is great work. I am a bit confused about how you model topics. You have 34.7M log messages and 1789 log templates in the SINET4 dataset. Did you apply topic modeling to a corpus built from the 1789 log templates (each log template as one document, so a corpus of 1789 inputs) or to the 34.7M log messages (convert each raw log message to its log template and treat it as a document, so a corpus of 34.7M inputs)?
Also, it seems that in your paper you did not preprocess periodic events such as cron jobs that run at fixed times, e.g. a cron job that always runs at 0h. Such a job only affects the topic distribution in that specific time window, but it can affect the result when the ticket reference time does not match that window (e.g. you have a ticket at another time and a similar issue at the time the cron job runs; the cron job changes the topic distribution and can decrease the cosine similarity).

cpflat commented 2 years ago

Sorry for the late reply.

We use a corpus with 1789 inputs. As the log messages are converted into log templates before the semantic analysis, the 34.7M raw messages include too many duplications. Also, from the viewpoint of processing time, the full set of log messages should not be used for semantic analysis, because the analysis would not finish in an acceptable time. Furthermore, if we use one input per log template, we can focus more on minor log messages, which are usually more important in network troubleshooting than frequently repeated messages. That is why we use the smaller corpus.
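
For illustration, here is a minimal sketch of that setup (each log template as one document) using gensim LDA. The template strings, tokenization, and topic count are placeholder assumptions, not the actual pipeline from the paper:

```python
# Hedged sketch: each log template (not each raw message) is one document.
# Template strings, num_topics, and tokenization are illustrative assumptions.
from gensim import corpora, models

templates = [
    "Interactive command executed by user <*>",
    "Interface <*> changed state to down",
    "BGP neighbor <*> session established",
    # ... one entry per log template (~1789 in SINET4), not per raw message
]

# Simple whitespace tokenization; the paper's preprocessing may differ.
docs = [t.lower().split() for t in templates]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, num_topics=10, id2word=dictionary, random_state=0)

# Topic distribution of one template-document, usable for cosine similarity
# against tickets or other templates.
print(lda.get_document_topics(bow[0], minimum_probability=0.0))
```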

Basically, the time-series factor is meaningless, at least in the semantic analysis part. We did not consider the periodicity issue in the application with LogCluster, for a fair comparison in the evaluation. At least in our experiments, periodicity was not a root cause of estimation failures in our manual survey. If needed, we have discussed how to remove periodic components from log time-series data in another paper, as you may know: https://doi.org/10.1109/TNSM.2017.2778096

trungkien210493 commented 2 years ago

Thank you for your response. It is clearer to me now, and I understand the reason you chose the templates as input. The periodic component is not a root cause, but it can affect the topic distribution if the number of occurrences is enormous. In my case, a script always runs at 0h and spams the "Interactive command" message in the log, which can decrease the cosine similarity when I compare against the same error occurring outside 0h.

About the periodic events, I also refer to your papers "Mining causality of network events in log data", "Causal analysis of network logs with layered protocols and topology knowledge", and the latest version "A Quantitative Causal Analysis for Network Log Data", and I read your implementation in logdag. From your response, it seems that you use the preprocessed data as input for detection. In my opinion (I also tried the preprocessing with a Fourier transform on my data), it has two problems:

1. The processing requires lots of data to remove long-interval periodicity.
2. The result after removal with Fourier can be negative or non-integer.

cpflat commented 2 years ago

> The processing requires lots of data to remove long-interval periodicity.

I understand the tradeoff. If we only consider one day of data, we will miss per-day periodicity; if we use longer data, it requires a long processing time. One possible solution is using a time-series DB (such as InfluxDB) to store the input of the preprocessing. With it, we can query time-series data at different granularities in a short time, because such DBs are designed mainly for fast and dynamic visualization of time-series data. If you need to focus on periodicity with long intervals, you can query the time series at a coarse granularity, e.g., one week or one month of data with a bin size of one hour.
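
As a rough sketch of that idea, assuming template occurrence counts are already written to an InfluxDB 2.x bucket (the bucket, measurement, field, and tag names below are made up for illustration and are not part of amulog/logdag):

```python
# Hedged sketch: query per-template occurrence counts at coarse granularity
# (1-hour bins over 1 month) from InfluxDB 2.x.
from influxdb_client import InfluxDBClient

query = '''
from(bucket: "logs")
  |> range(start: -30d)
  |> filter(fn: (r) => r._measurement == "log_template" and r._field == "count")
  |> aggregateWindow(every: 1h, fn: sum, createEmpty: true)
'''

with InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="ORG") as client:
    tables = client.query_api().query(query)
    for table in tables:
        for record in table.records:
            # One row per (template id tag, 1-hour bin); long-interval
            # periodicity can be inspected on this coarser series.
            print(record.values.get("tid"), record.get_time(), record.get_value())
```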

> The result after removal with Fourier can be negative or non-integer.

Sorry, I don't have enough time to validate my code right away. If that is correct, it's just a bug or a design error. We basically use the source/period.py module from source/filter_log.py. You can see that the preprocessed time-series data is restored into a list of datetimes (in LogFilter.filter_periodic), and the negative values seem to be simply ignored (treated the same as 0). At least in my memory, I intended to subtract 5,5,5,5,5 (not 6,6,6,6,6) in the case of your example (i.e., min(data[periodic_time]) rather than median(data[periodic_time])). Anyway, pull requests are welcome if you have some other solution.
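
For reference, here is a toy sketch of the behavior being discussed, not the actual source/period.py code (the function names are hypothetical): a per-phase baseline (min by default, median as the alternative) is subtracted, negatives are clipped to zero, and the remainder is reverted to a list of datetimes.

```python
# Toy sketch, not the logdag implementation: remove a periodic baseline from
# a binned count series and revert the remainder to datetime occurrences.
from datetime import datetime, timedelta
import numpy as np

def remove_periodic(counts, period_bins, baseline=np.min):
    """Subtract a per-phase baseline (min by default, median as the
    alternative discussed here) and clip negative values to zero."""
    counts = np.asarray(counts, dtype=float)
    out = counts.copy()
    for phase in range(period_bins):
        idx = np.arange(phase, len(counts), period_bins)
        out[idx] -= baseline(counts[idx])
    return np.clip(out, 0, None)

def revert_event(counts, start, binsize):
    """Restore the filtered series into a list of datetimes, repeating each
    bin's timestamp count times (the l_dt += [dt] * int(val) behavior)."""
    l_dt = []
    for i, val in enumerate(counts):
        if val > 0:
            l_dt += [start + i * binsize] * int(val)
    return l_dt

# Five periods of four bins each; one phase has counts [6, 6, 5, 6, 6] across
# periods, mirroring the 5-vs-6 (min vs median) example from the discussion.
counts = [0, 6, 0, 0,
          0, 6, 0, 0,
          0, 5, 0, 0,
          0, 6, 0, 0,
          0, 6, 0, 0]
filtered_min = remove_periodic(counts, period_bins=4, baseline=np.min)
filtered_med = remove_periodic(counts, period_bins=4, baseline=np.median)
print(revert_event(filtered_min, datetime(2022, 1, 1), timedelta(hours=6)))  # 4 residual events
print(revert_event(filtered_med, datetime(2022, 1, 1), timedelta(hours=6)))  # [] (negatives clipped)
```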

trungkien210493 commented 2 years ago

Thank you for your response.

> One possible solution is using a time-series DB (such as InfluxDB) to store the input of the preprocessing.

My point of view is to detect as soon as possible. In my case, the number of hosts can be more than 500 devices, so it is still a huge load if I use InfluxDB to count the template occurrences grouped into time bins (60 s, as your paper recommends). My monitoring system also uses InfluxDB as a backend, so I have a lot of experience with it. I will try tuning the query range to see the impact.

> You can see that the preprocessed time-series data is restored into a list of datetimes (in LogFilter.filter_periodic), and the negative values seem to be simply ignored (treated the same as 0).

Yes, I see your code in filter_periodic; you ignore values less than or equal to 0. But I guess that in "_revert_event", it should be l_dt += [dt] instead of l_dt += [dt] * int(val) (i.e., add the time index once when the value is greater than 0, as in the code you commented). Otherwise, based on your slides at http://sat.hongo.wide.ad.jp/papers/crw2018.pdf, I guess that min(data[periodic_time]) is more reasonable than the median (as on slide 36), and after that the subtraction will always return a value greater than or equal to 0. I think it is clear now. Thank you again for your kind support. I will close this issue, try to validate the periodicity handling in my case to test the median and the min, and make a PR if needed.