d0ng1ee / logdeep

log anomaly detection toolkit including DeepLog
MIT License

hdfs_train sequence file doesn't correspond to the sequence file generated from the 100k structured file provided in the repository #21

Open Zanis92 opened 3 years ago

Zanis92 commented 3 years ago

Hi,

Can you kindly let me know how you got the 4855 sequences in hdfs_train? When I use your 'sample_hdfs.py' script to generate a sequence file from the 100k structured file provided in the repository, it produces 7940 sequences. Any help would be highly appreciated.

Thanks

zeinabfarhoudi commented 3 years ago

Hi @donglee-afar, I have the same question: can you let me know how to get the sequences in hdfs_train? The sequences in the HDFS_sequence.csv file are different from those in hdfs_train. Thanks for sharing your code.

ZanisAli commented 3 years ago

> Hi @donglee-afar, I have the same question: can you let me know how to get the sequences in hdfs_train? The sequences in the HDFS_sequence.csv file are different from those in hdfs_train. Thanks for sharing your code.

There is a sample file in the code that generates the sequences specifically for the BGL dataset. You can adapt it for HDFS and group by block IDs instead, since the anomaly_label.csv file for HDFS contains a label per block ID.
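
A minimal sketch of that idea, assuming a Drain-style structured CSV with Content and EventId columns plus the standard anomaly_label.csv; the file names and columns are assumptions, not the author's exact pipeline:

```python
# Hypothetical sketch (not the repository's sample_hdfs.py): group a
# Drain-structured HDFS CSV into per-block event-ID sequences and
# label each block via anomaly_label.csv. File/column names are assumed.
import re
from collections import defaultdict

import pandas as pd

struct = pd.read_csv("HDFS.log_structured.csv")    # assumed parser output
labels = pd.read_csv("anomaly_label.csv")          # columns: BlockId, Label
label_map = dict(zip(labels["BlockId"], labels["Label"]))

sequences = defaultdict(list)
for content, event_id in zip(struct["Content"], struct["EventId"]):
    # each log line may mention one or more block IDs (blk_...)
    for blk in set(re.findall(r"blk_-?\d+", str(content))):
        sequences[blk].append(event_id)

normal = {b: s for b, s in sequences.items() if label_map.get(b) == "Normal"}
abnormal = {b: s for b, s in sequences.items() if label_map.get(b) == "Anomaly"}
print(len(normal), "normal blocks,", len(abnormal), "abnormal blocks")
```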

JinYang88 commented 3 years ago

Same question.

ZanisAli commented 3 years ago

@JinYang88, I understood that the file used to generate the sequence file is different from the one provided for HDFS; most probably it was produced by some template identification / log parsing technique. So there is no issue anymore. :-)

JinYang88 commented 3 years ago

@ZanisAli

I am sorry, I do not really understand that. I found that the training data can seriously affect the results; could you please explain how to get hdfs_train?

ZanisAli commented 3 years ago

@JinYang88 True, the training data does affect the results a lot, but what I mean is that you don't need the exact same training data, because the author never said it was generated from the same structured file that is provided in the repository. Here is the script provided by the author to get the data: https://github.com/donglee-afar/logdeep/blob/master/data/sampling_example/sample_hdfs.py

Moreover, after getting the sequences, you can split them however you wish, e.g. 20% or 80% for training; it all depends on you.
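
For instance, a tiny split sketch, assuming the normal sessions have already been written to an intermediate file with one space-separated event-ID sequence per line (which is what hdfs_train looks like); the 80/20 ratio and the file names are arbitrary assumptions:

```python
# Hypothetical train/test split of normal sessions. Assumes an
# intermediate file with one space-separated event-ID sequence per line;
# the 80/20 ratio and file names are arbitrary.
import random

with open("hdfs_sequences_normal.txt") as f:       # assumed intermediate file
    sessions = [line.strip() for line in f if line.strip()]

random.seed(42)                                    # reproducible shuffle
random.shuffle(sessions)

cut = int(0.8 * len(sessions))
with open("hdfs_train", "w") as f:
    f.write("\n".join(sessions[:cut]) + "\n")
with open("hdfs_test_normal", "w") as f:
    f.write("\n".join(sessions[cut:]) + "\n")
```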

JinYang88 commented 3 years ago

@ZanisAli I would like to use the semantic information of each template instead of only the IDs, so I want to know how hdfs_train was obtained from the raw log data, in order to reconstruct the raw template for each ID. Do you have any clues?

ZanisAli commented 3 years ago

@JinYang88 If what you want is semantic information, then most probably you are talking about the event2semantic_vector.json file provided by the author, which is not used by DeepLog at all. For the event2semantic vector, the author provided code in Issue #3, and that will give you the semantic information about the templates. hdfs_train doesn't carry any semantic information: it doesn't care about the templates themselves, only about the IDs of the templates.

JinYang88 commented 3 years ago

@ZanisAli Thanks for your helpful advice!

I checked Issue #3, which is really what I want, but in the code the author provided, the mapping eventid2template.json is missing; it is needed to find the corresponding template for each ID in hdfs_train.

ZanisAli commented 3 years ago

@JinYang88 In the first code snippet of Issue #3 ("code 1"), eventid2template.json is produced as an output file.

JinYang88 commented 3 years ago

@ZanisAli Many thanks!!!!!!

JinYang88 commented 3 years ago

@ZanisAli But the file templates.txt for HDFS is also missing.

ZanisAli commented 3 years ago

@JinYang88 templates.txt contains the templates identified by a log parsing technique such as Drain. You can read the templates from the output of the log parser and write them to a txt file. There are several other things the author does not provide, such as en_core_web_sm, which you can get from the spaCy library, or cc.en.300.vec, which you can download from https://fasttext.cc/docs/en/crawl-vectors.html. So what I mean is that you need to research things a bit, as the author can't provide 4-5 GB of data in the GitHub repository :-). I hope that answers your question.
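
As a rough illustration, here is a sketch of building templates.txt from the template file a Drain-style parser typically emits; the "HDFS.log_templates.csv" name and the EventTemplate column are assumptions about the parser output, not files from this repository:

```python
# Hypothetical sketch: write one template per line to templates.txt,
# reading the template CSV a Drain-style parser typically produces.
# File and column names are assumptions.
import pandas as pd

templates = pd.read_csv("HDFS.log_templates.csv")

with open("templates.txt", "w") as f:
    for tpl in templates["EventTemplate"]:
        f.write(str(tpl) + "\n")
```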

JinYang88 commented 3 years ago

@ZanisAli Really, thanks for your help. I understand how to generate the templates myself, but I would like to get exactly the same template-to-ID mapping used by the author, because the order of templates in templates.txt determines the IDs used in hdfs_train.

ZanisAli commented 3 years ago

@JinYang88 As far as I know, the exact template IDs don't matter as long as they are used consistently. For example, you may want template T1 to get ID1, but in general you could just as well assign ID5 to T1, ID6 to T2, and so on; the numbering itself doesn't matter, because the anomaly detection technique doesn't care whether a template is called ID1 or ID2. One caveat: if you don't start from ID1, you might need to change a lot of the implementation, because many things are hard-coded.

Coming back to your question: the author started from ID1, so you can use the mapping {item: i for i, item in enumerate(struct_file['EventId'].unique(), start=1)}; this way the first template gets ID1, the second ID2, and so on.
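
Expanded into a runnable sketch (the structured-file name and the EventId column are assumptions):

```python
# Hypothetical sketch expanding the one-liner above: map each raw parser
# EventId to a 1-based integer ID in order of first appearance, and save
# the mapping for later reuse. File/column names are assumptions.
import json

import pandas as pd

struct_file = pd.read_csv("HDFS.log_structured.csv")
eventid_to_num = {item: i
                  for i, item in enumerate(struct_file["EventId"].unique(), start=1)}

with open("eventid_to_num.json", "w") as f:
    json.dump(eventid_to_num, f, indent=2)
```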

JinYang88 commented 3 years ago

@ZanisAli Great, thanks for your help.

BTW, do you happen to know where to download the full OpenStack dataset used in the original DeepLog paper? The link to Min Du's homepage does not work anymore.

ZanisAli commented 3 years ago

Hi,

They are all available in the LogHub GitHub repository. You can just search for the name and it will be the first or second result.

JinYang88 commented 3 years ago

@ZanisAli Yes, it is there, but the OpenStack data maintained in LogHub is not the full version with more than 10M log entries.

ZanisAli commented 3 years ago

@JinYang88 According to them, this is the complete log: https://zenodo.org/record/3227177#.YJuhypMza3c

JinYang88 commented 3 years ago

@ZanisAli The OpenStack dataset at this link (only < 10 MB) is not the complete version.

tongxiao-cs commented 2 years ago

The DeepLog paper says, "DeepLog needs a small fraction of normal log entries to train its model. In the case of HDFS log, only less than 1% of normal sessions (4,855 sessions parsed from the first 100,000 log entries compared to a total of 11,197,954) are used for training."

So I'm still wondering how to get the 4855 sequences in hdfs_train. Any ideas?
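
One plausible reading of that sentence, sketched below as an assumption rather than the author's confirmed procedure: take the first 100,000 raw HDFS log lines, group them into sessions by block ID, and keep only the sessions labelled Normal in anomaly_label.csv. Counting those would show whether the 4,855 figure comes from normal-only sessions rather than from all sessions.

```python
# Hypothetical check of the paper's description (an assumption, not the
# author's confirmed pipeline): count blocks that appear in the first
# 100,000 raw HDFS log lines and are labelled "Normal".
import re

import pandas as pd

labels = pd.read_csv("anomaly_label.csv")          # columns: BlockId, Label
label_map = dict(zip(labels["BlockId"], labels["Label"]))

blocks = set()
with open("HDFS.log") as f:                        # raw LogHub HDFS log
    for i, line in enumerate(f):
        if i >= 100_000:
            break
        blocks.update(re.findall(r"blk_-?\d+", line))

normal_blocks = [b for b in blocks if label_map.get(b) == "Normal"]
print(len(normal_blocks), "normal sessions in the first 100k log lines")
```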