LogIntelligence / LogADEmpirical

Log-based Anomaly Detection with Deep Learning: How Far Are We? (ICSE 2022, Technical Track)
MIT License
162 stars · 40 forks

embeddings.json #7

Open stephen-hayne opened 2 years ago

stephen-hayne commented 2 years ago

I'm trying to reproduce your results (like another poster here)...

Perhaps a silly question, but after downloading the HDFS and BGL datasets and running them through Drain, I'm now getting this error - can you advise how/where to get your "embeddings.json" file?

python3 main_run.py --folder=hdfs/ --log_file=HDFS.log --dataset_name=hdfs --device=cpu --model_name=deeplog --window_type=session --sample=sliding_window --is_logkey --train_size=0.4 --train_ratio=1 --valid_ratio=0.1 --test_ratio=1 --max_epoch=100 --n_warm_up_epoch=0 --n_epochs_stop=10 --batch_size=1024 --num_candidates=70 --history_size=10 --lr=0.001 --accumulation_step=5 --session_level=hour --window_size=50 --step_size=50 --output_dir=experimental_results/deeplog/session/cd2 --is_process
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Loading ./dataset/hdfs/HDFS.log_structured.csv
575061it [00:00, 1983685.17it/s]
11175629it [00:19, 566251.66it/s]
Save options parameters
vocab size 20
save vocab in experimental_results/deeplog/session/cd2hdfs/deeplog_vocab.pkl
Loading vocab
20
Loading train dataset

Traceback (most recent call last):
  File "main_run.py", line 213, in <module>
    main()
  File "main_run.py", line 195, in main
    run_deeplog(options)
  File "/stephen/LogADEmpirical/logadempirical/deeplog.py", line 26, in run_deeplog
    Trainer(options).start_train()
  File "/stephen/LogADEmpirical/logadempirical/logdeep/tools/train.py", line 101, in __init__
    train_logs, train_labels = sliding_window(data,
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 108, in sliding_window
    event2semantic_vec = read_json(os.path.join(data_dir, e_name))
  File "/stephen/LogADEmpirical/logadempirical/logdeep/dataset/sample.py", line 14, in read_json
    with open(filename, 'r') as load_f:
FileNotFoundError: [Errno 2] No such file or directory: './dataset/hdfs/embeddings.json'
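
(For context: sliding_window() in logdeep/dataset/sample.py expects ./dataset/hdfs/embeddings.json to map each log event key to a semantic vector. A throwaway stand-in can confirm the rest of the pipeline runs; the sizes below are assumptions, vocab size 20 from the log above and 300 dimensions to match the FastText vectors discussed later in this thread.)

import json
import os

# Hypothetical zero vectors: this only unblocks the FileNotFoundError.
# Real semantic vectors come from the template-embedding scripts below.
os.makedirs('./dataset/hdfs', exist_ok=True)
stub = {str(key): [0.0] * 300 for key in range(20)}  # 20 keys, 300-dim
with open('./dataset/hdfs/embeddings.json', 'w') as f:
    json.dump(stub, f)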
vanhoanglepsa commented 2 years ago

Hi, currently we adopt LogRobust to generate the embedding file. For now, it isn't included in this repository. We will try to update this part next week.
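
For anyone stuck at this step while waiting for the update: LogRobust ("Robust Log-Based Anomaly Detection on Unstable Log Data") derives a semantic vector per template by tokenizing the template text and aggregating pre-trained FastText word vectors, with TF-IDF weighting in the paper. A minimal unweighted sketch of that idea, assuming a word2vec-format vector file at dataset/nlp-word.vec (the path that comes up later in this thread) and a hypothetical templates dict:

import json
import numpy as np

def load_word_vectors(path):
    """Read a word2vec-format .vec file: header line, then 'word v1 ... vN'."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) == dim + 1:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float64)
    return vectors, dim

def template_vector(template, vectors, dim):
    """Average the vectors of the template's in-vocabulary tokens.
    (LogRobust additionally weights tokens by TF-IDF.)"""
    tokens = [t for t in template.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)

word_vecs, dim = load_word_vectors('dataset/nlp-word.vec')
# Event ID -> template text, e.g. taken from HDFS.log_templates.csv
templates = {'E1': 'Receiving block <*> src <*> dest <*>'}
embeddings = {eid: template_vector(t, word_vecs, dim).tolist()
              for eid, t in templates.items()}
with open('dataset/hdfs/embeddings.json', 'w') as f:
    json.dump(embeddings, f)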

stephen-hayne commented 2 years ago

I have read "Robust Log-Based Anomaly Detection on Unstable Log Data" and "Log-based Anomaly Detection Without Log Parsing" with interest (as well as several of the others in the citations).

Will LogRobust be put on GitHub? Or just the data you generated?

vanhoanglepsa commented 2 years ago

We will add the code to generate embeddings to this repository, not only the generated data.

X-zhihao commented 2 years ago

How can we get this HDFS.log_structured.csv?
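
(The structured CSV is Drain's output over the raw log, the "running them through Drain" step from the original post. A sketch using the LogPAI logparser package with its published HDFS settings; the exact import path varies across logparser versions:)

from logparser.Drain import LogParser  # older versions: from logparser import Drain

log_format = '<Date> <Time> <Pid> <Level> <Component>: <Content>'  # HDFS layout
regex = [r'blk_(|-)[0-9]+', r'(\d+\.){3}\d+(:\d+)?']  # block IDs, IP(:port)
parser = LogParser(log_format, indir='./dataset/hdfs/', outdir='./dataset/hdfs/',
                   depth=4, st=0.5, rex=regex)
parser.parse('HDFS.log')  # writes HDFS.log_structured.csv and HDFS.log_templates.csv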

souravs17031999 commented 1 year ago

@vanhoanglepsa has the code been updated to generate embeddings for generic log data? @stephen-hayne were you able to resolve this issue? I am having the same issue.

stephen-hayne commented 1 year ago

@souravs17031999 No, this issue is not resolved.
@vanhoanglepsa Can you please help us to reproduce your work?

pupuu555 commented 1 year ago

Hi, have you added the code to generate embeddings to this repository? I haven't found the file. Could you please tell me how to generate the embeddings, or share the embeddings.json? Thank you so much!!

stephen-hayne commented 1 year ago

Yes - I haven't been able to find the file you mentioned either...

I have succeeded in generating the embedding.json with this code. Hope it helps! https://github.com/xichie/LogADEmpirical/blob/master/generate_template_embedding.py


xichie commented 1 year ago

Hi, the following is the code I used to generate embedding.json. Hope it helps!


from logadempirical.PLELog.data.Embedding import *
from logadempirical.PLELog.data.DataLoader import *
import logging
import json
import os
import numpy as np

class NumpyEncoder(json.JSONEncoder):
    """ Special json encoder for numpy types """
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        return json.JSONEncoder.default(self, obj)

# Specify logger
logger = logging.getLogger('embedding')
logger.setLevel(logging.INFO)
dataset = 'bgl'
save_path = './dataset/bgl'
templatesDir = './dataset/bgl'
log_file = 'BGL_all.log'
logID2Temp, templates = load_templates_from_structured(templatesDir, logger, dataset,
                                                       log_file=log_file)
# Produces templates_BGL.vec, which is read back below
templateVocab = nlp_emb_mergeTemplateEmbeddings_BGL(save_path, templates, dataset, logger)

with open(os.path.join(save_path, 'templates_BGL.vec'), 'r', encoding='utf-8') as reader:
    templateVocab = {}
    line_num = 0
    for line in reader:
        if line_num == 0:
            # Header line: vocabulary size and embedding dimension
            vocabSize, embedSize = [int(x) for x in line.strip().split()]
        else:
            items = line.strip().split()
            if len(items) != embedSize + 1:
                continue
            template_word, template_embedding = items[0], np.asarray(items[1:], dtype=np.float64)
            for logID, temp in logID2Temp.items():
                if temp == template_word:
                    templateVocab[logID] = template_embedding
        line_num += 1
    # Log IDs that did not receive an embedding (duplicate templates)
    replica_logIDs = [logID for logID in logID2Temp if logID not in templateVocab]
    for logID in replica_logIDs:
        temp = logID2Temp[logID]
        reader.seek(0)  # rewind: the first pass consumed the whole file
        line_num = 0
        for line in reader:
            if line_num == 0:
                vocabSize, embedSize = [int(x) for x in line.strip().split()]
            else:
                items = line.strip().split()
                if len(items) != embedSize + 1:
                    continue
                template_word, template_embedding = items[0], np.asarray(items[1:], dtype=np.float64)
                if temp == template_word:
                    templateVocab[logID] = template_embedding
            line_num += 1

with open(os.path.join(save_path, 'embeddings.json'), 'w') as writer:
    json.dump(templateVocab, writer, cls=NumpyEncoder)
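
A quick sanity check on the generated file, assuming it lands at the path above; every event should get a vector of the dimension declared in the templates_BGL.vec header:

import json

with open('./dataset/bgl/embeddings.json') as f:
    emb = json.load(f)
dims = {len(v) for v in emb.values()}
print(len(emb), 'events, vector dims:', dims)  # expect a single dim, e.g. {300}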

pupuu555 commented 1 year ago


Thank you sooooo much!!!!!!

pupuu555 commented 1 year ago


Thank you sooooo much!!!! Wishing you a lifetime of peace and good fortune!

pupuu555 commented 1 year ago


Hi, when I ran the file you gave me, I hit a new issue: FileNotFoundError: [Errno 2] No such file or directory: 'dataset/nlp-word.vec'. How can I get the nlp-word.vec? I can't find a way to generate this file in the code.

sailormoon-c commented 1 year ago


Could I add you on WeChat? This project has been driving me crazy lately. Please! My WeChat ID is: RainyloveStatic

xichie commented 1 year ago


You can download nlp-word.vec here: https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
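
A minimal way to fetch and place it, assuming the archive contains a single wiki-news-300d-1M.vec member and that generate_template_embedding.py looks for dataset/nlp-word.vec as in the error above:

import os
import urllib.request
import zipfile

url = ('https://dl.fbaipublicfiles.com/fasttext/vectors-english/'
       'wiki-news-300d-1M.vec.zip')
urllib.request.urlretrieve(url, 'wiki-news-300d-1M.vec.zip')  # large download
with zipfile.ZipFile('wiki-news-300d-1M.vec.zip') as z:
    z.extract('wiki-news-300d-1M.vec', 'dataset')
# Rename to the filename the embedding script expects
os.rename('dataset/wiki-news-300d-1M.vec', 'dataset/nlp-word.vec')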