preprocess() in Drain.py has issue

ChangNamAn commented 3 years ago

Hi, the regular expressions are set in process_bgl.sh as below.
REGEX1='(0x)[0-9a-fA-F]+'
REGEX2='\d+.\d+.\d+.\d+'
REGEX3='(/[-\w]+)+'
REGEX4='\d+'

They are passed to Drain.py, where preprocess() uses them, through the --regex option.
--regex="$REGEX1 $REGEX2 $REGEX3 $REGEX4" \

The regex value is passed to self.rex in Drain.py, and when I checked, the self.rex list contained only one element.

def preprocess(self, line):
    for currentRex in self.rex:
        line = re.sub(currentRex, '<*>', line)
    return line

When I tested with the preprocess() above, I got the result below.
TP: 3, TN: 46, FP: 3, FN: 3
Precision: 50.00%, Recall: 50.00%, F1-measure: 50.00%
Below is a partial log from the structured log file, where I can see a log that was not parsed.

I then modified preprocess() as below.

def preprocess(self, line):
    rex_list = self.rex[0].split(' ')
    for currentRex in rex_list:
        line = re.sub(currentRex, '<*>', line)
    return line

The test result is better than before.
TP: 6, TN: 44, FP: 1, FN: 0
Precision: 85.71%, Recall: 100.00%, F1-measure: 92.31%

And the log parsing is better, I think.
ciod: Message code 0 is not 51 or 4294967295,e872bbe9,ciod: Message code <*> is not <*> or <*>

What is your intention for preprocess()? Is my modification correct?
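The difference between the two behaviours can be reproduced outside Drain.py. The sketch below is only an illustration: it uses a standalone preprocess() function (not the actual LogParser method) together with the four BGL regexes and the log line quoted above, and compares a one-element list holding the space-joined string with a proper four-element list.

```python
import re

def preprocess(line, rex):
    # Standalone version of the loop in Drain.py (assumption: same logic,
    # but taking the regex list as an argument instead of self.rex).
    for current_rex in rex:
        line = re.sub(current_rex, '<*>', line)
    return line

REGEXES = [r'(0x)[0-9a-fA-F]+', r'\d+.\d+.\d+.\d+', r'(/[-\w]+)+', r'\d+']
line = 'ciod: Message code 0 is not 51 or 4294967295'

# One element: the four patterns joined by spaces act as a single regex,
# which does not match this line, so nothing is masked.
joined = [' '.join(REGEXES)]
unchanged = preprocess(line, joined)

# Four elements: each pattern is applied in turn.
masked = preprocess(line, REGEXES)

print(unchanged)  # ciod: Message code 0 is not 51 or 4294967295
print(masked)     # ciod: Message code <*> is not <*> or <*>
```

Splitting the single element on spaces recovers the four patterns here only because none of them contains a space.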

ChangNamAn commented 3 years ago

Hi,

I found the correct way when I checked the original Drain_demo.py. (https://github.com/logpai/logparser/blob/master/demo/Drain_demo.py)

# Regular expression list for optional preprocessing (default: [])
regex = [
    r'blk_(|-)[0-9]+',  # block id
    r'(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)',  # IP
    r'(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$',  # Numbers
]
st = 0.5  # Similarity threshold
depth = 4  # Depth of all leaf nodes

parser = Drain.LogParser(log_format, indir=input_dir, outdir=output_dir, depth=depth, st=st, rex=regex)
parser.parse(log_file)

--regex="$REGEX1 $REGEX2 $REGEX3 $REGEX4" is passed as a list with a single element.

Please check it.

HelenGuohx commented 3 years ago

Thank you for letting me know. I used nargs='*' in argparse to receive multiple inputs as a list (check this for usage). But I should use --regex $REGEX1 $REGEX2 $REGEX3 instead of --regex="$REGEX1 $REGEX2 $REGEX3" in the shell scripts.
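As a rough illustration of the nargs='*' behaviour, the sketch below feeds argparse the two argument lists that the shell would produce for each form. The argv lists are hand-written stand-ins for what the shell passes (an assumption, since no shell is involved here); the add_argument call mirrors the one in data_process.py.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--regex", nargs='*', help="regex to clean log messages", default='')

regexes = [r'(0x)[0-9a-fA-F]+', r'\d+.\d+.\d+.\d+', r'(/[-\w]+)+']

# --regex="$REGEX1 $REGEX2 $REGEX3": the shell hands over one argument.
together = parser.parse_args(['--regex=' + ' '.join(regexes)])

# --regex $REGEX1 $REGEX2 $REGEX3: the shell hands over one argument per regex.
separate = parser.parse_args(['--regex'] + regexes)

print(len(together.regex))  # 1
print(len(separate.regex))  # 3
```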

ChangNamAn commented 3 years ago

That does not cover a regex that contains spaces, for example:
REGEX3='(?<=Warning: we failed to resolve data source name )[\w\s]+'
argparse receives it as:
['(0x)[0-9a-fA-F]+', '\d+.\d+.\d+.\d+', '(?<=Warning:', 'we', 'failed', 'to', 'resolve', 'data', 'source', 'name', ')[\w\s]+', '\d+']
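A small sketch of why this breaks, under two assumptions: str.split() stands in for the shell's word splitting of unquoted variables, and the sample log line is hypothetical, made up only to exercise the pattern. The fragments produced by splitting are not valid patterns on their own, while the intact regex works as intended.

```python
import re

REGEX3 = r'(?<=Warning: we failed to resolve data source name )[\w\s]+'

# Approximates what an unquoted $REGEX3 turns into on the command line.
fragments = REGEX3.split()

# The first fragment has an unterminated group, so it cannot be compiled.
try:
    re.compile(fragments[0])
    first_fragment_ok = True
except re.error:
    first_fragment_ok = False

# Hypothetical log line, used only for illustration.
sample = 'Warning: we failed to resolve data source name node1 node2'
masked = re.sub(REGEX3, '<*>', sample)

print(fragments[0])       # (?<=Warning:
print(first_fragment_ok)  # False
print(masked)             # Warning: we failed to resolve data source name <*>
```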

HelenGuohx commented 3 years ago

Try this in process_tbird.sh:
--regex "$REGEX1" "$REGEX2" "$REGEX3" "$REGEX4"

data_process.py:

parser.add_argument("--regex", nargs='*', help="regex to clean log messages", default='')

HelenGuohx / logbert

preprocess() in Drain.py has issue #8