Some questions about trace_parser.py

FDUDSDE / MAGIC

Codes and data for USENIX Security 24 paper "MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning"

MIT License


Some questions about trace_parser.py #9

Closed jiafengren closed 5 months ago

jiafengren commented 6 months ago

I would like to ask why the test part of trace data includes the train part, but the cadets and theia data are not processed in the same way.

metadata = {
    'trace': {
        'train': ['ta1-trace-e3-official-1.json.0', 'ta1-trace-e3-official-1.json.1', 'ta1-trace-e3-official-1.json.2', 'ta1-trace-e3-official-1.json.3'],
        'test': ['ta1-trace-e3-official-1.json.0', 'ta1-trace-e3-official-1.json.1', 'ta1-trace-e3-official-1.json.2', 'ta1-trace-e3-official-1.json.3', 'ta1-trace-e3-official-1.json.4']
    },
    'theia': {
        'train': ['ta1-theia-e3-official-6r.json', 'ta1-theia-e3-official-6r.json.1', 'ta1-theia-e3-official-6r.json.2', 'ta1-theia-e3-official-6r.json.3'],
        'test': ['ta1-theia-e3-official-6r.json.8']
    },
    'cadets': {
        'train': ['ta1-cadets-e3-official.json', 'ta1-cadets-e3-official.json.1', 'ta1-cadets-e3-official.json.2', 'ta1-cadets-e3-official-2.json.1'],
        'test': ['ta1-cadets-e3-official-2.json']
    }
}

Another question is why MemoryObjects and UnnamedPipeObject nodes are not tested as malicious entities.

for e in malicious_entities:
    if e in test_node_map and e in id_nodetype_map and id_nodetype_map[e] != 'MemoryObject' and id_nodetype_map[e] != 'UnnamedPipeObject':
        final_malicious_entities.append(test_node_map[e])
        if e in id_nodename_map:
            malicious_names.append(id_nodename_map[e])
            f.write('{}\t{}\n'.format(e, id_nodename_map[e]))
        else:
            malicious_names.append(e)
            f.write('{}\t{}\n'.format(e, e))

Jimmyokok commented 6 months ago

We have ensured in our pre-processing and evaluation steps that the train and test entities do not overlap. We use the ThreaTrace ground-truth labels, and ThreaTrace uses trace-1.4, theia-6.8 and cadets-2.0 for testing. For Trace, we found that several hundred ground-truth malicious entities are not included in trace.4. We later discovered that they are defined in trace.0, 1, 2 and 3, so those files are also listed in the Trace test split.

As for the second question: we simply discard all MemoryObjects and UnnamedPipeObjects. We take only files, processes and netflows into consideration:


subject_type = pattern_type.findall(line)
if len(subject_type) < 1:
    if 'com.bbn.tc.schema.avro.cdm18.MemoryObject' in line:
        subject_type = 'MemoryObject'
    if 'com.bbn.tc.schema.avro.cdm18.NetFlowObject' in line:
        subject_type = 'NetFlowObject'
    if 'com.bbn.tc.schema.avro.cdm18.UnnamedPipeObject' in line:
        subject_type = 'UnnamedPipeObject'
else:
    subject_type = subject_type[0]

if uuid == '00000000-0000-0000-0000-000000000000' or subject_type in ['SUBJECT_UNIT']:
    continue
id_nodetype_map[uuid] = subject_type
if 'FILE' in subject_type and len(pattern_file_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_file_name.findall(line)[0]
elif subject_type == 'SUBJECT_PROCESS' and len(pattern_process_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_process_name.findall(line)[0]
elif subject_type == 'NetFlowObject' and len(pattern_netflow_object_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_netflow_object_name.findall(line)[0]

Jimmyokok commented 6 months ago

This piece of code in eval.py is specially designed for Trace, to split train and test entities:

x_train = []
for i in range(n_train):
    g = load_entity_level_dataset(dataset_name, 'train', i).to(device)
    x_train.append(model.embed(g).cpu().numpy())
    del g
x_train = np.concatenate(x_train, axis=0)
skip_benign = 0
x_test = []
for i in range(n_test):
    g = load_entity_level_dataset(dataset_name, 'test', i).to(device)
    # Exclude training samples from the test set
    if i != n_test - 1:
        skip_benign += g.number_of_nodes()
    x_test.append(model.embed(g).cpu().numpy())
    del g
x_test = np.concatenate(x_test, axis=0)

n = x_test.shape[0]
y_test = np.zeros(n)
y_test[malicious] = 1.0
malicious_dict = {}
for i, m in enumerate(malicious):
    malicious_dict[m] = i

# Exclude training samples from the test set
test_idx = []
for i in range(x_test.shape[0]):
    if i >= skip_benign or y_test[i] == 1.0:
        test_idx.append(i)
result_x_test = x_test[test_idx]
result_y_test = y_test[test_idx]

Explanation:

Train graphs from preprocessing are malicious-free, thus the entities in these graphs directly become training samples.
Test graphs contain:
- (1) Entities from train graphs (Benign entities from trace-1.0,1,2,3)
- (2) Benign entities from trace-1.4
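- (3) Malicious entities from trace-1.0,1,2,3,4

This code eliminates (1) from the test set, so the train and test sets do not overlap. Malicious entities are kept wherever they appear, because the condition `y_test[i] == 1.0` overrides the `skip_benign` cutoff.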
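For readers who want to try the filtering step in isolation, here is a minimal, self-contained sketch of the same idea using a boolean mask instead of an index loop. It is not the code in eval.py: the function name `exclude_training_entities` and the toy data are assumptions made for illustration, and only the keep condition (`i >= skip_benign or y_test[i] == 1.0`) is taken from the snippet above.

```python
import numpy as np


def exclude_training_entities(x_test, malicious_idx, skip_benign):
    """Drop benign rows that come from the training graphs.

    A row is kept if it lies at or after the `skip_benign` offset
    (i.e. it belongs to the last test graph) or if it is labeled malicious.
    """
    n = x_test.shape[0]
    y_test = np.zeros(n)
    y_test[malicious_idx] = 1.0
    keep = (np.arange(n) >= skip_benign) | (y_test == 1.0)
    return x_test[keep], y_test[keep]


# Toy data (hypothetical): 6 entities with 2-dimensional embeddings.
# The first 4 rows stand in for entities from the training graphs.
x_test = np.arange(12, dtype=float).reshape(6, 2)
malicious_idx = [1, 5]  # one malicious entity sits inside the training prefix
skip_benign = 4

result_x_test, result_y_test = exclude_training_entities(
    x_test, malicious_idx, skip_benign
)
```

In this toy case rows 0, 2 and 3 are dropped as benign training entities, while row 1 survives because it is labeled malicious.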

jiafengren commented 5 months ago

Because we simply discard all MemoryObjects and UnnamedPipeObjects. We take only files, processes and netflows into consideration:

subject_type = pattern_type.findall(line)
if len(subject_type) < 1:
    if 'com.bbn.tc.schema.avro.cdm18.MemoryObject' in line:
        subject_type = 'MemoryObject'
    if 'com.bbn.tc.schema.avro.cdm18.NetFlowObject' in line:
        subject_type = 'NetFlowObject'
    if 'com.bbn.tc.schema.avro.cdm18.UnnamedPipeObject' in line:
        subject_type = 'UnnamedPipeObject'
else:
    subject_type = subject_type[0]

if uuid == '00000000-0000-0000-0000-000000000000' or subject_type in ['SUBJECT_UNIT']:
    continue
id_nodetype_map[uuid] = subject_type
if 'FILE' in subject_type and len(pattern_file_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_file_name.findall(line)[0]
elif subject_type == 'SUBJECT_PROCESS' and len(pattern_process_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_process_name.findall(line)[0]
elif subject_type == 'NetFlowObject' and len(pattern_netflow_object_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_netflow_object_name.findall(line)[0]

Sorry for not seeing your reply in time. I noticed that MemoryObject and UnnamedPipeObject nodes are still used as entities in the provenance graph for training and testing, but are never treated as malicious nodes during testing. I would like to know what the basis for this is.

Jimmyokok commented 5 months ago

Node type consistency between the training and test sets is expected, but we have to admit that the current setting is contrary to what we previously believed. MemoryObject and UnnamedPipeObject nodes were supposed to be removed from the training set.
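One way to remove these node types consistently is to filter the node map and the edge list together, so that no edge is left pointing at a dropped node. The sketch below is only an illustration of that idea, not the repository's implementation: `drop_excluded_nodes`, `EXCLUDED_TYPES` and the `(src, edge_type, dst)` tuple layout are assumptions, and only the `id_nodetype_map` name and the two type strings come from the code quoted in this thread.

```python
# Node types to discard (type strings as used in the quoted parser code).
EXCLUDED_TYPES = {"MemoryObject", "UnnamedPipeObject"}


def drop_excluded_nodes(id_nodetype_map, edges, excluded=EXCLUDED_TYPES):
    """Return a filtered node map and edge list.

    `edges` is assumed to be a list of (src_uuid, edge_type, dst_uuid) tuples.
    An edge is kept only if both of its endpoints survive the node filter.
    """
    kept_nodes = {
        uuid: ntype
        for uuid, ntype in id_nodetype_map.items()
        if ntype not in excluded
    }
    kept_edges = [
        (src, etype, dst)
        for src, etype, dst in edges
        if src in kept_nodes and dst in kept_nodes
    ]
    return kept_nodes, kept_edges


# Toy graph (hypothetical identifiers and edge types).
id_nodetype_map = {
    "p1": "SUBJECT_PROCESS",
    "f1": "FILE_OBJECT_FILE",
    "m1": "MemoryObject",
    "u1": "UnnamedPipeObject",
}
edges = [
    ("p1", "EVENT_WRITE", "f1"),
    ("p1", "EVENT_MMAP", "m1"),
    ("u1", "EVENT_READ", "p1"),
]

kept_nodes, kept_edges = drop_excluded_nodes(id_nodetype_map, edges)
```

Applying the same filter to both the train and test splits keeps the node types consistent between them.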

Honestly speaking, I don't believe anybody can identify exactly which MemoryObjects and UnnamedPipeObjects are malicious from the DARPA TC ground-truth document, and I don't know how the ThreaTrace team managed to label them. But for the sake of consistency and for reference, we immediately performed two training-from-scratch evaluations on E3-Trace:

MemoryObject and UnnamedPipeObject nodes appearing in both the training and test sets:

AUC: 0.999937585589607
F1: 0.992804286807859
PRECISION: 0.986353956917906
RECALL: 0.999339536795139
TN: 615055
FN: 45
TP: 68089
FP: 942
#Test_AUC: 0.9999±0.0000

MemoryObject and UnnamedPipeObject nodes removed from both sets:
```
AUC: 0.9997069102380969
F1: 0.9997429433075163
PRECISION: 0.9998384063932307
RECALL: 0.9996474994492179
TN: 40137
FN: 24
TP: 68061
FP: 11
#Test_AUC: 0.9997±0.0000
```
Additionally, removing these nodes greatly speeds up MAGIC, making it even more time- and memory-efficient.
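The precision, recall and F1 values in both runs can be checked directly from the reported confusion-matrix counts. The short sketch below does that; the helper name `confusion_metrics` is an assumption, and the counts are copied from the two result listings in this comment. AUC cannot be recomputed this way, since it depends on the per-entity scores rather than on the counts.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": false_positive_rate,
    }


# Counts from the run that keeps MemoryObject / UnnamedPipeObject nodes.
with_nodes = confusion_metrics(tp=68089, fp=942, tn=615055, fn=45)

# Counts from the run with those nodes removed from both sets.
without_nodes = confusion_metrics(tp=68061, fp=11, tn=40137, fn=24)

for name, metrics in [("with", with_nodes), ("without", without_nodes)]:
    print(name, {k: round(v, 6) for k, v in metrics.items()})
```

Note that removing the nodes shrinks the benign test population from 615,997 to 40,148 entities (TN + FP), while the number of malicious entities stays almost the same.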

jiafengren commented 5 months ago

Thank you for your patient reply.