FDUDSDE / MAGIC

Codes and data for USENIX Security 24 paper "MAGIC: Detecting Advanced Persistent Threats via Masked Graph Representation Learning"
MIT License

Some questions about trace_parser.py #9

Closed by jiafengren 5 months ago

jiafengren commented 6 months ago

I would like to ask why the test part of trace data includes the train part, but the cadets and theia data are not processed in the same way.

metadata = {
    'trace':{
        'train': ['ta1-trace-e3-official-1.json.0', 'ta1-trace-e3-official-1.json.1',
                  'ta1-trace-e3-official-1.json.2', 'ta1-trace-e3-official-1.json.3'],
        'test': ['ta1-trace-e3-official-1.json.0', 'ta1-trace-e3-official-1.json.1',
                 'ta1-trace-e3-official-1.json.2', 'ta1-trace-e3-official-1.json.3',
                 'ta1-trace-e3-official-1.json.4']
    },
    'theia':{
        'train': ['ta1-theia-e3-official-6r.json', 'ta1-theia-e3-official-6r.json.1',
                  'ta1-theia-e3-official-6r.json.2', 'ta1-theia-e3-official-6r.json.3'],
        'test': ['ta1-theia-e3-official-6r.json.8']
    },
    'cadets':{
        'train': ['ta1-cadets-e3-official.json', 'ta1-cadets-e3-official.json.1',
                  'ta1-cadets-e3-official.json.2', 'ta1-cadets-e3-official-2.json.1'],
        'test': ['ta1-cadets-e3-official-2.json']
    }
}

Another question is why MemoryObjects and UnnamedPipeObject nodes are not tested as malicious entities.

for e in malicious_entities:
    if e in test_node_map and e in id_nodetype_map and id_nodetype_map[e] != 'MemoryObject' and id_nodetype_map[e] != 'UnnamedPipeObject':
        final_malicious_entities.append(test_node_map[e])
        if e in id_nodename_map:
            malicious_names.append(id_nodename_map[e])
            f.write('{}\t{}\n'.format(e, id_nodename_map[e]))
        else:
            malicious_names.append(e)
            f.write('{}\t{}\n'.format(e, e))
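
The asymmetry raised in the first question can be checked mechanically. The following is a minimal sketch, assuming only the `metadata` file lists quoted in this comment; the helper name `split_overlap` is hypothetical and is not part of trace_parser.py.

```python
# File lists copied from the metadata dictionary quoted in the question.
metadata = {
    'trace': {
        'train': ['ta1-trace-e3-official-1.json.0', 'ta1-trace-e3-official-1.json.1',
                  'ta1-trace-e3-official-1.json.2', 'ta1-trace-e3-official-1.json.3'],
        'test': ['ta1-trace-e3-official-1.json.0', 'ta1-trace-e3-official-1.json.1',
                 'ta1-trace-e3-official-1.json.2', 'ta1-trace-e3-official-1.json.3',
                 'ta1-trace-e3-official-1.json.4'],
    },
    'theia': {
        'train': ['ta1-theia-e3-official-6r.json', 'ta1-theia-e3-official-6r.json.1',
                  'ta1-theia-e3-official-6r.json.2', 'ta1-theia-e3-official-6r.json.3'],
        'test': ['ta1-theia-e3-official-6r.json.8'],
    },
    'cadets': {
        'train': ['ta1-cadets-e3-official.json', 'ta1-cadets-e3-official.json.1',
                  'ta1-cadets-e3-official.json.2', 'ta1-cadets-e3-official-2.json.1'],
        'test': ['ta1-cadets-e3-official-2.json'],
    },
}


def split_overlap(meta):
    """Hypothetical helper: files listed in both 'train' and 'test', per dataset."""
    return {
        name: sorted(set(splits['train']) & set(splits['test']))
        for name, splits in meta.items()
    }


overlap = split_overlap(metadata)
for name, files in overlap.items():
    print(name, len(files))
```

For these lists, only `trace` shares files between the two splits (the four training files); `theia` and `cadets` share none.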

Jimmyokok commented 6 months ago

This piece of code in eval.py is specially designed for Trace, to split train and test entities:

x_train = []
for i in range(n_train):
    g = load_entity_level_dataset(dataset_name, 'train', i).to(device)
    x_train.append(model.embed(g).cpu().numpy())
    del g
x_train = np.concatenate(x_train, axis=0)
skip_benign = 0
x_test = []
for i in range(n_test):
    g = load_entity_level_dataset(dataset_name, 'test', i).to(device)
    # Exclude training samples from the test set
    if i != n_test - 1:
        skip_benign += g.number_of_nodes()
    x_test.append(model.embed(g).cpu().numpy())
    del g
x_test = np.concatenate(x_test, axis=0)

n = x_test.shape[0]
y_test = np.zeros(n)
y_test[malicious] = 1.0
malicious_dict = {}
for i, m in enumerate(malicious):
    malicious_dict[m] = i

# Exclude training samples from the test set
test_idx = []
for i in range(x_test.shape[0]):
    if i >= skip_benign or y_test[i] == 1.0:
        test_idx.append(i)
result_x_test = x_test[test_idx]
result_y_test = y_test[test_idx]
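
To make the index arithmetic in the excerpt easier to follow, here is a dependency-free sketch of the same selection rule on toy data. It assumes, as in the excerpt, that every test graph except the last is also a training graph. The function name `select_test_indices` and the graph sizes are made up for illustration; the real code operates on NumPy arrays of embeddings.

```python
def select_test_indices(graph_sizes, malicious):
    """Return the global node indices kept for evaluation.

    graph_sizes: number of nodes in each test graph, in load order.
    malicious:   global node indices labelled as malicious.
    """
    # All graphs except the last one are shared with the training split.
    skip_benign = sum(graph_sizes[:-1])
    total = sum(graph_sizes)
    malicious = set(malicious)
    # Keep a node if it lies in the last graph or is labelled malicious.
    return [i for i in range(total) if i >= skip_benign or i in malicious]


# Toy example: three graphs with 3, 2 and 4 nodes; nodes 1 and 6 are malicious.
kept = select_test_indices([3, 2, 4], malicious=[1, 6])
print(kept)  # [1, 5, 6, 7, 8]
```

Node 1 belongs to a graph that is also used for training, yet it is kept because it is labelled malicious. Benign nodes 0, 2, 3 and 4 are dropped.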

Explanation:

  • Because we simply discard all MemoryObjects and UnnamedPipeObjects. We only consider files, processes and network flows:

subject_type = pattern_type.findall(line)
if len(subject_type) < 1:
    if 'com.bbn.tc.schema.avro.cdm18.MemoryObject' in line:
        subject_type = 'MemoryObject'
    if 'com.bbn.tc.schema.avro.cdm18.NetFlowObject' in line:
        subject_type = 'NetFlowObject'
    if 'com.bbn.tc.schema.avro.cdm18.UnnamedPipeObject' in line:
        subject_type = 'UnnamedPipeObject'
else:
    subject_type = subject_type[0]

if uuid == '00000000-0000-0000-0000-000000000000' or subject_type in ['SUBJECT_UNIT']:
    continue
id_nodetype_map[uuid] = subject_type
if 'FILE' in subject_type and len(pattern_file_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_file_name.findall(line)[0]
elif subject_type == 'SUBJECT_PROCESS' and len(pattern_process_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_process_name.findall(line)[0]
elif subject_type == 'NetFlowObject' and len(pattern_netflow_object_name.findall(line)) > 0:
    id_nodename_map[uuid] = pattern_netflow_object_name.findall(line)[0]

jiafengren commented 5 months ago

Sorry for not seeing your reply in time. I noticed that MemoryObject and UnnamedPipeObject nodes are still used as entities in the provenance graph for training and testing, but are not treated as malicious nodes during testing. I would like to know the basis for this.

Jimmyokok commented 5 months ago

Node type consistency between the training and test sets is expected, but we have to admit that the current setting is contrary to what we previously believed. MemoryObject and UnnamedPipeObject nodes are supposed to be removed from the training set.
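
One way such a removal could look is sketched below. This is not taken from the MAGIC code base: the function `drop_node_types`, its input layout (a node-to-type dict plus an edge list) and the toy node and edge names are all assumptions made for illustration. The idea is to drop the excluded node types together with their edges and re-index the remaining nodes, applied identically to both splits.

```python
EXCLUDED_TYPES = {'MemoryObject', 'UnnamedPipeObject'}


def drop_node_types(node_types, edges, excluded=EXCLUDED_TYPES):
    """Hypothetical helper: remove excluded node types and re-index the rest.

    node_types: dict mapping node id -> type string.
    edges:      list of (src_id, dst_id, edge_type) tuples.
    Returns (new_index, kept_edges), where new_index maps each surviving
    node id to a contiguous integer index.
    """
    kept_nodes = [n for n, t in node_types.items() if t not in excluded]
    new_index = {n: i for i, n in enumerate(kept_nodes)}
    # An edge survives only if both of its endpoints survive.
    kept_edges = [
        (new_index[src], new_index[dst], etype)
        for src, dst, etype in edges
        if src in new_index and dst in new_index
    ]
    return new_index, kept_edges


# Toy graph with illustrative ids and edge labels.
node_types = {
    'p1': 'SUBJECT_PROCESS',
    'f1': 'FILE_OBJECT_FILE',
    'm1': 'MemoryObject',
    'u1': 'UnnamedPipeObject',
}
edges = [
    ('p1', 'f1', 'EVENT_WRITE'),
    ('p1', 'm1', 'EVENT_MMAP'),
    ('u1', 'p1', 'EVENT_READ'),
]
new_index, kept_edges = drop_node_types(node_types, edges)
print(new_index)   # {'p1': 0, 'f1': 1}
print(kept_edges)  # [(0, 1, 'EVENT_WRITE')]
```

Running the same filter over the training and test graphs would keep the node type sets of the two splits consistent, which is the property described above.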

Honestly speaking, I don't believe anybody is able to identify exactly which MemoryObjects and UnnamedPipeObjects are malicious from the DARPA TC ground-truth document, and I don't know how the ThreaTrace team managed to label them. But for the sake of consistency and for reference, we immediately performed two training-from-scratch evaluations on E3-Trace:

jiafengren commented 5 months ago

Thank you for your patient reply.