The DataLoader is not working

Groupone128 commented 8 months ago

[Screenshot: Snipaste_2024-02-29_23-46-32] The DataLoader section highlighted in the red box has a problem: it does not load the data. Instead, the code hangs for several hours and cannot continue running.

Groupone128 commented 8 months ago

Also, when I try to interrupt the program by clicking the stop button, it does not terminate immediately. It takes a few minutes after the click for the program to stop.

ViktorAxelsen commented 8 months ago

Please check that your Python library versions match those listed in the Environment Setup section. Can you set a breakpoint or use print() to find where the script gets stuck?
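Besides print() and breakpoints, one way to locate a hang is Python's standard `faulthandler` module, which can dump the stack of every thread after a timeout. The following is a minimal sketch, not code from this repository; the helper names `watch_for_hang` and `stop_watching` and the 600-second default are hypothetical.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=600.0, stream=None):
    """Dump all thread stacks every `timeout_s` seconds until cancelled.

    `stream` must be a real file object (it needs a file descriptor);
    it defaults to sys.stderr.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def stop_watching():
    """Cancel the pending dumps once the slow section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `watch_for_hang()` just before the training loop and `stop_watching()` after it would print the line each thread is blocked on if the loop stalls for longer than the timeout.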

Groupone128 commented 8 months ago

After setting num_workers=0, it runs correctly.
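For readers hitting the same hang, the workaround amounts to loading data in the main process instead of in worker subprocesses. The sketch below assumes a PyTorch-style DataLoader; the thread does not show the repository's actual loader call, and `loader_kwargs` is a hypothetical helper. The keyword names are those of `torch.utils.data.DataLoader`.

```python
def loader_kwargs(num_workers=0, batch_size=32):
    """Build keyword arguments for a PyTorch-style DataLoader.

    num_workers=0 loads batches in the main process, which avoids
    hangs caused by worker subprocesses at the cost of throughput.
    """
    kwargs = {
        "batch_size": batch_size,
        "shuffle": True,
        "num_workers": num_workers,
    }
    if num_workers > 0:
        # persistent_workers is only valid when worker processes exist;
        # PyTorch raises ValueError if it is True with num_workers=0.
        kwargs["persistent_workers"] = True
    return kwargs
```

With PyTorch installed, this would be used as `DataLoader(dataset, **loader_kwargs(num_workers=0))`. Once the rest of the pipeline works, raising `num_workers` one step at a time shows whether multiprocessing itself is the cause.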

Groupone128 commented 8 months ago

However, when I trained and tested on the non-VPN dataset, the results were very poor, and I'm not sure why. I compared my filenames against the ones you list in the 'cate' directory to make sure the files going into SplitCap were the same. I also added a step before SplitCap to filter out cases such as empty payloads, retransmissions, and miscommunications, as described in your article:

'tshark -Y "(tcp.len > 0) && (!tcp.analysis.retransmission) && !(tcp.analysis.flags && !tcp.analysis.window_update)" -F pcap -r {} -w {}'

I'm not sure if this might have had an impact. Additionally, could you provide the 'train' and 'test' npz files that you used?
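To make the extra filtering step reproducible across many files, the command quoted above can be assembled programmatically. This is a sketch only: the filter string and flags are copied from the command in this comment, while `build_tshark_cmd` and the file names are hypothetical.

```python
import shlex

# Display filter copied verbatim from the command quoted in the thread.
TSHARK_FILTER = (
    "(tcp.len > 0) && (!tcp.analysis.retransmission) && "
    "!(tcp.analysis.flags && !tcp.analysis.window_update)"
)


def build_tshark_cmd(src, dst):
    """Return the tshark argument list that filters `src` into `dst`."""
    return ["tshark", "-Y", TSHARK_FILTER, "-F", "pcap", "-r", src, "-w", dst]


if __name__ == "__main__":
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(build_tshark_cmd("in.pcap", "out.pcap")))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell-quoting problems with the `&&` and `!` characters in the filter.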

Groupone128 commented 8 months ago

Also, I could not find the code that performs the "remove the Ethernet header" step.

ViktorAxelsen commented 8 months ago

Have you tried experiments on other datasets, e.g., ISCX-VPN? Please make sure that you use only the TCP pcap files produced by SplitCap. You do not need to filter out packets or flows again; the code already performs this processing.

This is a temporary link to the NPZ file that you can use for debugging https://drive.google.com/drive/folders/1Pkkt58xtLFe4B4OY1qpr4weeFEH2GjWX?usp=drive_link

jingbobuchi commented 8 months ago

Hello, author. There is a problem with the data on this cloud drive. The number of graphs generated from the npz files there is inconsistent: in particular, the number of payload graphs does not match the number of header graphs. Could you please check it? Thanks.

ViktorAxelsen commented 8 months ago

Sorry about that. You can simply set HEADER_BYTE_PAD_TRUNC_LENGTH = 50 in config.py; this has little effect on the results.
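As a rough illustration of what a pad/truncate length controls, the sketch below forces a header byte sequence to a fixed length. It assumes, from the setting's name only, that header bytes are truncated or zero-padded to that length; the repository's actual implementation is not shown in this thread, and `pad_or_truncate` is a hypothetical helper.

```python
HEADER_BYTE_PAD_TRUNC_LENGTH = 50  # value suggested in the thread


def pad_or_truncate(data, length=HEADER_BYTE_PAD_TRUNC_LENGTH, pad=0):
    """Return `data` as a list of ints with exactly `length` entries."""
    values = list(data[:length])                   # truncate if too long
    values.extend([pad] * (length - len(values)))  # pad if too short
    return values
```

If the header and payload arrays were built with different lengths than the ones in config.py, a mismatch of this kind is one plausible source of inconsistent graph counts.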

Groupone128 commented 8 months ago

Thank you for your help. With the data you provided, the detection performance has improved and is close to the experimental results in your article. I will further investigate what issues may have arisen during my preprocessing.

Groupone128 commented 8 months ago

I carefully inspected the code and examined the 'header_train.npz' file. Ethernet headers are still present in it, contrary to the article, which says to 'remove the Ethernet header'.

I also noticed that packet headers may not correspond one-to-one with payloads. For example, in the depicted flow, the first three packets have a payload length of 0. In the 'header' list, the header of the fourth packet is stored in the fourth position (index 3), while in the 'payload' list, the payload of the fourth packet is stored in the first position (index 0).

How will the MAC information in the Ethernet headers and the misalignment between headers and payloads affect the model's training and recognition?

ViktorAxelsen commented 8 months ago

Thank you for your feedback.

The Ethernet headers in the TOR & VPN datasets are indeed not removed (the NonTor & NonVPN datasets seem to have no Ethernet header), and you can try to fix it.
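One possible fix is to drop the link-layer header from each raw frame before the header bytes are stored. The sketch below is not the repository's code; it assumes the capture uses Ethernet II framing (14-byte header, optionally followed by 802.1Q VLAN tags), and `strip_ethernet_header` is a hypothetical helper.

```python
ETH_HEADER_LEN = 14      # dst MAC (6) + src MAC (6) + EtherType (2)
VLAN_TAG_LEN = 4         # one 802.1Q tag: TCI (2) + inner EtherType (2)
ETHERTYPE_VLAN = 0x8100


def strip_ethernet_header(frame):
    """Return the bytes that follow the Ethernet II header of `frame`."""
    if len(frame) < ETH_HEADER_LEN:
        raise ValueError("frame is shorter than an Ethernet header")
    offset = ETH_HEADER_LEN
    ethertype = int.from_bytes(frame[12:14], "big")
    # Skip any 802.1Q VLAN tags that sit between the MACs and the payload.
    while ethertype == ETHERTYPE_VLAN:
        if len(frame) < offset + VLAN_TAG_LEN:
            raise ValueError("truncated VLAN tag")
        ethertype = int.from_bytes(frame[offset + 2:offset + 4], "big")
        offset += VLAN_TAG_LEN
    return frame[offset:]
```

Because the NonTor & NonVPN captures appear to have no Ethernet header, the link type of each pcap should be checked first so that 14 bytes are not cut from packets that already start at the IP layer.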

The one-to-one matching issue you mentioned does exist, and I considered it when implementing TFE-GNN. In TFE-GNN, we mainly perform flow classification, so the misalignment of individual packets does not affect the integrity of the information within a flow. This is an implementation choice. In CLE-TFE, however, it may be a bug because we also perform packet classification. You can try to fix it in your work.
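One way to keep the two lists in step is to decide per packet whether to keep it and then append to both lists together. The sketch below illustrates that idea under the assumption that packets with an empty payload are dropped entirely; it is not the repository's code, and `split_aligned` is a hypothetical helper.

```python
def split_aligned(packets):
    """Split (header, payload) pairs into two index-aligned lists.

    `packets` is an iterable of (header_bytes, payload_bytes) tuples in
    capture order. Packets with an empty payload are skipped in both
    lists, so headers[i] and payloads[i] always describe the same packet.
    """
    headers, payloads = [], []
    for header, payload in packets:
        if not payload:
            continue  # drop the header too, keeping indices in step
        headers.append(header)
        payloads.append(payload)
    return headers, payloads
```

For the flow described above (three empty-payload packets followed by data packets), the fourth packet's header and payload both land at index 0. An alternative that preserves the headers of empty packets would be to keep a placeholder payload instead of skipping the packet.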

In general, though, if you and I use the same code, you should not see such a large difference in results. Since you reproduced the results with the files I provided, I think the problem can only lie in the preprocessing that happens before the byte-level traffic graph is constructed (especially pcap2npy.py). In my opinion, the version of the packet analysis software installed on the system (e.g., tshark, Wireshark) may have an impact. Unfortunately, I cannot find out which version was in use at the time, but you can follow this clue to debug.
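When chasing a version-dependent difference, it helps to record the tool version next to every preprocessed dataset. A minimal sketch, assuming `tshark` is on the PATH; `tshark_version` is a hypothetical helper and returns None when the tool is not installed.

```python
import shutil
import subprocess


def tshark_version():
    """Return the first line of `tshark --version`, or None if unavailable."""
    exe = shutil.which("tshark")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"],
        capture_output=True,
        text=True,
        check=False,
        timeout=30,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tshark_version())
```

Saving this string alongside the generated npz files makes it possible to tell later whether two runs were preprocessed with the same tool version.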

Groupone128 commented 7 months ago

I ran your code unmodified: I ran pcapng2pcap.py and split the captures with 'SplitCap.exe -s session'. The number of flows in my train.npz matches the file you provided, while my test.npz has one flow fewer than yours. The results are still very different: training accuracy reaches 1.00, but recognition of categories 4 and 5 is very poor in testing.
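To quantify how poor categories 4 and 5 are relative to the others, per-class recall can be computed from the true and predicted test labels. This is a generic sketch in plain Python, not the repository's evaluation code; `per_class_recall` is a hypothetical helper and the labels in the usage note are made up for illustration.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in sorted(total)}
```

For example, `per_class_recall([0, 0, 4, 4, 5], [0, 0, 4, 1, 0])` returns `{0: 1.0, 4: 0.5, 5: 0.0}`. A training accuracy of 1.00 combined with low recall on a few test classes usually points to a train/test distribution mismatch for those classes or to overfitting, which fits the suspicion about preprocessing differences.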

ViktorAxelsen / TFE-GNN

The DataLoader is not working #4