CUMLSec / trex

MIT License
91 stars 13 forks

How to interpret the output result? #4

Open randomFriendlyGuy opened 3 years ago

randomFriendlyGuy commented 3 years ago

Hello. I tried the commands you provided in README.md, and I am a bit confused about how to interpret the output. While running ./command/clr_multifield/finetune_any.sh, it ran for 233 epochs in total.

INFO | train | {"epoch": 1, "train_loss": "0.378", "train_nll_loss": "0.001", "train_AUC": "0.6845", "train_wps": "24626.6", "train_ups": "2.04", "train_wpb": "12075.3", "train_bsz": "31.8", "train_num_updates": "129", "train_lr": "9.9903e-06", "train_gnorm": "1.397", "train_clip": "0", "train_oom": "0", "train_loss_scale": "128", "train_train_wall": "62", "train_ppl": "1", "train_wall": "65"}
INFO | valid | {"epoch": 1, "valid_loss": "0.164", "valid_nll_loss": "0", "valid_AUC": "0.9091", "valid_wps": "66378.2", "valid_wpb": "1555.1", "valid_bsz": "4", "valid_ppl": "1", "valid_num_updates": "129"}

After 30 epochs,

INFO | train | {"epoch": 30, "train_loss": "0.07", "train_nll_loss": "0", "train_AUC": "0.9983", "train_wps": "24305.1", "train_ups": "2.02", "train_wpb": "12075.3", "train_bsz": "31.8", "train_num_updates": "3870", "train_lr": "8.73913e-06", "train_gnorm": "0.997", "train_clip": "0", "train_oom": "0", "train_loss_scale": "256", "train_train_wall": "62", "train_ppl": "1", "train_wall": "2980"}
INFO | valid | {"epoch": 30, "valid_loss": "0.132", "valid_nll_loss": "0", "valid_AUC": "0.9762", "valid_wps": "65559.2", "valid_wpb": "1555.1", "valid_bsz": "4", "valid_ppl": "1", "valid_num_updates": "3870", "valid_best_AUC": "0.9762"}

After 233 epochs

INFO | train | {"epoch": 233, "train_loss": "0.01", "train_nll_loss": "0", "train_AUC": "1", "train_wps": "17285.1", "train_ups": "1.43", "train_wpb": "12045.1", "train_bsz": "32", "train_num_updates": "30000", "train_lr": "0", "train_gnorm": "0.436", "train_clip": "0", "train_oom": "0", "train_loss_scale": "12547", "train_train_wall": "53", "train_ppl": "1", "train_wall": "26774"}
INFO | valid | {"epoch": 233, "valid_loss": "0.156", "valid_nll_loss": "0", "valid_AUC": "0.9669", "valid_wps": "50341.1", "valid_wpb": "1555.1", "valid_bsz": "4", "valid_ppl": "1", "valid_num_updates": "30000", "valid_best_AUC": "0.9762"}
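For context, here is how I pull the validation AUC out of these log lines so I can compare epochs (a quick stdlib sketch; the JSON keys match the logs above, but I have trimmed the payloads for brevity):

```python
import json
import re

# Abridged copies of the log lines above (only a few fields kept).
log_lines = [
    'INFO | train | {"epoch": 1, "train_AUC": "0.6845", "train_loss": "0.378"}',
    'INFO | valid | {"epoch": 1, "valid_AUC": "0.9091", "valid_loss": "0.164"}',
    'INFO | valid | {"epoch": 30, "valid_AUC": "0.9762", "valid_loss": "0.132"}',
    'INFO | valid | {"epoch": 233, "valid_AUC": "0.9669", "valid_loss": "0.156"}',
]

def valid_auc_by_epoch(lines):
    """Parse the JSON payload of each 'valid' log line into {epoch: AUC}."""
    out = {}
    for line in lines:
        m = re.search(r"\| valid \| (\{.*\})", line)
        if m:
            rec = json.loads(m.group(1))
            out[int(rec["epoch"])] = float(rec["valid_AUC"])
    return out

aucs = valid_auc_by_epoch(log_lines)
best_epoch = max(aucs, key=aucs.get)
print(best_epoch, aucs[best_epoch])  # epoch 30 has the best valid AUC here
```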

May I ask where the functions for training and validation come from? The model already has over 90% AUC on the validation set at epoch 1, which is much better than on the training set. After 233 epochs, the model seems to overfit the training data, although valid_AUC does not decrease much. Also, how can I preprocess a binary to get the data in data-src/clr_multifield_any, and then run python command/clr_multifield/preprocess_any.py for further preprocessing?

Thank you in advance. I am looking forward to your reply.

peikexin9 commented 3 years ago

Thanks for your interest, @randomFriendlyGuy! My apologies for the late reply; I will get back to you this week.

peikexin9 commented 3 years ago

Hi @randomFriendlyGuy, my apologies for the delayed reply.

"May I ask where the functions for training and validation come from?" If I remember correctly, the training and validation functions come from a random subset of projects such as Binutils, Coreutils, Curl, Diffutils, Findutils, GMP, ImageMagick, Libmicrohttpd, LibTomCrypt, OpenSSL, PuTTy, SQLite, and Zlib.

The model does seem to overfit the training set a bit, but I would say training for 233 epochs is too much :-) 30 epochs looks more reasonable to me. What do you think?

I have uploaded an early version of my micro-tracing code; I hope it helps you understand the pipeline.

Here is how I generate the dataset for pretraining (you can adapt it for fine-tuning too).

First, obtain the corresponding bytes for each function in a binary using objdump. I have uploaded a script at command/pretrain/prepare_json.py, which calls objdump and extracts the bytes (delimited by spaces) for all the binaries provided (included in the README: raw dataset).
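To illustrate the idea, the byte-extraction step boils down to pulling the hex column out of `objdump -d` output (a minimal sketch; the actual logic in prepare_json.py may differ, and the sample disassembly below is hypothetical):

```python
import re

# Sample lines in the format produced by `objdump -d` (hypothetical function).
sample = """0000000000401000 <main>:
  401000:\t48 83 ec 08          \tsub    $0x8,%rsp
  401004:\t31 c0                \txor    %eax,%eax
  401006:\t48 83 c4 08          \tadd    $0x8,%rsp
  40100a:\tc3                   \tret
"""

def extract_bytes(objdump_text):
    """Collect the hex byte column of objdump -d output, space-delimited."""
    # Matches "  <addr>:\t<hex pairs>" at the start of each instruction line.
    byte_re = re.compile(r"^\s*[0-9a-f]+:\s+((?:[0-9a-f]{2}\s)+)")
    out = []
    for line in objdump_text.splitlines():
        m = byte_re.match(line)
        if m:
            out.extend(m.group(1).split())
    return " ".join(out)

print(extract_bytes(sample))  # -> "48 83 ec 08 31 c0 48 83 c4 08 c3"
```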

Then, based on the extracted bytes (which should be in data-raw/funcbytes), I use micro-tracing to trace each function and obtain its micro-traces.

Note that if you are only generating function pairs for fine-tuning, you don't need to execute the functions (byte1-4 are all dummy values; see our paper for details). So you can just use capstone (https://www.capstone-engine.org/lang_python.html) to parse the bytes and obtain the function pairs (e.g., randomly selected from a pool of functions in data-raw/funcbytes). That said, you can still take a look at my micro-tracing code at micro_trace/prepare_code_trace.py to see how I parse the assembly output to generate the final dataset for training; this works because micro-tracing's assembly and capstone-generated assembly are in the same format -- they share the same disassembler.
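The pair construction itself can be sketched like this (the pool structure, function names, and byte strings below are illustrative assumptions, not the actual format of data-raw/funcbytes):

```python
import random

# Hypothetical pool: function name -> byte strings of the same function
# compiled with different options (same-name entries count as "similar").
pool = {
    "md5_update": ["55 48 89 e5", "53 48 83 ec 10"],
    "sha1_final": ["41 57 41 56", "55 41 56 53"],
    "bn_mul_add": ["f3 0f 1e fa", "48 89 f8"],
}

def make_pairs(pool, n, seed=0):
    """Emit (bytes_a, bytes_b, label) triples: label 1 for two variants of
    the same source function, label 0 for two different functions."""
    rng = random.Random(seed)
    names = list(pool)
    pairs = []
    for _ in range(n):
        if rng.random() < 0.5:                       # positive (similar) pair
            name = rng.choice(names)
            a, b = rng.sample(pool[name], 2)
            pairs.append((a, b, 1))
        else:                                        # negative (dissimilar) pair
            n1, n2 = rng.sample(names, 2)
            pairs.append((rng.choice(pool[n1]), rng.choice(pool[n2]), 0))
    return pairs

pairs = make_pairs(pool, 4)
```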

Feel free to ping me if you find this too long :-p Thanks again for your interest!