Open randomFriendlyGuy opened 3 years ago
Thanks for your interest! @randomFriendlyGuy, my apologies for the late reply. I will get back to you this week.
Hi @randomFriendlyGuy, my apologies for the delayed reply.
"May I ask where the functions for training and validation come from?" If I remember correctly, the training and validation sets come from a random subset of, e.g., Binutils, Coreutils, Curl, Diffutils, Findutils, GMP, ImageMagick, Libmicrohttpd, LibTomCrypt, OpenSSL, PuTTy, SQLite, Zlib.
It seems the model overfits the training set a bit, but I would say training for 233 epochs is too much :-) 30 epochs looks more reasonable to me. What do you think?
I have uploaded an early version of my micro-tracing code; I hope it helps you understand.
Here is how I generate the dataset for pretraining (you can adapt it for fine-tuning too):
First, obtain the corresponding bytes for each function in a binary using objdump. I have uploaded a script at command/pretrain/prepare_json.py, which calls objdump and extracts the bytes (delimited by spaces) for all binaries provided (listed in the README: raw dataset).
Then, based on the extracted bytes (which should be in data-raw/funcbytes), I use micro-tracing to trace each function and obtain the micro-traces.
Note that if you are only generating function pairs for fine-tuning, you don't need to execute the functions (byte1-4 are all dummy values; see our paper for details). So you can just use Capstone (https://www.capstone-engine.org/lang_python.html) to parse the bytes and obtain the function pairs (e.g., randomly selected from a pool of functions in data-raw/funcbytes). Having said that, you can still take a look at my micro-tracing code at micro_trace/prepare_code_trace.py to see how I parse the assembly output to generate the final dataset for training (micro-trace's assembly and Capstone-generated assembly are in the same format, since they share the same disassembler).
Feel free to reach out if you find this too long :-p Thanks again for your interest!
Hello. I tried the commands you provided in README.md, and I am a bit confused about how to interpret the output. While running ./command/clr_multifield/finetune_any.sh, it ran 233 epochs in total.

After 30 epochs: [training log screenshot]

After 233 epochs: [training log screenshot]
May I ask where the functions for training and validation come from? We can see the model has over 90% AUC on the validation set at epoch 1, which is much better than on the training set. After 233 epochs, it seems the model overfits the training data, yet the valid_AUC does not decrease much. Also, how should I preprocess a binary to get the data in data-src/clr_multifield_any, and then run python command/clr_multifield/preprocess_any.py for further preprocessing?

Thank you in advance. I am looking forward to your reply.