IntelLabs / control-flag

A system to flag anomalous source code expressions by learning typical expressions from training data
MIT License
Segmentation fault while running scan_for_anomalies.sh #15

Open qoega opened 2 years ago

qoega commented 2 years ago

I tried to scan the ClickHouse codebase, but the scanner crashed. The ClickHouse codebase is available on GitHub:

git clone git@github.com:ClickHouse/ClickHouse.git clickhouse
scripts/scan_for_anomalies.sh -d /home/qoega/clickhouse/src -t ./c_lang_if_stmts_6000_gitrepos.ts -o /home/qoega/control-flag/out/
Training: start.
Trie L1 build took: 1010.554s
Trie L2 build took: 487.217s
Training: complete.
Storing logs in /home/qoega/control-flag/out/
scripts/scan_for_anomalies.sh: line 84: 72697 Segmentation fault      ${SCRIPTS_DIR}/../bin/cf_file_scanner -t ${TRAIN_FILE} -s ${SCAN_FILE_LIST} -c ${MAX_AUTOCORRECT_COST} -n ${MAX_AUTOCORRECT_RESULTS} -j ${NUM_SCAN_THREADS} -o ${OUTPUT_DIR} -a ${ANOMALY_THRESHOLD} -l ${LANGUAGE}

PS: Was c_lang_if_stmts_6000_gitrepos.ts trained on C projects only, or on C++ projects as well? I did not find https://github.com/ClickHouse/ClickHouse in the C++ projects list. It is written in C++ and has 20K stars and 800 contributors.

nhasabni commented 2 years ago

Hi @qoega,

Thanks for trying out ControlFlag. c_lang_if_stmts_6000_gitrepos.ts is the dataset generated from repositories that use C as their primary language. It should also work for scanning C++ projects, although it is more effective on projects whose primary language is C.

I will try to reproduce the crash on my end. Note that we have also released smaller training datasets for limited-memory devices, although memory capacity does not appear to be the cause of this crash.

xback commented 2 years ago

I have also encountered this bug. What is its current status?

Thank you

nhasabni commented 2 years ago

Hi @xback,

Thanks for trying out ControlFlag. Did you try using a smaller version of the dataset? We have seen that most of these crashes are caused by using a dataset larger than the memory available on the system. Thanks.

xback commented 2 years ago

Hi, the test ran on a system with 1 TB of RAM (really), of which more than 900 GB was free.

nhasabni commented 2 years ago

Thanks for the info, @xback. Let us look into reproducing the issue. Would you mind pointing us to the repository that you have been scanning with ControlFlag (if it is a public repository)? That would help us expedite the process. Thanks.

xback commented 2 years ago

Unfortunately, the repo is not public, but I'll try to provide more details or a reproducer.
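
For anyone else trying to build a reproducer from a private codebase, one possible approach is to bisect the list of scanned files until a single crashing file remains. The sketch below is only an illustration and not part of ControlFlag: `find_crashing_file` and `scanner_crashes` are hypothetical helpers, and it assumes that the crash is deterministic, that a single input file triggers it, and that the file list passed via `-s` contains one path per line.

```python
import signal
import subprocess
import tempfile
from typing import Callable, List, Sequence


def find_crashing_file(files: Sequence[str],
                       crashes: Callable[[List[str]], bool]) -> str:
    """Bisect `files` down to one file for which `crashes` returns True.

    Assumes the crash is deterministic and triggered by a single file.
    """
    candidates = list(files)
    if not candidates or not crashes(candidates):
        raise ValueError("the full file list does not reproduce the crash")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        # If the first half crashes, the culprit is in it; otherwise,
        # under the single-file assumption, it is in the second half.
        candidates = first_half if crashes(first_half) else candidates[mid:]
    return candidates[0]


def scanner_crashes(command_prefix: List[str]) -> Callable[[List[str]], bool]:
    """Build a predicate that runs a scanner command on a temporary file list.

    `command_prefix` is your usual scanner invocation without the
    `-s <file list>` argument. Assumption: the list holds one path per line.
    """
    def check(files: List[str]) -> bool:
        with tempfile.NamedTemporaryFile("w", suffix=".txt") as listing:
            listing.write("\n".join(files) + "\n")
            listing.flush()
            result = subprocess.run(command_prefix + ["-s", listing.name],
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
        # On POSIX, a negative return code means death by that signal.
        return result.returncode == -signal.SIGSEGV
    return check
```

To use it, pass the full list of scanned source files together with `scanner_crashes([...])`, where the list holds the `cf_file_scanner` command and the same arguments that `scan_for_anomalies.sh` passes, except `-s`. If the crash depends on a combination of files rather than a single one, the bisection may settle on the wrong file, so it is worth re-running the scanner on the reported file alone to confirm.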

nhasabni commented 2 years ago

Hi @xback, we scanned the ClickHouse code using the large version of the dataset, and the scan finished without any issues. In short, we do not see the crash on our end. Please provide us with a reproducer at your convenience. Thanks.