UTSA-cyber / sceadan

Systematic Classification Engine for Advanced Data ANalysis
GNU General Public License v2.0
22 stars 7 forks source link

Sceadan v 1.2.1

Sceadan (pronounced "scee-den") is a Systematic Classification Engine for Advanced Data ANalysis tool.

Sceadan is originally Old English / Proto-Germanic for "to classify."

Copyright (c) 2012-2013 The University of Texas at San Antonio

License: GPLv2

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Written by:

Dr. Nicole Beebe (nicole.beebe@utsa.edu) and Lishu Liu, Department of Information Systems and Cyber Security, and Laurence Maddox, Department of Computer Science, University of Texas at San Antonio, One UTSA Circle, San Antonio, Texas 78209

Simson L. Garfinkel (simsong@acm.org) Naval Postgraduate School Arlington, VA 22201

Purpose

Sceadan is a C module that uses libsvm to determine the "type" of bulk data based on file content, without using magic numbers, internal data structures, or file system metadata. Currently this is done by extracting features from the bulk data and applying a statical model based on a support vector machine. Current veatures include unigram counts, bigram counts, and other statistical measures.

The sceadan module ships with an application called sceadan_app that demonstrates the use of the program.

Layout

$ROOT/DATA - The training data. Extension is assumed to be the file type. Can contain subdirectories $ROOT/src - Sceadan source code $ROOT/src/sceadan_app - generates vectors $ROOT/tools/ - This directory $ROOT/tools/Makefile - runs all training $ROOT/tools/sceadan_train.py - Python program run by Makefile

Build Instructions

./configure
make

Usage

src/sceadan_app [options] <target> 
    python3 tools/sceadan_train.py <args>

If target is a directory, Sceadan will operate on each file within the directory individual, otherwise Sceadan will classify the single file referenced.

Training Guide

Sceadan comes with a pre-trained model that is compiled into the program. The model is generated by liblinear and then compiled into C code with the program src/mcompile.cpp. This produces a .c output file called src/sceadan_model_precompiled.c which is compiled by the C compiler.

However, you may wish to train your own model. For example, you can:

To train sceadan you will use the files in the tools/ directory. Please see the file doc/training_procedure.md for further information.

Machine Learning Science

Sceadan uses the default support vector machine model:

This version was statistically trained to identify 48 of the following file/data types:

TEXT = 1, /* file type description: Plain text file extensions: .text, .txt */
CSV = 2, /* file type description: Delimited file extensions: .csv */
LOG = 3, /* file type description: Log files file extensions: .log */
HTML = 4, /* file type description: HTML file extensions: .html */
XML = 5, /* file type description: xml file extensions: .xml */
JSON = 7, /* file type description: JSON records file extensions: .json */
JS = 8, /* file type description: JavaScript code file extensions: .js */
JAVA = 9, /* file type description: Java Source Code file extensions: .java */
CSS = 10, /* file type description: css file extensions: .css */
B64 = 11, /* file type description: Base64 encoding file extensions: .b64 */
A85 = 12, /* file type description: Base85 encoding file extensions: .a85 */
B16 = 13, /* file type description: Hex encoding file extensions: .b16 */
URL = 14, /* file type description: URL encoding file extensions: .urlencoded */
RTF = 16, /* file type description: Rich Text File file extensions: .rtf */
TBIRD = 17, /* file type description: Thunderbird Mail Files (data and index) file extensions: .msf and no extension */
PST = 18, /* file type description: MS Outlook PST files file extensions: .pst */
PNG = 19, /* file type description: Portable Network Graphic file extensions:  */
GIF = 20, /* file type description: GIF file extensions: .gif */
TIF = 21, /* file type description: Bi-tonal images file extensions: .tif, .tiff */
JB2 = 22, /* file type description: JBIG2 file extensions: .jb2 */
GZ = 23, /* file type description: ZLIB - DEFLATE compression file extensions: .gz, .gzip, .tgz, .z, .taz */
ZIP = 24, /* file type description: ZLIB - DEFLATE compression file extensions: .zip */
BZ2 = 27, /* file type description: BZ2 file extensions: .bz, .tbz, .bz2, bzip2, .tbz2 */
PDF = 28, /* file type description: PDF file extensions: .pdf */
DOCX = 29, /* file type description: MS-DOCX file extensions: .docx */
XLSX = 30, /* file type description: MS-XLSX file extensions: .xlsx */
PPTX = 31, /* file type description: MS-PPTX file extensions: .pptx */
JPG = 32, /* file type description: JPG file extensions: .jpg, .jpeg */
MP3 = 33, /* file type description: MP3 file extensions: .mp3 */
M4A = 34, /* file type description: AAC file extensions: .m4a */
MP4 = 35, /* file type description: H264 file extensions: .mp4 */
AVI = 36, /* file type description: AVI file extensions: .avi */
WMV = 37, /* file type description: WMV file extensions: .wmv */
FLV = 38, /* file type description: FLV file extensions: .flv */
WAV = 40, /* file type description: Windows Audio File file extensions: .wav */
MOV = 42, /* file type description: Apple Quicktime Move file extensions: .mov */
DOC = 43, /* file type description: MS-DOC file extensions: .doc */
XLS = 44, /* file type description: MS-XLS file extensions: .xls */
PPT = 45, /* file type description: MS-PPT file extensions: .ppt */
FAT = 46, /* file type description: FS-FAT file extensions: .fat */
NTFS = 47, /* file type description: FS-NTFS file extensions: .ntfs */
EXT3 = 48, /* file type description: FS-EXT file extensions: .ext3 */
EXE = 49, /* file type description: Windows Portable Executables (PE) file extensions: .exe */
DLL = 50, /* file type description: Windows Dynamic Link Library files file extensions: .dll */
ELF = 51, /* file type description: Linux Executables file extensions: .elf */
BMP = 52, /* file type description: Bitmap file extensions: .bmp */

This version of Sceadan detects the following types by rule or pattern:

This version has support for enabling advanced feature extraction and usage in future versions and with use of other model files.

NOTE: The formulas contained in the code are not verified as correct and should be used only with extreme caution.

Such features include:

Should advanced features be enabled, the following dependencies will be required:

Modes of operation:

Bugs and Platforms Tested On

This version will build on Linux and Mac environments.

This version calculates hamming weight and entropy values incorrectly. This has no impact on the default operation, since Sceadan does not use these features in this version's prediction model.

Where to find things

You'll want to be familiar with:

Acknowledgments

Foundational research, design, and project management by Nicole L. Beebe, Ph.D. at The University of Texas at San Antonio (UTSA)

Software development by Laurence Maddox and Lishu Liu, UTSA.

Experimentation by Lishu Liu, UTSA.

Special thanks to DJ Bauch, SAIC, for software development assistance.

Special thanks to Minghe Sun, Ph.D., UTSA for support vector machine assistance.

Special thanks to Matt Beebe for tool naming inspiration.

This research was supported in part by a research grant from the U.S. Naval Postgraduate School (Grant No. N00244-11-1-0011), and in part by funding from the UTSA Provost Summer Research Mentorship program.