andymeneely / chromium-history

Scripts and data related to Chromium's history

Which of these are source code? #114

Closed andymeneely closed 10 years ago

andymeneely commented 10 years ago

For our final analysis of aggregating metrics over filepaths, we only care about source code files. We care about other types of files as a part of our metrics, but ultimately we want to compare the same languages of files against each other.

Documentation and asset files like html, png, etc. don't count as source code. So we'll need to flag each filepath in the database as either source code or not. But first, we need to define the extensions we consider to be source code.

I ran this query on the database to get all file extensions:

SELECT extension, count(*) num FROM 
  (SELECT filepath, substring(filepath from '\.\w*$' ) as extension from filepaths) as file_exts 
  GROUP BY extension
  ORDER BY num DESC

I got these results. Now clearly most of these are NOT source code. We can limit our study to the basic C and C++ code, but if we have vulnerabilities in other languages maybe that won't work.

      extension       |  num
----------------------+-------
 .cc                  | 40282
 .h                   | 38434
 .png                 | 19231
 .txt                 | 16787
 .checksum            | 15236
 .html                |  8266
 .py                  |  7136
 .js                  |  5910
                      |  5189
 .c                   |  5125
 .json                |  3265
 .mm                  |  2508
 .xtb                 |  1714
 .cpp                 |  1421
 .pm                  |  1353
 .test                |  1247
 .pl                  |  1105
 .java                |  1102
 .css                 |   869
 .idl                 |   789
 .gyp                 |   686
 .jpg                 |   549
 .sh                  |   451
 .gypi                |   435
 .pod                 |   391
 .hpp                 |   391
 .al                  |   372
 .xml                 |   343
 .scons               |   317
 .vcproj              |   278
 .chromium            |   271
 .mk                  |   212
 .in                  |   212
 .proto               |   205
 .patch               |   204
 .pem                 |   185
 .out                 |   172
 .t                   |   171
 .rc                  |   160
 .gif                 |   158
 .dll                 |   156
 .onc                 |   140
 .vsprops             |   138
 .xib                 |   137
 .nmf                 |   137
 .zip                 |   132
 .crx                 |   129
 .grd                 |   129
 .sql                 |   118
 .mn                  |   117
 .cfg                 |   113
 .sha1                |   102
 .svg                 |   101
 .dic                 |    99
 .bat                 |    98
 .aff                 |    97
 .dsc                 |    87
 .am                  |    87
 .webm                |    80
 .m                   |    76
 .def                 |    76
 .htm                 |    72
 .pdf                 |    72
 .plist               |    71
 .tcl                 |    69
 .gn                  |    68
 .S                   |    65
 .packlist            |    63
 .m4                  |    63
 .good                |    61
 .template            |    56
 .rst                 |    56
 .srpc                |    54
 .google              |    54
 .shader              |    54
 .expected            |    53
 .cg                  |    51
 .hh                  |    49
 .cvsignore           |    47
 .isolate             |    46
 .mp4                 |    46
 .wrong               |    45
 .gitignore           |    43
 .s                   |    42
 .ogg                 |    42
 .so                  |    42
 .cs                  |    41
 .jar                 |    41
 .cur                 |    40
 .tmpl                |    40
 .exe                 |    37
 .pbxproj             |    37
 .1                   |    35
 .vert                |    34
 .la                  |    34
 .ico                 |    33
 .3pm                 |    33
 .sln                 |    33
 .wav                 |    33
 .manifest            |    32
 .glsl                |    32
 .ver                 |    32
 .bdic                |    31
 .rules               |    31
 .dat                 |    31
 .rgs                 |    31
 .tex                 |    30
 .applescript         |    30
 .man                 |    29
 .frag                |    28
 .tga                 |    26
 .ix                  |    25
 .spec                |    25
 .3ds                 |    24
 .asm                 |    23
 .der                 |    23
 .sb                  |    22
 .webp                |    22
 .rfx                 |    22
 .prop                |    22
 .yaml                |    21
 .hyph                |    21
 .inc                 |    21
 .word                |    21
 .pat                 |    21
 .dic_delta           |    21
 .cgi                 |    20
 .crt                 |    19
 .hxx                 |    18
 .dylib               |    18
 .ogv                 |    17
 .vcxproj             |    16
 .dox                 |    16
 .pump                |    16
 .exists              |    16
 .filters             |    16
 .md                  |    16
 .ac                  |    15
 .php                 |    15
 .conf                |    15
 .csv                 |    15
 .xsl                 |    15
 .guess               |    15
 .user                |    15
 .glade               |    15
 .xhtml               |    15
 .cnf                 |    15
 .log                 |    15
 .sub                 |    15
 .dart                |    14
 .diff                |    14
 .pxi                 |    14
 .gen                 |    14
 .cxx                 |    14
 .gtestjs             |    13
 .orig                |    13
 .install             |    13
 .version             |    13
 .mms                 |    13
 .tcc                 |    13
 .pak                 |    13
 .nexe                |    13
 .jinja2              |    13
 .sug                 |    12
 .awk                 |    12
 .p12                 |    12
 .yml                 |    12
 .dsp                 |    12
 .e2x                 |    12
 .fx                  |    12
 .bin                 |    12
 .2                   |    12
 .pdb                 |    11
 .pbl                 |    10
 .icns                |    10
 .aidl                |    10
 .bmp                 |    10
 .wxs                 |    10
 .resx                |    10
 .results             |    10
 .reg                 |    10
 .README              |    10
 .raw                 |    10
 .grdp                |     9
 .3                   |     9
 .pch                 |     9
 .ini                 |     9
 .pyx                 |     9
 .stdin               |     9
 .gperf               |     9
 .db                  |     9
 .coffee              |     9
 .croc                |     9
 .rake                |     9
 .y                   |     8
 .stdout              |     8
 .xcconfig            |     8
 .packproj            |     8
 .rtf                 |     8
 .include             |     8
 .TXT                 |     8
 .o3d                 |     8
 .o                   |     8
 .5                   |     7
 .win                 |     7
 .nc                  |     7
 .gc                  |     7
 .doc                 |     7
 .rb                  |     7
 .dot                 |     7
 .yuv                 |     7
 .avi                 |     7
 .strings             |     7
 .md5                 |     7
 .l                   |     7
 .morph               |     7
 .cmd                 |     7
 .sed                 |     6
 .settings            |     6
 .exp                 |     6
 .tiff                |     6
 .flags               |     6
 .graffle             |     6
 .com                 |     6
 .lib                 |     6
 .4                   |     6
 .pac                 |     6
 .dirs                |     6
 .pol                 |     5
 .gz                  |     5
 .jpeg                |     5
 .syn                 |     5
 .r                   |     5
 .sqlite              |     5
 .tests               |     5
 .config              |     5
 .pexe                |     5
 .csproj              |     5
 .dtd                 |     5
 .imports             |     5
 .g                   |     5
 .sigs                |     5
 .deps                |     5
 .bz2                 |     5
 .ld                  |     5
 .msvc                |     5
 .properties          |     5
 .il                  |     5
 .pth                 |     5
 .targ                |     5
 .po                  |     5
 .keep                |     5
 .tab                 |     5
 .h264                |     5
 .eg                  |     4
 .mov                 |     4
 .mp3                 |     4
 .msg                 |     4
 .explain             |     4
 .xul                 |     4
 .nib                 |     4
 .lst                 |     4
 .swf                 |     4
 .a                   |     4
 .3gp                 |     4
 .key                 |     4
 .src                 |     4
 .dds                 |     4
 .obs                 |     4
 .ksh                 |     4
 .opt                 |     4
 .certs               |     4
 .7z                  |     4
 .bash                |     4
 .dsw                 |     4
 .rgb                 |     4
 .perl                |     4
 .go                  |     4
 .psd                 |     4
 .tokenizers          |     4
 .hidden              |     4
 .info                |     4
 .saves               |     3
 .el                  |     3
 .tgz                 |     3
 .old                 |     3
 .flac                |     3
 .common              |     3
 .message             |     3
 .pc                  |     3
 .release             |     3
 .ai                  |     3
 .mgw                 |     3
 .adm                 |     3
 .mingw               |     3
 .bak                 |     3
 .hlsl                |     3
 .emf                 |     3
 .eps                 |     3
 .make                |     3
 .mak                 |     3
 .pkcs8               |     3
 .list                |     3
 .pkgproj             |     3
 .LIGHTTPD            |     3
 .PL                  |     3
 .pxd                 |     3
 .woff                |     3
 .mojom               |     3
 .tac                 |     3
 .OPENSSL             |     3
 .localstorage        |     3
 .utf8                |     3
 .x                   |     3
 .xsd                 |     3
 .0                   |     3
 .i                   |     3
 .p7b                 |     3
 .LIB                 |     3
 .prefs               |     3
 .fragment            |     3
 .rc_template         |     2
 .class               |     2
 .heap                |     2
 .ino                 |     2
 .inputs              |     2
 .xsx                 |     2
 .installers          |     2
 .googleurl           |     2
 .xml1                |     2
 .xml0                |     2
 .gni                 |     2
 .jni                 |     2
 .collada_edge        |     2
 .dmg                 |     2
 .gitmodules          |     2
 .keystore            |     2
 .win32               |     2
 .buckets             |     2
 .LGPL                |     2
 .vxworks             |     2
 .libnet              |     2
 .vsixmanifest        |     2
 .links               |     2
 .converter           |     2
 .lnk                 |     2
 .converter_edge      |     2
 .vsct                |     2
 .vs                  |     2
 .profile             |     2
 .git                 |     2
 .cron                |     2
 .vim                 |     2
 .m4a                 |     2
 .main                |     2
 .Makefile            |     2
 .browser             |     2
 .uuid                |     2
 .uu                  |     2
 .manpages            |     2
 .mc                  |     2
 .fs                  |     2
 .types               |     2
 .mgp                 |     2
 .mht                 |     2
 .two                 |     2
 .9                   |     2
 .ts                  |     2
 .                    |     2
 .three               |     2
 .templ               |     2
 .exports             |     2
 .MPL                 |     2
 .myspell             |     2
 .api                 |     2
 .syntax              |     2
 .darwin              |     2
 .nmake               |     2
 .8                   |     2
 .data                |     2
 .notpy               |     2
 .bcb                 |     2
 .52                  |     2
 .nsproxy             |     2
 .sst                 |     2
 .dd                  |     2
 .6                   |     2
 .def_template        |     2
 .snk                 |     2
 .one                 |     2
 .SKIP                |     2
 .original            |     2
 .7                   |     2
 .sha512              |     2
 .ebuild              |     2
 .ppd                 |     2
 .sdef                |     2
 .rgs_template        |     2
 .pbr                 |     2
 .rep                 |     2
 .pcm                 |     2
 .rej                 |     2
 .rdf                 |     2
 .idl_template        |     2
 .aspx                |     2
 .PIXExp              |     2
 .pkg                 |     2
 .dri                 |     2
 .ib_ini              |     2
 .htp                 |     2
 .tpl                 |     1
 .3DFX                |     1
 .50                  |     1
 .abc                 |     1
 .ac3                 |     1
 .adb                 |     1
 .adml                |     1
 .admx                |     1
 .adts                |     1
 .aiff                |     1
 .amd64               |     1
 .AMIWIN              |     1
 .apk                 |     1
 .appcache            |     1
 .asf                 |     1
 .automated_ui_tests  |     1
 .BEOS                |     1
 .Borland             |     1
 .bsdiff              |     1
 .c2                  |     1
 .cache               |     1
 .canonical           |     1
 .cc_ZLIB             |     1
 .cdd                 |     1
 .checker_innocent    |     1
 .chromium_os         |     1
 .classpath           |     1
 .client              |     1
 .compound            |     1
 .Config              |     1
 .cppclean            |     1
 .custom              |     1
 .CV                  |     1
 .CYGWIN              |     1
 .D3D                 |     1
 .dae                 |     1
 .dblite              |     1
 .Debian              |     1
 .deprecated          |     1
 .dia                 |     1
 .dib                 |     1
 .directfb            |     1
 .disable             |     1
 .DJ                  |     1
 .dsoexample          |     1
 .DS_Store            |     1
 .eac3                |     1
 .egg                 |     1
 .erb                 |     1
 .et                  |     1
 .EX_                 |     1
 .export              |     1
 .flv                 |     1
 .FP                  |     1
 .g3pl                |     1
 .GGI                 |     1
 .gitattributes       |     1
 .gmock               |     1
 .GOOGLE              |     1
 .gpd                 |     1
 .gpsd                |     1
 .grp                 |     1
 .gtest               |     1
 .gyps                |     1
 .h261                |     1
 .h263                |     1
 .handlebars          |     1
 .harness             |     1
 .h_old               |     1
 .hunspell            |     1
 .hyphen              |     1
 .i386                |     1
 .ids                 |     1
 .if                  |     1
 .iml                 |     1
 .imm                 |     1
 .init                |     1
 .javascriptcore_pcre |     1
 .jfif                |     1
 .jpe                 |     1
 .JPG                 |     1
 .libbreakpad_osx     |     1
 .libgd               |     1
 .libjpeg             |     1
 .libmozjs            |     1
 .libpng              |     1
 .linux               |     1
 .lock                |     1
 .lpp                 |     1
 .LYNXOS              |     1
 .m2ts                |     1
 .m2v                 |     1
 .Makefiles           |     1
 .map                 |     1
 .maps                |     1
 .md5sum              |     1
 .menu                |     1
 .MINGW32             |     1
 .mini                |     1
 .mirror              |     1
 .MITS                |     1
 .mjpeg               |     1
 .mpeg                |     1
 .naclports           |     1
 .ncb                 |     1
 .NDK                 |     1
 .NeXT                |     1
 .nm                  |     1
 .NONPORTABLE         |     1
 .nonstandard         |     1
 .nsh                 |     1
 .nsi                 |     1
 .o3dtgz              |     1
 .obj                 |     1
 .ods                 |     1
 .odt                 |     1
 .official            |     1
 .Old                 |     1
 .OpenStep            |     1
 .order               |     1
 .os2                 |     1
 .OS2                 |     1
 .ots                 |     1
 .pam                 |     1
 .pdefs               |     1
 .pft                 |     1
 .policy              |     1
 .port                |     1
 .portaudio           |     1
 .ppt                 |     1
 .pro                 |     1
 .project             |     1
 .props               |     1
 .pub                 |     1
 .PY                  |     1
 .pyd                 |     1
 .pyw                 |     1
 .QUAKE               |     1
 .r70                 |     1
 .rl                  |     1
 .rm                  |     1
 .rpc                 |     1
 .see_also            |     1
 .self                |     1
 .server              |     1
 .signatures          |     1
 .sl                  |     1
 .Solaris             |     1
 .source              |     1
 .sql_disable         |     1
 .status              |     1
 .stderr              |     1
 .StyleCop            |     1
 .suo                 |     1
 .supp                |     1
 .swp                 |     1
 .symbols             |     1
 .tar                 |     1
 .tbx                 |     1
 .te                  |     1
 .THREADS             |     1
 .tif                 |     1
 .tml                 |     1
 .trans               |     1
 .TTF                 |     1
 .ui_tests            |     1
 .unit_tests          |     1
 .unix                |     1
 .unknownextension    |     1
 .url                 |     1
 .v8                  |     1
 .vanilla             |     1
 .verifier            |     1
 .vm                  |     1
 .VMS                 |     1
 .vp8                 |     1
 .Watcom              |     1
 .WIN32               |     1
 .WinCE               |     1
 .WINDML              |     1
 .windows             |     1
 .wpr                 |     1
 .wreck               |     1
 .X                   |     1
 .xbm                 |     1
 .xkcd                |     1
 .xls                 |     1
 .xorg                |     1
 .xpm                 |     1
 .ypp                 |     1
 .zlib                |     1

cketant commented 10 years ago

Based on the manual inspections we have done, I believe identifying only C/C++ files is adequate. I would also suggest including .js files, since a lot of the vulnerabilities were associated with JavaScript files as well.

Another way to determine this would be to take a poll: for all vulnerable Filepaths, what is the distribution of file extensions? Based on that, I believe we could make a more reasonable decision. Also, a lot of the code needed to run this already exists.
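A minimal sketch of such a poll in plain Ruby, assuming the vulnerable filepaths are already available as an array of strings. The sample paths below are hypothetical, not taken from the data set. It uses the same `\.\w*$` pattern as the SQL query above, so paths without an extension are counted under `nil`.

```ruby
# Hypothetical sample of vulnerable filepaths; the real list would come
# from the database, not from this array.
vulnerable_paths = [
  "content/browser/renderer_host.cc",
  "content/browser/renderer_host.h",
  "chrome/browser/resources/options.js",
  "net/socket/ssl_client_socket.cc",
  "chrome/chrome_browser.gypi"
]

# Count how many vulnerable filepaths carry each extension.
# Paths with no extension yield nil as the key.
counts = Hash.new(0)
vulnerable_paths.each { |path| counts[path[/\.\w*$/]] += 1 }

# Report each extension's share of the vulnerable set, largest first.
total = vulnerable_paths.size
counts.sort_by { |_ext, n| -n }.each do |ext, n|
  puts format("%-8s %3d  %5.1f%%", ext.inspect, n, 100.0 * n / total)
end
```

Counts rather than a bare set of extensions make it easier to see which file types dominate the vulnerable set before deciding what to call source code.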

andymeneely commented 10 years ago

I agree. Once #104 is fixed, we should get that sample and consider it source code.

andymeneely commented 10 years ago

#104 isn't fixed yet, so this is still preliminary. But here's a query on the vulnerable filepaths in the current data set:

irb(main):016:0> s = Filepath.joins(commit_filepaths: [commit: [code_reviews: :cvenums]]).inject(Set.new){ |acc,rs| acc << rs.filepath[/\.\w*$/]}; s.each{|ext| puts ext}
.h
.js

.cc
.gypi
.grd
.gyp
.txt
.json
.py
.mm
.html
.cfg
.cpp
.css
.S
.chromium
.c
.checksum
.xib
.vcproj
.sln
.proto
.sb
.pbxproj
.vsprops
.conf
.pem
.sed
.scons
.saves
.make
.idl
.sh
.gitignore
.main
.xml
.patch
.port

I'm thinking that the following are source code:

.h
.cc
.js
.py
.S
.c
.make
.sh

Any that I missed?

andymeneely commented 10 years ago

Responding to this comment

Let's add .cpp to the list.

I'm not sure about .sb - could be an audio file, could be a Scratch file. We need to look up an example in the production data.

As for builds, it looks like .gyp and .scons are a lot like make, so they should be included. The .xib file is just an XML definition of an iOS app interface, so that's not source code.

Latest list:

.h
.cc
.js
.py
.S
.c
.make
.sh
.cpp
.gyp
.scons

andymeneely commented 10 years ago

Another thing we need to do is run a query that includes any files WITHOUT an extension. For example, Makefile would be prominent, and it's also source code.

andymeneely commented 10 years ago

Looks like .sb is a Sandbox configuration - definitely source code in nature. Updated list:

.h
.cc
.js
.py
.S
.c
.make
.sh
.cpp
.gyp
.scons
.sb

Non-extension source code: Makefile
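As a rough sketch of how the flag could be computed from this list, assuming plain Ruby outside the Rails models: the helper name `source_code?` and the two constants are hypothetical and are not part of the existing codebase. Matching is case-sensitive, so `.S` counts but `.s` does not, which mirrors the list as written.

```ruby
require "set"

# Extensions from the updated list above (case-sensitive).
SOURCE_EXTENSIONS = Set.new(
  %w[.h .cc .js .py .S .c .make .sh .cpp .gyp .scons .sb]
).freeze

# Extensionless filenames treated as source code.
SOURCE_BASENAMES = Set.new(%w[Makefile]).freeze

# Hypothetical helper: true when the path should be flagged as source code.
def source_code?(filepath)
  # Same pattern as the SQL query: a dot plus word characters at the end.
  ext = filepath[/\.\w*$/]
  return SOURCE_EXTENSIONS.include?(ext) if ext

  # No extension: fall back to the final path component.
  SOURCE_BASENAMES.include?(File.basename(filepath))
end

puts source_code?("base/file_util.cc")         # .cc is on the list
puts source_code?("chrome/app/theme/icon.png") # .png is not
puts source_code?("third_party/zlib/Makefile") # extensionless, but listed
```

Keeping the extensions and the extensionless names in separate sets means later additions to either list are a one-line change.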

As for non-extension files, I ran these queries

SELECT * FROM release_filepaths WHERE thefilepath NOT LIKE '%\.%';
SELECT DISTINCT substring(thefilepath from '\/\w+$') file FROM release_filepaths WHERE thefilepath NOT LIKE '%\.%';

Second query had 126 rows - the only one I would add to the list would be Makefile. Everything else was documentation stuff. DEPS was everywhere but that's not really source code.
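One possible caveat: `NOT LIKE '%\.%'` excludes any path that contains a dot anywhere, so an extensionless file under a dotted directory would not appear in those results. A sketch in plain Ruby that checks only the final path component is below. The sample paths, including the dotted directory name, are hypothetical.

```ruby
# Hypothetical sample paths; the real input would come from release_filepaths.
paths = [
  "third_party/zlib/Makefile",
  "chrome/DEPS",
  "net/DEPS",
  "third_party/some.lib/Makefile",
  "base/logging.cc"
]

# Look only at the final path component, keep names without a dot,
# and collapse duplicates (like SELECT DISTINCT).
extensionless = paths
  .map { |path| File.basename(path) }
  .reject { |name| name.include?(".") }
  .uniq
  .sort

puts extensionless
```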

kaylaerdmann commented 10 years ago

Looking into things more, .idl, .proto, and .mm might be source code in nature.

andymeneely commented 10 years ago

Any more info on those extensions? Can you point me to some examples in Chromium?