Closed andymeneely closed 10 years ago
Based on the manual inspections we have done, I believe identifying only C/C++ files is adequate. I would also suggest including .js files in the mix, since a lot of the vulnerabilities were associated with JavaScript files as well.
Another way to determine this would be to take a poll: for all vulnerable Filepaths, what is the distribution of file extensions? Based on that, I believe we would be able to reach a more reasonable solution. Also, a lot of the code to do this already exists.
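A minimal sketch of that poll in plain Ruby, assuming the vulnerable file paths have already been pulled out of the database into an array. The sample paths and the extension_counts helper are hypothetical, for illustration only; the regex is the same one used in the irb query in this thread.

```ruby
# Hypothetical sample of vulnerable file paths; real data would come from
# the Filepath records in the database.
SAMPLE_PATHS = [
  "src/foo.cc",
  "src/foo.h",
  "src/bar.cc",
  "ui/widget.js",
  "build/all.gyp",
  "tools/Makefile"
].freeze

# Count how many paths carry each extension. Paths with no extension
# (for example Makefile) are counted under the nil key.
def extension_counts(paths)
  paths.each_with_object(Hash.new(0)) do |path, counts|
    counts[path[/\.\w*$/]] += 1
  end
end

# Print extensions from most to least frequent.
extension_counts(SAMPLE_PATHS)
  .sort_by { |ext, count| [-count, ext.to_s] }
  .each { |ext, count| puts "#{ext || '(none)'}\t#{count}" }
```

Counting rather than just collecting a Set gives a sense of which extensions matter most, which might help decide borderline cases.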
I agree. Once #104 is fixed, we should get that sample and consider it source code.
irb(main):016:0> s = Filepath.joins(commit_filepaths: [commit: [code_reviews: :cvenums]]).inject(Set.new){ |acc,rs| acc << rs.filepath[/\.\w*$/]}; s.each{|ext| puts ext}
.h
.js
.cc
.gypi
.grd
.gyp
.txt
.json
.py
.mm
.html
.cfg
.cpp
.css
.S
.chromium
.c
.checksum
.xib
.vcproj
.sln
.proto
.sb
.pbxproj
.vsprops
.conf
.pem
.sed
.scons
.saves
.make
.idl
.sh
.gitignore
.main
.xml
.patch
.port
I'm thinking that the following are source code:
.h
.cc
.js
.py
.S
.c
.make
.sh
Any that I missed?
Let's add .cpp to the list.
I'm not sure about .sb - could be an audio file, could be a Scratch file. We need to look up an example in the production data.
As for builds, it looks like .gyp and .scons are a lot like make, so they should be included. The .xib file is just an XML definition of an iOS app interface, so that's not source code.
Latest list:
.h
.cc
.js
.py
.S
.c
.make
.sh
.cpp
.gyp
.scons
Another thing we need to do is to run this query to include any files WITHOUT an extension. For example, Makefile would be prominent, and it's also source code.
Looks like .sb is a Sandbox configuration - definitely source code in nature. Updated list:
.h
.cc
.js
.py
.S
.c
.make
.sh
.cpp
.gyp
.scons
.sb
Non-extension source code: Makefile
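For illustration, the list above could be turned into a simple check like the one below. This is only a sketch under assumptions: the source_code? helper and the constant names are hypothetical and not part of the existing codebase, and matching is case-sensitive (so .S counts but .s does not), which may or may not be what we want.

```ruby
require "set"

# Extensions from the updated list in this thread.
SOURCE_EXTENSIONS = Set.new(
  %w[.h .cc .js .py .S .c .make .sh .cpp .gyp .scons .sb]
).freeze

# File names without an extension that still count as source code.
SOURCE_BASENAMES = Set.new(%w[Makefile]).freeze

# Returns true when the path looks like source code under the list above.
def source_code?(filepath)
  basename = File.basename(filepath)
  return true if SOURCE_BASENAMES.include?(basename)

  # Same extension regex as the irb query; yields nil when there is no extension.
  SOURCE_EXTENSIONS.include?(basename[/\.\w*$/])
end
```

A check like this could be used to flag each filepath as source code or not, and the two sets are easy to extend if .idl, .proto, or .mm end up on the list.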
As for non-extension files, I ran these queries:
SELECT * FROM release_filepaths WHERE thefilepath NOT LIKE '%\.%';
SELECT DISTINCT substring(thefilepath from '\/\w+$') file FROM release_filepaths WHERE thefilepath NOT LIKE '%\.%';
Second query had 126 rows - the only one I would add to the list would be Makefile. Everything else was documentation stuff. DEPS was everywhere but that's not really source code.
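The same check could be mirrored in plain Ruby, for anyone who wants to run it against an array of paths instead of the database. This is a rough sketch: the extensionless_names helper is hypothetical, and unlike the SQL it drops paths that have no slash at all rather than returning a NULL row for them.

```ruby
# Returns the distinct trailing names (with leading slash) of paths that
# contain no dot anywhere, roughly mirroring the two SQL queries above.
def extensionless_names(paths)
  paths
    .reject { |path| path.include?(".") }  # like: NOT LIKE '%\.%'
    .map { |path| path[%r{/\w+$}] }        # like: substring(... from '\/\w+$')
    .compact                               # drop paths where the regex found nothing
    .uniq                                  # like: SELECT DISTINCT
end
```

Note that, as in the SQL, a dot anywhere in the path (for example in a directory name) excludes the file, so this may undercount extension-less files.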
Looking into things more, .idl, .proto, and .mm might be source code in nature.
Any more info on those extensions? Can you point me to some examples in Chromium?
For our final analysis of aggregating metrics over filepaths, we only care about source code files. We care about other types of files as a part of our metrics, but ultimately we want to compare the same languages of files against each other.
Documentation files like html, png, etc. don't count as source code. So we'll need to flag the filepath in the database as either source code or not. But first, we need to define the extensions we consider to be source code:
I ran this query on the database to get all file extensions:
I got these results. Now clearly most of these are NOT source code. We can limit our study to the basic C and C++ code, but if we have vulnerabilities in other languages maybe that won't work.