andymeneely / httpd-history

An historical analysis of HTTPD and its vulnerabilities

Add ComponentChurn to the GitLogFiles table #27

Closed andymeneely closed 11 years ago

andymeneely commented 11 years ago

Update the GitLogFiles table so that each file also carries the aggregated churn metrics for that file's component. A "component" is roughly the directory that a file is in, though what counts as a component varies depending on what the directory means.

You'll need to infer which module a given file is in. This might take a bit of research into Apache's architecture, but from a cursory glance it looks like these are the main modules:

Not sure how to handle srclib

Not sure how to handle experimental. Is that one module? Things migrate out of there into their own modules eventually anyway.

Not sure how to handle includes for this one, though - so figure that out.

Related to Issue #5 and #21.

amusa commented 11 years ago

Couple of issues:

  1. From my research, httpd modules are divided into 'core feature modules' (core, mpm_common, prefork, worker, etc.) and 'other modules' (mod_actions, mod_alias, mod_auth_basic, etc.), and almost all of these modules consist of a single file.
  2. Modules are categorized according to 'status' (Core, MPM, Base, Extension, Experimental), which is an indication of 'how tightly bound into the Apache Web server the directive is'. If Git can provide the status of a file, we could possibly use the 'status'.
  3. Is 'ComponentChurn' also aggregated over the last 30 days (for example), or is it just a total aggregate?

andymeneely commented 11 years ago

Hm - that's a bit unexpected. I knew they had lots of modules, but I didn't realize that they were all literally one source file. Are you sure? When I browse the directory tree, for example, I see subdirectories under modules like "http" and "ldap" that have multiple source files. Maybe we can consider that to be one component?

Is there any way to tell if a module is in the core or not? Is it in the filepath? Do they record it on their website? Git doesn't keep design information, just file information.

Yes, component churn is also aggregated over the last 30 days (configurable, as before).

amusa commented 11 years ago

Yes, I am sure! For example, a module like 'mod_alias' has one source file, 'mod_alias.c' (http://httpd.apache.org/docs/2.2/mod/mod_alias.html). Certainly, what they call 'modules' is not entirely consistent with the file structure. If you search 'gitlogfiles' for files containing 'mod_alias', for example, you'll see the following files: 'modules/mappers/mod_actions.c', 'modules/mappers/mod_alias.c', 'modules/mappers/mod_dir.c', 'modules/mappers/mod_imagemap.c', 'modules/mappers/mod_imap.c', 'modules/mappers/mod_negotiation.c', 'modules/mappers/mod_rewrite.c', 'modules/mappers/mod_rewrite.h', 'modules/mappers/mod_so.c', 'modules/mappers/mod_so.h', 'modules/mappers/mod_speling.c', 'modules/mappers/mod_userdir.c', 'modules/mappers/mod_vhost_alias.c', 'modules/mappers/mod_watchdog.c', 'modules/mappers/mod_watchdog.h'.

By our assessment, these all belong to the 'mappers' module, but 'mod_alias' is considered a module of its own.

So maybe we can use our classification and call them Components in order to differentiate, or we can do more research to reach a common understanding.

About the component churn: in that case, files in the same folder will all have the same churn value. If so, it is better to keep the metric in a separate table for components and join that table with gitlogfiles.
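The separate-table idea can be sketched as follows. This is only an illustration using an in-memory SQLite database: the `componentchurn` table, the `component` and `recent_churn` columns, and all values are made up, and `gitlogfiles` is simplified to one row per file rather than the project's real schema.

```python
import sqlite3

# In-memory database purely for illustration; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gitlogfiles (filepath TEXT, component TEXT);
    CREATE TABLE componentchurn (component TEXT PRIMARY KEY, recent_churn INTEGER);

    INSERT INTO gitlogfiles VALUES
        ('modules/mappers/mod_alias.c',   'modules/mappers'),
        ('modules/mappers/mod_rewrite.c', 'modules/mappers'),
        ('server/core.c',                 'server');

    -- The churn value is stored once per component, not once per file.
    INSERT INTO componentchurn VALUES ('modules/mappers', 120), ('server', 45);
""")

# Every file picks up its component's churn through the join.
rows = conn.execute("""
    SELECT f.filepath, c.recent_churn
    FROM gitlogfiles AS f
    LEFT JOIN componentchurn AS c ON c.component = f.component
    ORDER BY f.filepath
""").fetchall()

for filepath, churn in rows:
    print(filepath, churn)
```

The LEFT JOIN keeps files whose component has no churn row, which would make files without a component easy to spot as NULLs.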

andymeneely commented 11 years ago

Ok, then let's step back and examine what we're trying to do.

We're trying to group similar files together according to the system architecture so that we can identify recent, related code churn for a given file. For example, maybe there have been a lot of recent commits to HTTP packet parsing modules, but not to this file, and yet this file is still at risk of introducing a vulnerability. If we make our grouping just one file, then that makes no sense, because RecentComponentChurn would be very close to just RecentChurn.

So is there some kind of grouping (and maybe there isn't one) where we can do this? Maybe just a binary grouping of Core or Module? Or maybe there's a logical grouping of modules that we can discern? Is, say, mappers a logical grouping, or should that just be ignored?
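To make the difference between the two metrics concrete, here is a rough sketch. The record layout, the `recent_churn` helper, the dates, and the line counts are all hypothetical; only the 30-day configurable window comes from the discussion above.

```python
from datetime import date, timedelta

# Hypothetical change records: (filepath, component, commit date, lines changed).
CHANGES = [
    ("modules/http/http_protocol.c", "modules/http", date(2013, 1, 10), 40),
    ("modules/http/http_request.c", "modules/http", date(2013, 1, 15), 25),
    ("modules/http/http_core.c", "modules/http", date(2012, 11, 1), 90),
    ("modules/mappers/mod_alias.c", "modules/mappers", date(2013, 1, 12), 5),
]


def recent_churn(changes, as_of, days=30, *, filepath=None, component=None):
    """Sum lines changed in the `days` before `as_of`, filtered by file or component."""
    start = as_of - timedelta(days=days)
    total = 0
    for path, comp, when, lines in changes:
        if not (start < when <= as_of):
            continue  # outside the window
        if filepath is not None and path != filepath:
            continue
        if component is not None and comp != component:
            continue
        total += lines
    return total


as_of = date(2013, 1, 19)
# The file itself has been quiet for the last 30 days...
file_churn = recent_churn(CHANGES, as_of, filepath="modules/http/http_core.c")
# ...but the component it belongs to has not.
component_churn = recent_churn(CHANGES, as_of, component="modules/http")
print(file_churn, component_churn)
```

With a one-file grouping the two calls would filter on the same rows, which is why the grouping has to be coarser than a single file for the metric to add anything.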

amusa commented 11 years ago

In that case I think we should go with the package grouping as it is in the httpd directory.
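A minimal sketch of what directory-based grouping could look like. The exact rule is an assumption, not something settled in this thread: here files under `modules` are grouped by their subdirectory (e.g. `modules/mappers`), and everything else is grouped by its top-level directory. The `component_of` name and the `nested_roots` parameter are hypothetical.

```python
from pathlib import PurePosixPath


def component_of(filepath, nested_roots=("modules",)):
    """Map a repository-relative file path to a component name (assumed rule)."""
    parts = PurePosixPath(filepath).parts
    if len(parts) < 2:
        return None  # file at the repository root: no directory to group by
    if parts[0] in nested_roots and len(parts) >= 3:
        # Under 'modules', treat each subdirectory as its own component.
        return "/".join(parts[:2])
    # Otherwise group by the top-level directory, however deep the file is.
    return parts[0]


print(component_of("modules/mappers/mod_alias.c"))
print(component_of("server/mpm/dexter/scoreboard.c"))
```

Whether directories such as srclib or experimental should be split the same way as `modules` is left open here; adding them to `nested_roots` would be one way to try it.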

amusa commented 11 years ago

This is partially completed (95%). The dbverify reports 618 files without component. The query needs to be fine tuned to update up to the highest file path of the component. E.g.:

Filepath, Component
server/mpm/mpmt_pthread/scoreboard.c, server
server/mpm/mpmt_beos/scoreboard.c, server
server/mpm/dexter/scoreboard.h, server
server/mpm/dexter/scoreboard.c, server

amusa commented 11 years ago

Done and awaiting testing.