RFC: Store symbols, DWARFs, package, function names, .js.map, and related

pombredanne commented 1 year ago

We need to design how to store symbols, DWARFs, Java packages or C# namespaces, function names, .js.map data, and related symbols.

We can collect for instance DWARFs in SCTK with https://github.com/nexB/scancode-toolkit/issues/2422

We will soon support analyzing .js.map in SCIO with https://github.com/nexB/scancode.io/issues/650

In SCIO and later in SCTK and PurlDB, we need to have a proper way to store these symbols that are then used for devel_to_deploy mapping, indexed for matching for origin, and so on.

I suggest this approach:

A CodebaseResource (e.g., a file) would have one or more Symbol with these fields:

type: one of path reference, symbol, string, etc.
data_source: a code that describes what is the data source, like .js.map, DWARF, elf, Go pclntab, pdb, source code, etc.
symbol: the value of the symbol as a string
sequence_number: the sequence of this symbol: starting at 1 for the first symbol of this type/data_source found in the file, and incremented in the order these are found.

For instance:

type: "path_reference"
data_source: ".js.map"
symbol: "../../main/src/UiCaroussel.js"
sequence_number: 1
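
As a rough illustration of the fields proposed above, here is a minimal Python sketch. It is not the actual SCIO model: the `Symbol` dataclass and the `number_symbols` helper are hypothetical names, and the `renderCarousel` and `Button.js` values are made-up sample data. It only shows how `sequence_number` could restart at 1 for each type/data_source pair within one file.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Symbol:
    """One symbol attached to a resource; field names follow the proposal."""
    type: str
    data_source: str
    symbol: str
    sequence_number: int


def number_symbols(found):
    """
    Convert (type, data_source, value) tuples, given in discovery order,
    into Symbol objects. Numbering starts at 1 for each (type, data_source)
    pair and increases in the order the values were found.
    """
    counters = defaultdict(lambda: count(1))
    return [
        Symbol(type_, source, value, next(counters[(type_, source)]))
        for type_, source, value in found
    ]


# Hypothetical symbols found in a single file.
symbols = number_symbols([
    ("path_reference", ".js.map", "../../main/src/UiCaroussel.js"),
    ("symbol", ".js.map", "renderCarousel"),
    ("path_reference", ".js.map", "../../main/src/Button.js"),
])
for sym in symbols:
    print(sym)
```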

pombredanne commented 1 year ago

@DennisClark @keshav-space @armijnhemel @mjherzog @chinyeungli ping... feedback welcome! This will be to support various analyses such as devel_to_deploy.

armijnhemel commented 1 year ago

Are you talking about things like function names from ELF files?

pombredanne commented 1 year ago

@armijnhemel yes, this is to store:

function, structure and variable names or strings extracted from an ELF, a Mach-O, PE or COFF binary, or a Go pclntab section, possibly demangled as needed
the same, plus compilation unit paths, from a DWARF debug symbol section or a PDB symbol file
function, structure and variable names, strings, package names, namespaces, modules, interfaces, includes, imports and class names extracted from parsing source code
symbol names, source paths and source content extracted from .js.map and .css.map files
and likely a few other source and binary things of interest.

DennisClark commented 1 year ago

@pombredanne this all sounds good to me, especially if we support the same enhancements to the data structure in all of our AboutCode projects (but of course you already knew that).

chinyeungli commented 1 year ago

@pombredanne I am wondering whether there will be cases with a "duplicated" symbol at different sequence numbers, such as:

type: "path_reference"
data_source: ".js.map"
symbol: "../../main/src/UiCaroussel.js"
sequence_number: 1

type: "path_reference"
data_source: ".js.map"
symbol: "../../main/test/../src/UiCaroussel.js"
sequence_number: 7

If this may exist, how is the tool going to handle it?

pombredanne commented 1 year ago

@chinyeungli you wrote:

I am wondering whether there will be cases with a "duplicated" symbol at different sequence numbers, such as [...] If this may exist, how is the tool going to handle it?

Excellent point! I think that at first we can store all the duplicates, and then we can decide on better ways to handle these later: we could either filter them out before storing them in the db, or have the code that processes them further dedupe them and tag some of them (with a new attribute, TBD later) as duplicated.
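
The "store everything, tag later" option could look roughly like the sketch below. This is only an assumption about one possible shape: the `tag_duplicates` function and the `is_duplicate` attribute are hypothetical names (the thread leaves the attribute TBD), and symbols are represented as plain dicts. It only catches exact repeats of the same type, data_source and symbol value; two paths that differ textually but point to the same file would first need to be normalized.

```python
def tag_duplicates(symbols):
    """
    Return a copy of each symbol dict with an "is_duplicate" flag that is
    True when an earlier symbol has the same (type, data_source, symbol).
    The input list is not modified and no symbol is dropped.
    """
    seen = set()
    tagged = []
    for sym in symbols:
        key = (sym["type"], sym["data_source"], sym["symbol"])
        tagged.append({**sym, "is_duplicate": key in seen})
        seen.add(key)
    return tagged


symbols = [
    {"type": "path_reference", "data_source": ".js.map",
     "symbol": "../../main/src/UiCaroussel.js", "sequence_number": 1},
    {"type": "path_reference", "data_source": ".js.map",
     "symbol": "../../main/src/UiCaroussel.js", "sequence_number": 7},
]
tagged = tag_duplicates(symbols)
print([sym["is_duplicate"] for sym in tagged])
```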

We will likely also need ways to track later how each of these may be further related to actual resources found in the codebase(s): for instance, there is obviously a relationship we can create between a path reference and a file that may exist in the corresponding devel or deploy codebase. Here we likely want to start by resolving the relative parts of the path (as in your example) and then find a matching path in the devel codebase, where "../../main/src/UiCaroussel.js" may be mapped to "super-duper-1.2/webui/frontend/main/src/UiCaroussel.js" with a high confidence.
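
A minimal sketch of that resolve-then-match idea, under stated assumptions: the `reference_segments` and `best_match` helpers are hypothetical, the "backend" path is invented for contrast, and the count of shared trailing path segments stands in for a confidence score. The real mapping logic may well differ.

```python
import posixpath


def reference_segments(reference):
    """Normalize a relative path reference and drop leading '..' and '.' parts."""
    normalized = posixpath.normpath(reference)
    return [seg for seg in normalized.split("/") if seg not in ("..", ".", "")]


def best_match(reference, codebase_paths):
    """
    Return (path, shared_segment_count) for the codebase path that shares
    the longest run of trailing segments with the reference, or (None, 0).
    """
    ref_segments = reference_segments(reference)
    best = (None, 0)
    for path in codebase_paths:
        common = 0
        for ref_seg, path_seg in zip(reversed(ref_segments), reversed(path.split("/"))):
            if ref_seg != path_seg:
                break
            common += 1
        if common > best[1]:
            best = (path, common)
    return best


devel_paths = [
    "super-duper-1.2/webui/backend/src/UiCaroussel.js",  # invented for contrast
    "super-duper-1.2/webui/frontend/main/src/UiCaroussel.js",
]
match = best_match("../../main/test/../src/UiCaroussel.js", devel_paths)
print(match)
```

Both spellings of the reference from the earlier example normalize to the same segments, so they would map to the same devel path.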

pombredanne commented 1 year ago

Some reference issues and pointers to code to collect some of these "symbols":

various parsers in https://github.com/nexB/scancode-plugins/tree/main/binary-analysis/scancode-compiledcode/src/compiledcode for lightweight basic DWARF, ELF, GWT, Java class bytecode, Linux LKMs, Makedepend files, C/C++ includes, tags extraction from source code with CTags. Some of these parsers rely on nm/objdump, readelf, and pyelftools to do their bidding.
some more older code for PDB and Mach-O formats
use of Pygments as a lightweight source code lexer/parser (that can then further be used with pygmars for more advanced analysis)
use of pefile: we already use it to parse Windows assembly metadata and could extend its use to collect symbols from Windows PE/COFF DLLs and exes.
https://github.com/armijnhemel BAT and BANG that can do extensive heavy lifting on the binaries

Some pending issues:

https://github.com/nexB/scancode-toolkit/issues/3140
https://github.com/nexB/scancode-toolkit/issues/2422
https://github.com/nexB/scancode-toolkit/issues/2981
some code we use to pre-process jsmap https://github.com/nexB/scancode-toolkit/blob/c15414bc48868e8cc7d8cd4c54f689411dfeb850/src/textcode/analysis.py#L126
some code for older symbolmap processing in GWT https://github.com/nexB/scancode-toolkit-contrib/blob/ef556c4bb2bfc513f486d5b58d43895c062d44cb/src/compiledcode/gwt.py#L71 (which is to map Java source -> Java bytecode -> minified JS seen in GWT)
WIP on JS devel/deploy mapping https://github.com/nexB/scancode.io/issues/650
WIP on Java devel/deploy mapping https://github.com/nexB/scancode.io/issues/649

armijnhemel commented 1 year ago

There are many things that you can extract from ELF files. For individual symbols I would go for:

binding (local, global, weak, etc.)
section index (so it is easy to recognize imported symbols, which are in index 0)
visibility
type (function, object, file, etc.)
symbol versioning information

An example of symbol versioning information (from readelf output):

    10: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND SSL_load_client_CA_file@OPENSSL_3.0.0 (3)

which could be useful for fingerprinting or dependency analysis: it shows that the symbol should come from a library that provides that particular OpenSSL ABI version, or something close to it.

Not all of the information will be useful for fingerprinting but could be useful for dependency analysis.
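
To make the fields listed above concrete, here is a hedged sketch that parses one line of readelf symbol table output, like the example shown, into a dict. The `parse_symbol_line` function and its output keys are hypothetical. It assumes the usual column order (Num, Value, Size, Type, Bind, Vis, Ndx, Name) and ignores less common variants of the output; a library such as pyelftools would be a more robust source of the same data.

```python
import re

# Columns: Num, Value, Size, Type, Bind, Vis, Ndx, Name (assumed layout).
SYMBOL_LINE = re.compile(
    r"^\s*(?P<num>\d+):\s+(?P<value>[0-9a-fA-F]+)\s+(?P<size>\S+)\s+"
    r"(?P<type>\S+)\s+(?P<binding>\S+)\s+(?P<visibility>\S+)\s+"
    r"(?P<section_index>\S+)\s+(?P<name>\S+)"
)


def parse_symbol_line(line):
    """Return a dict of symbol attributes, or None if the line does not match."""
    match = SYMBOL_LINE.match(line)
    if not match:
        return None
    fields = match.groupdict()
    # "name@VERSION" or "name@@VERSION" carries symbol versioning information.
    name, _, version = fields["name"].partition("@")
    return {
        "name": name,
        "version": version.lstrip("@") or None,
        "type": fields["type"],
        "binding": fields["binding"],
        "visibility": fields["visibility"],
        "section_index": fields["section_index"],
        # Undefined symbols (section index 0, printed as UND) are imported.
        "imported": fields["section_index"] == "UND",
    }


line = "    10: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND SSL_load_client_CA_file@OPENSSL_3.0.0 (3)"
parsed = parse_symbol_line(line)
print(parsed)
```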

In BANG I also compute various checksums of each of the individual sections, extract and process NOTE sections (which can contain provenance information as well), extract various build ids, and so on.

pombredanne commented 1 year ago

@armijnhemel Thanks!

re:

binding (local, global, weak, etc.)

section index (so it is easy to recognize imported symbols, which are in index 0)

visibility

type (function, object, file, etc.)

symbol versioning information

These would be for ELF only, right? But what about other formats, and the source side? I am trying to find out if we can design a general-purpose data structure there. Or maybe the value could be a structure that varies with each file format, and we could store it in a JSON field?

armijnhemel commented 1 year ago

Of course what I described is not for every ELF file. If there are no section headers, but just program headers it becomes more difficult (having section headers makes everything easier). There are also other interesting edge cases, such as compressed ELF binaries (example: UPX) or ELF wrappers around Android Dalvik bytecode (Oat file format, there are several variations).

For source code I typically use ctags (functions, methods, variables, and so on) and xgettext (for strings). What is useful is to store the file name in which things were found and possibly also the line numbers where an identifier can be found (plural, because there could be multiple definitions).
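
One possible shape for that "file name plus line numbers" record is sketched below. It does not call ctags or xgettext: the `build_identifier_index` helper is hypothetical, and the input tuples are made-up sample data standing in for whatever those tools report.

```python
from collections import defaultdict


def build_identifier_index(occurrences):
    """
    Group (identifier, file_name, line_number) tuples into
    {identifier: {file_name: [line_numbers]}}, keeping every occurrence
    because one identifier may be defined in several places.
    """
    index = defaultdict(lambda: defaultdict(list))
    for identifier, file_name, line_number in occurrences:
        index[identifier][file_name].append(line_number)
    return {
        identifier: {name: sorted(lines) for name, lines in files.items()}
        for identifier, files in index.items()
    }


# Made-up occurrences, as a tags extractor might report them.
index = build_identifier_index([
    ("parse_header", "src/elf.c", 120),
    ("parse_header", "src/pe.c", 48),
    ("parse_header", "src/elf.c", 15),
    ("MAX_SECTIONS", "include/limits.h", 7),
])
print(index["parse_header"])
```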

For ELF binaries containing C++ you want to consider either mangling or demangling when comparing source and binary, but maybe not when comparing binary to binary or source to source.

Regarding other file formats: I am assuming you already know enough about Java class files (hint: also look at jimage, as that is different from Java class files: https://hg.openjdk.org/jdk9/jdk9/jdk/file/tip/src/java.base/share/native/libjimage/imageFile.hpp although I am not seeing it very often). In Java there are also flags (static, final, etc.) that could be interesting.

Android Dalvik is an interesting one: the code can be spread out across multiple binary files (classes.dex, classes2.dex, etc.), so you should not only look at a single file. Typically everything is also linked into a single binary (or multiple binaries when spread out as just described), so all of the code of a program plus the dependencies. This is equivalent to static linking for ELF (except that a lot of the method information is somewhat retained, please read on), so it isn't clear how useful it is for fingerprinting, for example. Then, to make things even more complex, the compiler now obfuscates names by default, so there isn't a 1:1 mapping from names in source code to names found in binary code (and you need to use some other methods to map methods in the binary to methods in the source code).

So personally I think that some JSON makes more sense instead of trying to force everything into a single model.

armijnhemel commented 1 year ago

After having spent a few days diving into Java .class files a bit more I am now very much convinced it is impossible to shoehorn everything into a single model. Having some JSON per file type makes more sense. Of course, the challenge here would be how to implement searches, especially if there are nested structures (which are common in Java class files, for example attributes).
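
To illustrate the search challenge with nested structures, here is a small sketch that walks an arbitrary JSON-like value and reports where a given value occurs. Everything in it is an assumption for illustration: the `iter_values` and `find_value` helpers are hypothetical, and the `class_symbols` layout is invented, not a real Java class file schema. A database-backed search would work differently.

```python
def iter_values(data, path=()):
    """Yield (path, value) for every scalar leaf in nested dicts and lists."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from iter_values(value, path + (key,))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from iter_values(value, path + (index,))
    else:
        yield path, data


def find_value(data, wanted):
    """Return the paths of all leaves equal to `wanted`."""
    return [path for path, value in iter_values(data) if value == wanted]


# Invented per-file-type JSON for a Java class, with nested attributes.
class_symbols = {
    "class_name": "com.example.Widget",
    "methods": [
        {
            "name": "render",
            "flags": ["public", "final"],
            "attributes": [{"name": "Code"}, {"name": "Deprecated"}],
        },
    ],
}
paths = find_value(class_symbols, "Deprecated")
print(paths)
```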

aboutcode-org / scancode.io

RFC: Store symbols, DWARFs, package, function names, .js.map, and related #689