aboutcode-org / scancode.io

ScanCode.io is a server to script and automate software composition analysis with ScanPipe pipelines. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB and other generous sponsors!
https://scancodeio.readthedocs.io
Apache License 2.0

RFC: Store symbols, DWARFs, package, function names, .js.map, and related #689

Open pombredanne opened 1 year ago

pombredanne commented 1 year ago

We need to design how to store symbols: DWARFs, Java packages or C# namespaces, function names, .js.map content, and related symbols.

We can collect for instance DWARFs in SCTK with https://github.com/nexB/scancode-toolkit/issues/2422

We will soon support analyzing .js.map in SCIO with https://github.com/nexB/scancode.io/issues/650

In SCIO and later in SCTK and PurlDB, we need to have a proper way to store these symbols that are then used for devel_to_deploy mapping, indexed for matching for origin, and so on.

I suggest this approach:

A CodebaseResource (e.g., a file) would have one or more Symbol with these fields:

For instance:

pombredanne commented 1 year ago

@DennisClark @keshav-space @armijnhemel @mjherzog @chinyeungli ping... feedback welcome! This will support various analyses such as devel_to_deploy.

armijnhemel commented 1 year ago

Are you talking about things like function names from ELF files?

pombredanne commented 1 year ago

@armijnhemel yes, this is to store:

DennisClark commented 1 year ago

@pombredanne this all sounds good to me, especially if we support the same enhancements to the data structure in all of our AboutCode projects (but of course you already knew that).

chinyeungli commented 1 year ago

@pombredanne I am wondering whether there will be cases where the same symbol is "duplicated" at different sequence numbers, such as:

type: "path_reference"
data_source: ".js.map"
symbol: "../../main/src/UiCaroussel.js"
sequence_number: 1

type: "path_reference"
data_source: ".js.map"
symbol: "../../main/test/../src/UiCaroussel.js"
sequence_number: 7

If this may exist, how is the tool going to handle it?

pombredanne commented 1 year ago

@chinyeungli you wrote:

I am wondering whether there will be cases where the same symbol is "duplicated" at different sequence numbers, such as [...] If this may exist, how is the tool going to handle it?

Excellent point! I think that at first we can store all the duplicates and decide on better ways to handle them later: we could either filter them out before storing them in the db, or have the code that processes them further dedupe them and tag some of them (with a new attribute, TBD later) as duplicated.
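A minimal sketch of the "store everything, tag duplicates later" option. The field names come from the example above; the `is_duplicate` attribute and the `tag_duplicates` helper are hypothetical and not an agreed design:

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Symbol:
    # Fields mirror the example in this thread.
    type: str
    data_source: str
    symbol: str
    sequence_number: int
    # Hypothetical attribute standing in for the "TBD" duplicate tag.
    is_duplicate: bool = False


def tag_duplicates(symbols):
    """Keep every symbol, flagging later ones that normalize to a value already seen."""
    seen = set()
    for sym in sorted(symbols, key=lambda s: s.sequence_number):
        value = sym.symbol
        if sym.type == "path_reference":
            # Collapse "dir/.." segments so equivalent paths compare equal.
            value = posixpath.normpath(value)
        key = (sym.type, sym.data_source, value)
        sym.is_duplicate = key in seen
        seen.add(key)
    return symbols
```

With the two `.js.map` entries from the example, the entry at sequence number 7 would be flagged and the one at sequence number 1 would not.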

We will likely also need ways to track later how each of these may be further related to actual resources found in the codebase(s): for instance, there is obviously a relationship we can create between a path reference and a file that may exist in the corresponding devel or deploy codebase. Here we likely want to start by resolving the relative parts of the path (as in your example) and then find a matching path in the devel codebase, where "../../main/src/UiCaroussel.js" may be mapped to "super-duper-1.2/webui/frontend/main/src/UiCaroussel.js" with a high confidence.
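A rough sketch of that resolution step, assuming codebase paths are available as plain POSIX-style strings. The `candidate_paths` helper is hypothetical, and it does not attempt any confidence scoring:

```python
import posixpath


def candidate_paths(path_reference, codebase_paths):
    """Return the codebase paths whose trailing segments match the resolved reference."""
    normalized = posixpath.normpath(path_reference)
    # Leading ".." segments point above the referencing file's directory,
    # which is unknown here, so only the remaining segments are compared.
    segments = [s for s in normalized.split("/") if s not in ("..", ".", "")]
    if not segments:
        return []
    matches = []
    for path in codebase_paths:
        if path.split("/")[-len(segments):] == segments:
            matches.append(path)
    return matches
```

A single match on several trailing segments could reasonably be treated as high confidence, while several matches would need some extra disambiguation.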

pombredanne commented 1 year ago

Some reference issues and pointers to code to collect some of these "symbols":

Some pending issues:

armijnhemel commented 1 year ago

@pombredanne you wrote:

@armijnhemel yes, this is to store:

* function, structure and variable names or strings extracted from an ELF, a Mach-O, PE or COFF binary, or a Go pclntab section, possibly demangled as needed

* the same, plus compilation unit paths, from a DWARF debug symbol section or a PDB symbol file

* function, structure and variable names, strings, package names, namespaces, modules, interfaces, includes, imports and class names extracted from parsing source code

* symbol names and source path and source content extracted from .js.map and .css.map files

* and likely a few other source and binary things of interest.

There are many things that you can extract from ELF files. For individual symbols I would go for:

An example of symbol versioning information (from readelf output):

    10: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND SSL_load_client_CA_file@OPENSSL_3.0.0 (3)

which could be useful for fingerprinting or dependency analysis: it shows that the symbol is undefined in this binary and must be provided by a library that exports the OPENSSL_3.0.0 symbol version.
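As an illustration only, a line like the one above can be split into the symbol fields and the version. This is a sketch that assumes the usual eight-column layout of a readelf symbol table (Num, Value, Size, Type, Bind, Vis, Ndx, Name); the `parse_readelf_symbol` helper and the dictionary keys are hypothetical:

```python
def parse_readelf_symbol(line):
    """Split one readelf symbol table line into a dict of its fields."""
    fields = line.split()
    _num, value, size, sym_type, binding, visibility, section_index, name = fields[:8]
    # "name@VERSION" is a versioned reference; "name@@VERSION" marks a default version.
    name, _, version = name.partition("@")
    return {
        "name": name,
        "version": version.lstrip("@") or None,
        "type": sym_type,
        "binding": binding,
        "visibility": visibility,
        # "UND" means the symbol is imported from another object.
        "section_index": section_index,
        "value": int(value, 16),
        "size": int(size),
    }
```

Parsing text output like this is fragile; reading the symbol table directly with an ELF library would likely be more robust.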

Not all of the information will be useful for fingerprinting but could be useful for dependency analysis.

In BANG I also compute various checksums of each of the individual sections, extract and process NOTE sections (which can contain provenance information as well), extract various build ids, and so on.

pombredanne commented 1 year ago

@armijnhemel Thanks!

re:

  • binding (local, global, weak, etc.)
  • section index (so it is easy to recognize imported symbols, which are in index 0)
  • visibility
  • type (function, object, file, etc.)
  • symbol versioning information

These would be for ELF only, right? But what about other formats, and the source side? I am trying to find out if we can design a general-purpose data structure there. Or maybe the value could be a structure that varies with each file format, stored in a JSON field?

armijnhemel commented 1 year ago

Of course what I described is not for every ELF file. If there are no section headers, but just program headers it becomes more difficult (having section headers makes everything easier). There are also other interesting edge cases, such as compressed ELF binaries (example: UPX) or ELF wrappers around Android Dalvik bytecode (Oat file format, there are several variations).

For source code I typically use ctags (functions, methods, variables, and so on) and xgettext (for strings). What is useful is to store the file name in which things were found and possibly also the line numbers where an identifier can be found (plural, because there could be multiple definitions).
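A small sketch of that idea, assuming the identifiers have already been extracted into (name, file name, line number) tuples by whatever tool is used. The `index_identifiers` helper is hypothetical:

```python
from collections import defaultdict


def index_identifiers(occurrences):
    """Group (name, file_name, line_number) tuples by identifier name."""
    index = defaultdict(list)
    for name, file_name, line_number in occurrences:
        # One identifier may be defined in several places, so keep every location.
        index[name].append((file_name, line_number))
    return dict(index)
```

Each identifier then maps to every location where it was found, which keeps multiple definitions visible instead of overwriting them.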

For ELF binaries containing C++ you want to consider either mangling or demangling when comparing source and binary, but maybe not when comparing binary to binary or source to source.

Regarding other file formats: I am assuming you already know enough about Java class files (hint: also look at jimage, as that is different from Java class files: https://hg.openjdk.org/jdk9/jdk9/jdk/file/tip/src/java.base/share/native/libjimage/imageFile.hpp although I am not seeing it very often). In Java there are also flags (static, final, etc.) that could be interesting.

Android Dalvik is an interesting one: the code can be spread out across multiple binary files (classes.dex, classes2.dex, etc.) so you should not only look at a single file. Typically everything is also linked into a single binary (or multiple binaries when spread as described in the previous sentence), so all of the code of a program plus the dependencies. This is equivalent to static linking for ELF (except that a lot of the method information is somewhat retained, please read on) so it isn't clear how useful it is for fingerprinting, for example. Then, to make things even more complex, the compiler now obfuscates names by default, so there isn't a 1:1 mapping from names in source code to names found in binary code (and you need to use some other methods to map methods in the binary to methods in the source code).

So personally I think that some JSON makes more sense instead of trying to force everything into a single model.

armijnhemel commented 1 year ago

After having spent a few days diving into Java .class files a bit more I am now very much convinced it is impossible to shoehorn everything into a single model. Having some JSON per file type makes more sense. Of course, the challenge here would be how to implement searches, especially if there are nested structures (which are common in Java class files, for example attributes).
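To make the search challenge concrete, here is a minimal in-memory sketch that walks a nested JSON-like structure and reports where a value occurs. The `find_in_json` helper and the shape of the example record are hypothetical; a real implementation would more likely rely on the JSON query support of the storage backend:

```python
def find_in_json(data, wanted, path=()):
    """Yield the key path of every nested value equal to `wanted`."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from find_in_json(value, wanted, path + (key,))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from find_in_json(value, wanted, path + (index,))
    elif data == wanted:
        yield path


# Hypothetical per-file-type payload for a Java class file.
record = {
    "class_name": "Foo",
    "attributes": [{"name": "SourceFile", "value": "Foo.java"}],
    "methods": [{"name": "bar", "flags": ["static", "final"]}],
}
paths = list(find_in_json(record, "static"))
```

Here `paths` holds the location of the `static` flag inside the nested `methods` list, which shows that even a simple lookup has to account for arbitrary nesting.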