aboutcode-org / scancode.io

ScanCode.io is a server to script and automate software composition analysis with ScanPipe pipelines. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB, and other generous sponsors!
https://scancodeio.readthedocs.io
Apache License 2.0
117 stars, 86 forks

Create a pipeline to analyze a JavaScript/TypeScript-based app, including webpacked JS #650

Open pombredanne opened 1 year ago

pombredanne commented 1 year ago

When I use JavaScript in an application, I have a few typical configurations and contexts:

At development time:

At run time:

My problem is to find which subset of the development codebase (the From side) has been deployed (the To side), knowing that the deployed JS may be:

pombredanne commented 1 year ago

In terms of solution we can start with:

  1. Inside the To codebase, relating .jsmap and .cssmap files to their .js or .css files
  2. Inside the To codebase, relating .scss to their .js or .css files
  3. Between the From and the To codebase, mapping .js and .css files on the To side to non-transpiled code on the From side, and assessing how good this mapping can be
  4. Later, if we do not get a correct mapping from paths, we could extend this to introspecting the To .jsmap files to extract the source content and checksum-map it to the corresponding sources on the From side, and/or map using symbols.
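
As a rough illustration of step 1, here is a minimal sketch that pairs each map file in the To codebase with the generated file it describes. It assumes the common `<file>.js.map` / `<file>.css.map` naming convention, which the thread does not state explicitly, and `pair_map_files` is a hypothetical helper, not part of ScanCode.io.

```python
def pair_map_files(paths):
    """Return {map_path: generated_path or None} for every "*.map" path.

    Assumption: a map file sits next to its generated file and is named
    by appending ".map", e.g. "static/app.js.map" for "static/app.js".
    """
    known = set(paths)
    pairs = {}
    for path in paths:
        if not path.endswith(".map"):
            continue
        # "static/app.js.map" -> "static/app.js"
        candidate = path[: -len(".map")]
        pairs[path] = candidate if candidate in known else None
    return pairs


to_paths = ["static/app.js", "static/app.js.map", "static/site.css.map"]
print(pair_map_files(to_paths))
```

Map files whose generated file is missing come back as `None`, so they can be flagged for review rather than silently dropped.
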
pombredanne commented 1 year ago

We have two cases:

  1. Using heuristics, we can match the path of a deployed JS, CSS, or SCSS file from the To-side to the From-side. If the path matches conclusively, we can further validate some paths using the actual content
  2. For CSS and JS map files, we further have a list of paths to the files used and combined/transpiled/compiled in a given CSS or JS file. For example:
    "sources":[
      "../../../../../../src/main/resources/META-INF/resources/style.scss",
      "../../../../../../../../../node_modules/@clayui/css/src/scss/cadmin/variables/_globals.scss"
    ],

    These paths should be traced back and mapped to the From-side OR to third-party external code, like the @clayui/css npm package in this example.
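
A minimal sketch of how such "sources" entries might be classified, assuming that a `node_modules` segment marks third-party npm code and that anything else is a From-side candidate path. `classify_source` and its return shape are hypothetical and only illustrate the idea.

```python
def classify_source(source):
    """Classify one "sources" entry as npm package code or a From-side path."""
    segments = source.split("/")
    # Drop the leading "../" and "./" segments: they only reflect where the
    # map file sat relative to the build root.
    while segments and segments[0] in ("..", ".", ""):
        segments.pop(0)

    if "node_modules" in segments:
        # Use the last "node_modules" so nested dependencies resolve to the
        # innermost package.
        idx = len(segments) - 1 - segments[::-1].index("node_modules")
        rest = segments[idx + 1:]
        # Scoped packages ("@scope/name") span two path segments.
        size = 2 if rest and rest[0].startswith("@") else 1
        return {
            "kind": "package",
            "package": "/".join(rest[:size]),
            "path": "/".join(rest[size:]),
        }
    return {"kind": "from", "path": "/".join(segments)}


print(classify_source(
    "../../../../../../../../../node_modules/@clayui/css/src/scss/cadmin/variables/_globals.scss"
))
```

The stripped From-side path is only a candidate: it would still need to be matched against actual From-side resources, for instance by path suffix.
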

To support this tracing, we likely need to store a list of these paths in the DB and have a way to relate each path to either another From-side resource or a file inside a package, which means at least adding some DiscoveredPackage entries.

Adding these DiscoveredPackages (which may not always be deployed, and so may not be on the To-side at all) could be done this way:

To process map files, we can consider https://github.com/mattrobenolt/python-sourcemap or parse by hand since this is a simple JSON format. Note that webpack may sometimes create weird paths in the js.map files.
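
For the "parse by hand" option, a minimal sketch using only the standard library. It reads the `sources` list and, when the optional `sourcesContent` field is present, computes a checksum per embedded source so it could be compared with From-side files. The choice of SHA1 and the `read_source_map` helper are assumptions for illustration.

```python
import hashlib
import json


def read_source_map(text):
    """Return one {"source", "sha1"} entry per path listed in a source map."""
    data = json.loads(text)
    sources = data.get("sources", [])
    # "sourcesContent" is optional, and individual items may be null.
    contents = data.get("sourcesContent") or []
    entries = []
    for index, source in enumerate(sources):
        content = contents[index] if index < len(contents) else None
        sha1 = None
        if content is not None:
            sha1 = hashlib.sha1(content.encode("utf-8")).hexdigest()
        entries.append({"source": source, "sha1": sha1})
    return entries


example = json.dumps({
    "version": 3,
    "sources": ["../src/style.scss"],
    "sourcesContent": ["body { margin: 0; }"],
    "mappings": "",
})
print(read_source_map(example))
```

A checksum match is only conclusive when the embedded content is byte-identical to the From-side file; differences such as line endings would need normalizing first.
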

pombredanne commented 1 year ago

I suggest that we start with path matching, processing each file one at a time:

If we match the path to an unambiguous package, we would:

  1. create a DiscoveredPackage
  2. relate this somehow to the map file
  3. assign some status or tag that tells us that this package was matched via path because of a map file "sources" reference
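
The three steps above could be sketched as follows. This is a plain-Python illustration only: `MapSourceMatch`, `npm_purl`, and the status string are hypothetical names and do not reflect ScanCode.io's actual DiscoveredPackage model. The Package URL has no version because a "sources" path alone does not provide one.

```python
from dataclasses import dataclass
from urllib.parse import quote


def npm_purl(name):
    """Build a version-less npm Package URL, e.g. "@clayui/css" -> "pkg:npm/%40clayui/css"."""
    # Percent-encode each segment so the "@" of a scoped package becomes "%40".
    return "pkg:npm/" + "/".join(quote(part, safe="") for part in name.split("/"))


@dataclass
class MapSourceMatch:
    """Relates a map file to a package matched through one of its "sources" paths."""
    map_path: str  # the map file holding the reference (hypothetical path below)
    source: str    # the raw "sources" entry
    purl: str      # the package the entry was matched to
    status: str = "matched-by-map-sources-path"  # hypothetical tag value


match = MapSourceMatch(
    map_path="static/style.css.map",
    source="../node_modules/@clayui/css/src/scss/cadmin/variables/_globals.scss",
    purl=npm_purl("@clayui/css"),
)
print(match)
```
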
mjherzog commented 1 year ago

Regarding PurlDB matching, we need to consider that some teams package their own code as private npm packages kept in a private repo, so these will never appear in PurlDB.

pombredanne commented 1 year ago

This is mostly done and operational as steps in the develop_to_deploy pipeline ... so we are just keeping this open to create a JS-only pipeline