corgibytes / freshli-lib

A tool for collecting historical metrics about a project's dependencies
MIT License

Parsing dependencies design #232

Closed mrbiggred closed 2 years ago

mrbiggred commented 3 years ago

This design ticket was spawned from our discussion on how to implement #215 (Create Python Micro-Service for Python Resource Parsing). Writing custom code to parse dependency files is relatively easy for some languages, while for others, like Python, it can be difficult. Languages already have the logic to parse their respective dependency files and it would be great if we could reuse that.

The design problem we need to solve is how custom dependency parsers get hooked into Freshli Core. How would Freshli Core, written in .NET, invoke a function in another language (Python, Rust, etc.) and get the dependencies back?

Which is easier to implement and maintain? Duplicating the dependency parsing code or having mini-services/libs/containers?

The first step we decided on is to figure out the common format that Freshli needs for dependencies: for example, the package name, version, etc. Once that is decided we can do some experiments. Some ideas we could try:

Some things to consider when running the experiments:

To sum up, currently we have two tasks:

1) Figure out the common dependency format Freshli needs.
2) Experiment with different options.

The actual implementation of our design decisions will be tracked in another issue. This issue will be kept for design decisions.

dan-hein commented 3 years ago

Changes in progress can be found here.

Introduction

Here is information regarding the generic implementation side of this idea. The implementation might need to change based on how we decide to approach the service feature. There are three main requirements.

  1. First and foremost, each individual language implementation will need to pass in a parameter or bypass freshli manifest detection if a file is provided. If we are relying on the services in the future, we need to somehow tie in the manifest/framework detection into the generic implementation and somehow know what service to tie that into. I don't have any fleshed-out ideas for this yet, but it is something we need to think about.
  2. For the generic repository system, we need a generic manifest format that Freshli can use to parse and understand for processing. I have an initial version of this sketched out in the branch above. Here is the Python code I used to generate this file:
    import json
    from json import JSONEncoder
    from pathlib import Path

    from pkg_resources import parse_requirements

    class Spec:
        # One version constraint, e.g. ('>=', '3.0.0')
        def __init__(self, spec):
            self.operator = spec[0]
            self.version = spec[1]

    class Dependency:
        def __init__(self, name, allows_prerelease, specs):
            self.name = name
            self.allows_prerelease = allows_prerelease
            self.specs = [Spec(spec) for spec in specs]

    class Manifest:
        def __init__(self, date_updated):
            self.dependencies = []
            self.date_updated = date_updated

    class ComplexEncoder(JSONEncoder):
        # Serialize Manifest, Dependency, and Spec objects via their attributes
        def default(self, obj):
            return obj.__dict__

    p = Path(r'/Users/danhein/PycharmProjects/spaCy/requirements.txt')
    # read_text() closes the file, unlike p.open().readlines()
    install_reqs = list(parse_requirements(p.read_text().splitlines()))
    # Use the file's modification time as the manifest's date
    manifest = Manifest(p.stat().st_mtime)
    for req in install_reqs:
        manifest.dependencies.append(Dependency(req.name, req.specifier.prereleases, req.specs))
    json_dump = json.dumps(manifest, cls=ComplexEncoder)
    with open('test_maifest_1.json', 'a') as of:
        of.write(json_dump)

    Here is an example of data provided by this code:

    {
      "dependencies": [
        {
          "name": "spacy-legacy",
          "allows_prerelease": false,
          "specs": [
            {
              "operator": "<",
              "version": "3.1.0"
            },
            {
              "operator": ">=",
              "version": "3.0.0"
            }
          ]
        },
        {
          "name": "cymem",
          "allows_prerelease": false,
          "specs": [
            {
              "operator": ">=",
              "version": "2.0.2"
            },
            {
              "operator": "<",
              "version": "2.1.0"
            }
          ]
        },
        {
          "name": "hypothesis",
          "allows_prerelease": null,
          "specs": []
        }
      ],
      "date_updated": 1612463560.2651062
    }
  3. Lastly, we need a universal format to represent the actual dependency management system, since Freshli typically queries URLs to retrieve and parse this information. I believe that this can be similar to the format above, but it's been a little trickier to hash out. I have some ideas, but with my current NIH commitment, I've been unable to spend a lot of time on this thread.
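To make the manifest format from item 2 more concrete, here is a minimal sketch of how a consumer might load that JSON and check a candidate version against a dependency's specs. The names `load_manifest`, `Dependency.allows`, and `version_key` are hypothetical, and the version comparison assumes purely numeric, dot-separated versions, which real ecosystems do not guarantee.

```python
import json
import operator
from dataclasses import dataclass, field

# Maps the operator strings seen in the example JSON to comparison functions.
OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def version_key(version):
    """Turn '3.1.0' into (3, 1). Assumes numeric, dot-separated versions."""
    parts = [int(part) for part in version.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()  # drop trailing zeros so '3.1' and '3.1.0' compare equal
    return tuple(parts)


@dataclass
class Dependency:
    name: str
    allows_prerelease: object  # bool or None, as in the example output
    specs: list = field(default_factory=list)  # {"operator", "version"} dicts

    def allows(self, version):
        """True when `version` satisfies every spec (no specs allows anything)."""
        return all(
            OPERATORS[spec["operator"]](
                version_key(version), version_key(spec["version"])
            )
            for spec in self.specs
        )


def load_manifest(text):
    """Parse the manifest JSON and return its dependencies keyed by name."""
    data = json.loads(text)
    return {dep["name"]: Dependency(**dep) for dep in data["dependencies"]}
```

With the example data above, `load_manifest(text)["cymem"].allows("2.0.5")` would be true and `.allows("2.1.0")` false, since the second spec requires a version below 2.1.0.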

Status

I've determined the needs listed above with a few scraps of Python code that have been very useful! I'll include those here. Currently, this idea still needs to have the generic Repository/VersionInfo fleshed out. I have not implemented the source file idea for this yet, but it should be a similar JSON format to the repository format listed above. I'm very open to any ideas on how to accomplish this, however, as I think this will be the project's largest challenge.

Final thoughts

I wanted to get all of these initial ideas down today, so some details may be missing; let me know if you need any clarification! Unfortunately, my commitment to NIH is starting to take over most of my time, so I'm not sure how much effort I will be able to put into this idea. If at any time you have questions, I can definitely make myself available for answers, input, or rubber-ducking 😄

Edited by @mscottford on Thursday, October 7, 2021 to fix link to the branch that contains work-in-progress

mscottford commented 3 years ago

My preference for this is to define a small HTTP/REST API to query the information. The API would receive the text of a dependency manifest, parse that manifest, and look up the information that's required for the Freshli library to do its calculations. The Freshli library would then be configured with the URLs that it needs to communicate with to get the information for a particular manifest file.
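As a rough illustration of the idea, a service that accepts manifest text over HTTP and returns parsed dependencies could look like the sketch below. Everything specific here is an assumption: the `/manifests/parse` route, the response shape, and the deliberately naive requirements-style parser, which handles only simple `name>=1.0,<2.0` lines. It uses only the Python standard library.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Naive patterns for lines such as "cymem>=2.0.2,<2.1.0"; real requirements
# files allow far more syntax (extras, markers, URLs) than this handles.
NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)$")
SPEC_RE = re.compile(r"(===|==|~=|!=|<=|>=|<|>)\s*([^,;\s]+)")


def parse_manifest_text(text):
    """Return the dependencies found in requirements-style text."""
    dependencies = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = NAME_RE.match(line)
        if not match:
            continue  # skip lines this sketch does not understand
        name, rest = match.groups()
        specs = [
            {"operator": op, "version": ver} for op, ver in SPEC_RE.findall(rest)
        ]
        dependencies.append({"name": name, "specs": specs})
    return {"dependencies": dependencies}


class ManifestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical route; the thread does not define a URL scheme.
        if self.path != "/manifests/parse":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        payload = json.dumps(parse_manifest_text(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Create (but do not start) the server; port 0 lets the OS pick a port."""
    return HTTPServer(("127.0.0.1", port), ManifestHandler)
```

Calling `make_server(8080).serve_forever()` would start it locally; a client would then POST the raw manifest text to the route and read the JSON response.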

This creates some choice around how to implement each of those REST APIs. One approach would be to host those APIs on the public internet. Another would be to launch them via Docker containers before the library is asked to do its work.

For either hosting approach, we'd also have a choice in how we build out the REST API. We could stick with .NET Core for the API creation, and then call a small command line application that's written in the supported language ecosystem. That command line application would then become a dependency of that REST API. If we make the output of each command line application consistent, then we could use a single REST API codebase to handle these requests.
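The "consistent output" contract between the API and each command-line parser might be sketched as follows: the caller writes the manifest text to the parser's stdin and expects a JSON document with a `dependencies` key on stdout. The thread proposes .NET Core for the API itself; Python is used here only for illustration, and `run_parser`, the stdin/stdout convention, and the stand-in `FAKE_PARSER` command are all assumptions.

```python
import json
import subprocess
import sys


def run_parser(command, manifest_text, timeout=30):
    """Feed manifest text to a parser CLI on stdin and decode its JSON stdout."""
    completed = subprocess.run(
        command,
        input=manifest_text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    result = json.loads(completed.stdout)
    if "dependencies" not in result:
        raise ValueError("parser output is missing the 'dependencies' key")
    return result


# Stand-in for a real ecosystem-specific parser: a one-line Python program
# that treats every non-empty stdin line as a dependency name.
FAKE_PARSER = [
    sys.executable,
    "-c",
    "import sys, json; "
    "print(json.dumps({'dependencies': "
    "[{'name': l.strip(), 'specs': []} for l in sys.stdin if l.strip()]}))",
]
```

Because the caller only depends on the output shape, swapping `FAKE_PARSER` for a parser written in another ecosystem's language would not require changes to `run_parser`.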

For packaging and delivery, the REST API plus supported command-line applications could be packaged as Docker containers so that people can run them locally.

For ease of use, the freshli-cli project could have the responsibility for spinning up the containers that are needed, so that consumers of the CLI project would not need to worry about the availability of an API.

Assumptions: This approach assumes the following.

  1. Access to the scanned codebase is not required to determine which files contain dependency information.
  2. Dependency information is stored in a small number of files (likely one) and these files are small enough that sending them via HTTP will not create significant performance issues.

Edit: Adds assumptions

mscottford commented 2 years ago

This responsibility now lives in Freshli-CLI. CycloneDX has been selected as a consistent intermediate file format.
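For readers unfamiliar with CycloneDX, the sketch below builds a minimal CycloneDX-style JSON document from a mapping of package names to exact versions. How Freshli-CLI actually produces or consumes these files is not described in this thread; the `to_cyclonedx` helper, the choice of spec version, and the PyPI package URLs are assumptions for illustration only.

```python
import json


def to_cyclonedx(pinned):
    """Build a minimal CycloneDX-style BOM dict from {name: exact_version}."""
    components = []
    for name, version in sorted(pinned.items()):
        components.append({
            "type": "library",
            "name": name,
            "version": version,
            # Package URL for PyPI: lower-cased name, underscores as dashes.
            "purl": f"pkg:pypi/{name.lower().replace('_', '-')}@{version}",
        })
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.4",  # assumed; use whichever version the tooling targets
        "version": 1,
        "components": components,
    }


# Example versions are made up; they only show the output shape.
bom_json = json.dumps(to_cyclonedx({"cymem": "2.0.5"}), indent=2)
```

Note that a format like this records resolved versions, whereas the earlier manifest sketch in this thread records version ranges.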