LLNL / Surfactant

Modular framework for file information extraction and dependency analysis to generate accurate SBOMs
MIT License

Performance improvements for large filesystems #155

Open mcutshaw opened 8 months ago

mcutshaw commented 8 months ago

Is your feature request related to a problem? Please describe. We have been having performance issues with "very large" filesystems (more than 1 million files).

Describe the solution you'd like A lot of these performance issues seem to stem from iterating through lists that contain each software or relationship entry. If we use Dicts or Sets where applicable, we will reduce the cost of those lookup operations from O(n) to amortized O(1).
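For illustration, here is a minimal sketch of the difference (record shape and field names are hypothetical, not the actual Surfactant types):

```python
from typing import Dict, List, Optional

# Hypothetical software records keyed by sha256 (field names are illustrative).
software_list: List[dict] = [{"sha256": "abc", "name": "libfoo"}]

# O(n) per lookup: scan the whole list every time.
def find_software_linear(sha256: str) -> Optional[dict]:
    for sw in software_list:
        if sw["sha256"] == sha256:
            return sw
    return None

# Amortized O(1) per lookup: build the index once, then use hash lookups.
software_by_sha256: Dict[str, dict] = {sw["sha256"]: sw for sw in software_list}
```

With a million entries, the one-time cost of building the dict is quickly repaid by each subsequent lookup.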

Describe alternatives you've considered None

Additional context Please see https://github.com/mcutshaw/Surfactant/tree/perf_improvements for some possible adjustments. On one filesystem we tested, it reduced a run that took several hours to less than 5 minutes.

nightlark commented 8 months ago

With the changes in the perf_improvements branch, does (de)serializing the dataclasses to/from JSON still result in the same output? The one thing to be careful of is changing the data classes in a way that results in a (CyTRICS) SBOM being output that doesn't pass validation with the (CyTRICS) schema.

If the java file changes make a difference, I'd merge a PR with that change now.

I'm not sure the early return for the find functions is valid -- if I'm understanding the logic, it looks like it will make the search exit as soon as the first non-matching entry is encountered, without considering that any subsequent entries in the various lists could be the one that actually matches; I think the only time the correct result would be returned in this scenario would be if the matching entry is the first one checked.


I've been mulling over decoupling the internal BOM data structures used by Surfactant from the CyTRICS schema -- the current setup definitely makes reading/writing CyTRICS SBOMs easy (json.[load|dump]), but makes other tasks really difficult.

Establishing relationships between files based on the filesystem structure would lend itself well to internally storing things as a graph representation, and should make most of the relationship-establishing algorithms O(n log n) instead of O(n^2) -- in a recent run, establishing relationships for a large number of files took most of the time. It would also let us use graph edges to represent relationships between files, without the somewhat awkward relationship lists.
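As a rough sketch of what graph-backed storage could look like (names are hypothetical, not the actual Surfactant API), edges held in sets make relationship queries constant-time membership tests:

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

# Adjacency keyed by relationship type; each edge is an (xUUID, yUUID) pair.
edges: DefaultDict[str, Set[Tuple[str, str]]] = defaultdict(set)

def add_relationship(xuuid: str, yuuid: str, rel: str) -> None:
    edges[rel.upper()].add((xuuid, yuuid))

def has_relationship(xuuid: str, yuuid: str, rel: str) -> bool:
    # Set membership instead of scanning a relationship list.
    return (xuuid, yuuid) in edges[rel.upper()]
```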

I think this structure would still let us have a lookup table by hash for software entries. Or at least a set to track which hashes have been seen (md5, sha1, sha256), since ideally the common case is that no existing entry has a matching hash.
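A minimal sketch of the seen-hash set idea (the entry shape and function name are hypothetical):

```python
from typing import Set

seen_hashes: Set[str] = set()

def should_add_entry(entry: dict) -> bool:
    """Return True if none of this entry's hashes have been seen; record them."""
    hashes = {entry.get(k) for k in ("md5", "sha1", "sha256")} - {None}
    if hashes & seen_hashes:
        return False  # an entry with a matching hash already exists
    seen_hashes.update(hashes)
    return True
```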

Thoughts or comments?

mcutshaw commented 8 months ago

You are correct. On a second look, I had misunderstood that we were directly serializing the dataclass to JSON. I do believe the lookup table provided the greatest performance improvement of the changes, so I would prefer to discuss ways to implement that, and potentially just tack on the java changes as well.

For a quick fix we could just remove that field from the __dataclass_fields__ attribute like so:

from dataclasses import dataclass, field
from typing import Dict, List, Set

from dataclasses_json import dataclass_json

INTERNAL_FIELDS = {"software_lookup_by_sha256"}

@dataclass_json
@dataclass
class SBOM:
    systems: List[System] = field(default_factory=list)
    hardware: List[Hardware] = field(default_factory=list)
    software: List[Software] = field(default_factory=list)
    relationships: Set[Relationship] = field(default_factory=set)
    analysisData: List[AnalysisData] = field(default_factory=list)
    observations: List[Observation] = field(default_factory=list)
    starRelationships: Set[StarRelationship] = field(default_factory=set)
    software_lookup_by_sha256: Dict = field(default_factory=dict)

    def __post_init__(self):
        # Shadow the class-level __dataclass_fields__ with a filtered copy
        # so internal-only fields are hidden from serialization.
        self.__dataclass_fields__ = {
            k: v for k, v in self.__dataclass_fields__.items()
            if k not in INTERNAL_FIELDS
        }
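For what it's worth, the field-filtering trick can be sanity-checked with a stdlib-only dataclass, since dataclasses.asdict also reads __dataclass_fields__ via the instance; whether dataclasses-json behaves identically would still need checking (the class below is a minimal stand-in, not the real SBOM):

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List

INTERNAL_FIELDS = {"software_lookup_by_sha256"}

@dataclass
class DemoSBOM:  # minimal stand-in for the real SBOM class
    software: List[str] = field(default_factory=list)
    software_lookup_by_sha256: Dict = field(default_factory=dict)

    def __post_init__(self):
        # The instance attribute shadows the class-level __dataclass_fields__,
        # hiding the internal field from anything that reads fields via the instance.
        self.__dataclass_fields__ = {
            k: v for k, v in self.__dataclass_fields__.items()
            if k not in INTERNAL_FIELDS
        }

demo = DemoSBOM(software=["libfoo"])
```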

It doesn't appear that the adjustments to change relationships to Sets, or adding the __hash__ method, affect the serialization.

Does this seem like something you would accept a pull request for, or would you want to wait until the internal structure is reworked?

mcutshaw commented 8 months ago

Also, after looking at the "fail fast" logic I wrote previously, I think we should be alright to implement it if we just replace all of the returns I had with continue: once all_match is set to False, no operation in that loop will set it to True again. We could also clean up the function, as currently we evaluate whether all_match is True and then just return True if it is.

So from:

def has_relationship(
    self, xUUID: str = None, yUUID: str = None, relationship: str = None
) -> bool:
    for rel in self.relationships:
        all_match = True
        if xUUID and rel.xUUID != xUUID:
            all_match = False
        if yUUID and rel.yUUID != yUUID:
            all_match = False
        if relationship and rel.relationship.upper() != relationship.upper():
            all_match = False
        if all_match:
            return True
    return False

To:

def has_relationship(
    self, xUUID: str = None, yUUID: str = None, relationship: str = None
) -> bool:
    for rel in self.relationships:
        if xUUID and rel.xUUID != xUUID:
            continue
        if yUUID and rel.yUUID != yUUID:
            continue
        if relationship and rel.relationship.upper() != relationship.upper():
            continue
        return True
    return False
nightlark commented 8 months ago

Maybe add a brief comment saying return True will only be reached if none of the continue statements are hit? While not uncommon, I suspect some readers might not be as familiar with its behavior compared to the more common break keyword.

For the dataclass changes, I'd mostly want to make sure that serializing sets to/from JSON works for all the versions of Python we're supporting (3.8+) -- and same for __dataclass_fields__.

mcutshaw commented 8 months ago

As for testing against all supported Python versions to verify I don't break the schema: the current tests run by GitHub Actions don't cover validating the test generations against the full schema, correct? Should I proceed with testing manually, or is there a better way?

nightlark commented 8 months ago

I think manually is the best way for now -- there should be some CI tests that try generating a complete SBOM, but there isn't any validation against the CyTRICS schema itself (which technically hasn't been released).

The main thing to look for is that all of the fields that were changed to a "set" can be loaded and saved correctly; testing on at least Python 3.8 and 3.12 would probably be enough for reasonable confidence that things also work on the versions in between. A small script like this could be used to test outside of running Surfactant:

from surfactant.sbomtypes import SBOM

with open("test_sbom.json") as infile:
    SBOM.from_json(infile.read())
    # print some fields in SBOM, make sure the loaded info looks right

# create a test sbom = SBOM() object with some fake data added to the various changed field types
print(sbom.to_json(indent=2))
mcutshaw commented 7 months ago

I used this script and tested across 3.8, 3.10, and 3.12; output files attached. Current branch is https://github.com/mcutshaw/Surfactant/tree/perf2. If this is sufficient, I plan to add some minor additional improvements (java, 1 read for all hashes), then open a pull request.

from surfactant.sbomtypes import SBOM
import platform
vers = platform.python_version()

with open("test_sbom.json") as infile:
    sbom = SBOM.from_json(infile.read())
    print(f"{len(sbom.relationships)} Relationships as sets: ")
    print(sbom.relationships)
    print()

    print("Full SBOM ")
    print(sbom)
    print()

    # print some fields in SBOM, make sure the loaded info looks right

print("SBOM back to JSON")
# create a test sbom = SBOM() object with some fake data added to the various changed field types
print(sbom.to_json(indent=2))
with open(f"test_output_sbom_{vers}.json", 'w') as f:
    f.write(sbom.to_json(indent=2))

3.8_test.log 3.10_test.log 3.12_test.log test_output_sbom_3.8.19.json test_output_sbom_3.10.12.json test_output_sbom_3.12.2.json

nightlark commented 7 months ago

Ah okay, that looks like it will be fine then (I guess dataclasses-json handles support for serializing set).

It looks like the java change is just stylistic -- functions called to the right of the in keyword are only called once.