aaronkaplan / cti-llm

An LLM for CTI reports - to be presented at FIRST Fukuoka 2024

classes architecture and packages and taxonomy #14

Open priamai opened 6 months ago

priamai commented 6 months ago

Hi there, as we discussed: let's simplify the class structure by getting rid of the Factory and keeping one abstract class.

What about this: we can have the following abstract classes:
a) one class for NER: it will abstract spacy, flair, heuristics, spacy-llm, transformers, etc.
b) one class for summarization: it will abstract the summarization queries
c) one class for ER: it will abstract the relationship extraction
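To make the proposal concrete, here is a minimal sketch of what the three abstract classes could look like. All names (`Entity`, `Relation`, `EntityExtractor`, `Summarizer`, `RelationExtractor`, `RegexEntityExtractor`) are hypothetical and not taken from the repo; the regex backend is only a toy stand-in for the "heuristics" case.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One extracted span; `label` is meant to hold a STIX-style tag."""
    text: str
    label: str
    start: int
    end: int


@dataclass(frozen=True)
class Relation:
    """A directed link between two extracted entities."""
    source: Entity
    target: Entity
    relationship_type: str


class EntityExtractor(ABC):
    """a) NER: one interface for spacy, flair, heuristics, spacy-llm, transformers."""

    @abstractmethod
    def extract(self, text: str) -> list[Entity]:
        ...


class Summarizer(ABC):
    """b) Summarization: hides how the summarization query is issued."""

    @abstractmethod
    def summarize(self, text: str) -> str:
        ...


class RelationExtractor(ABC):
    """c) ER: relationship extraction over already extracted entities."""

    @abstractmethod
    def extract_relations(self, text: str, entities: list[Entity]) -> list[Relation]:
        ...


class RegexEntityExtractor(EntityExtractor):
    """Toy heuristic backend (assumption): one regular expression per label."""

    def __init__(self, patterns: dict[str, str]) -> None:
        self._patterns = {label: re.compile(rx) for label, rx in patterns.items()}

    def extract(self, text: str) -> list[Entity]:
        found = []
        for label, rx in self._patterns.items():
            for m in rx.finditer(text):
                found.append(Entity(m.group(0), label, m.start(), m.end()))
        return found
```

With this shape, the Django application would only depend on the three interfaces, and each backend can be swapped without a Factory.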

Each extracted entity and relationship should be based on the STIX 2.1 taxonomy.

The pip package will then be used by the Django application.

aaronkaplan commented 6 months ago

Sounds good to me.

priamai commented 6 months ago

@Brandl, we should change all our tag names to STIX notation.

For example this:

# Define patterns
patterns = {
    "IPV4_ADDRESS": [{"TEXT": {"REGEX": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"}}],
    # IPv6 addresses are complex and may need a more sophisticated pattern
    "DOMAIN": [{"TEXT": {"REGEX": r"([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}"}}],
    "URL": [{"TEXT": {"REGEX": r"(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*/?"}}],
    "EMAIL_ADDRESS": [{"TEXT": {"REGEX": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"}}],
    "MD5_HASH": [{"TEXT": {"REGEX": r"\b[a-f0-9]{32}\b"}}],
    "SHA1_HASH": [{"TEXT": {"REGEX": r"\b[a-f0-9]{40}\b"}}],
    # SHA256, FILE_PATH, and others can be added similarly
    "CVE_ID": [{"TEXT": {"REGEX": r"CVE-\d{4}-\d{4,}"}}],
    # This matches too often, could be a problem later:
    "PORT_NUMBER": [{"TEXT": {"REGEX": r"\b\d{1,5}\b"}}],
    # REGISTRY_KEY pattern might need customization based on the specifics
}

Should instead look like:

# Define patterns
patterns = {
    "IPv4Address.value": [{"TEXT": {"REGEX": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"}}]
}

I am changing my code to follow that notation, which will make it very easy to tag and convert to STIX.
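As a rough illustration of why the `ClassName.property` notation makes conversion easy, the sketch below splits the tag and builds a partial STIX-like dictionary from a match. The `STIX_TYPES` mapping and both function names are assumptions for illustration. It uses plain `re` instead of a spaCy matcher, and it omits fields such as `id` and `spec_version` that a complete STIX object would carry.

```python
import re

# Assumed mapping from the class-style names used in the tags to
# STIX 2.1 cyber-observable type names.
STIX_TYPES = {
    "IPv4Address": "ipv4-addr",
    "DomainName": "domain-name",
    "EmailAddress": "email-addr",
    "URL": "url",
}

# Same regex as in the comment above, keyed by the new tag notation.
PATTERNS = {
    "IPv4Address.value": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
}


def tag_to_stix(tag: str, matched_text: str) -> dict:
    """Turn a 'ClassName.property' tag plus the matched text into a partial STIX-like dict."""
    class_name, _, prop = tag.partition(".")
    if not prop:
        raise ValueError(f"tag {tag!r} is not in 'ClassName.property' form")
    return {"type": STIX_TYPES[class_name], prop: matched_text}


def extract_observables(text: str, patterns: dict = PATTERNS) -> list:
    """Run every pattern over the text and convert each match via its tag."""
    return [
        tag_to_stix(tag, m.group(0))
        for tag, rx in patterns.items()
        for m in re.finditer(rx, text)
    ]
```

The old tags such as `IPV4_ADDRESS` would need an extra lookup table to reach the same result, which is the step the new notation removes.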

Let me know!

priamai commented 6 months ago

Also, regarding relationship mappings with STIX, this is what we should do in the UI for the user:

There are 3 common relationship types for the generic SRO (duplicate-of, derived-from, related-to) and then 21 specific ones between objects.

A sighting should be allowed only between one or more dates, a count entity and an indicator.
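A minimal sketch of how the UI could enforce these rules is below. The function names and the `INDICATOR`, `DATE` and `COUNT` labels are hypothetical. The table of specific relationships is deliberately an illustrative subset with a single entry, not the full list. The same-type restriction on duplicate-of and derived-from reflects my reading of the STIX 2.1 common relationships table and should be checked against the spec.

```python
from collections import Counter

# The three common relationship types named above.
COMMON_RELATIONSHIPS = {"duplicate-of", "derived-from", "related-to"}

# Assumption: per the STIX 2.1 common relationships table, these two
# require source and target to be of the same type.
SAME_TYPE_ONLY = {"duplicate-of", "derived-from"}

# Illustrative subset only: (source type, relationship_type, target type).
SPECIFIC_RELATIONSHIPS = {
    ("indicator", "indicates", "malware"),
}


def is_relationship_allowed(source_type: str, relationship_type: str, target_type: str) -> bool:
    """Return True if the UI should offer this relationship between the two objects."""
    if relationship_type in COMMON_RELATIONSHIPS:
        if relationship_type in SAME_TYPE_ONLY:
            return source_type == target_type
        return True
    return (source_type, relationship_type, target_type) in SPECIFIC_RELATIONSHIPS


def is_sighting_allowed(labels: list) -> bool:
    """Allow a sighting only for one indicator, one count and at least one date."""
    counts = Counter(labels)
    if set(counts) - {"INDICATOR", "DATE", "COUNT"}:
        return False  # some other entity type was selected
    return counts["INDICATOR"] == 1 and counts["COUNT"] == 1 and counts["DATE"] >= 1
```

The UI could call `is_relationship_allowed` when the user picks two entities and hide any relationship type for which it returns False.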

1.2.4 STIX Relationships

A relationship is a link between STIX Domain Objects (SDOs), STIX Cyber-observable Objects (SCOs), or between an SDO and a SCO that describes the way in which the objects are related. Relationships can be represented using an external STIX Relationship Object (SRO) or, in some cases, through certain properties which store an identifier reference that comprises an embedded relationship, (for example the created_by_ref property).

The generic STIX Relationship Object (SRO) is one of two SROs and is used for most relationships in STIX. This generic SRO contains a property called relationship_type to describe more specifically what the relationship represents. This specification defines a set of known terms to use for the relationship_type property between SDOs of specific types. For example, the Indicator SDO defines a relationship from itself to Malware via a relationship_type of indicates to describe how the Indicator can be used to detect the presence of the corresponding Malware. In addition to the terms defined in the specification, STIX also allows for user-defined terms to be used as the relationship type.

Currently the only other SRO (besides a generic Relationship) is the Sighting SRO. The Sighting object is used to capture cases where an entity has "seen" an SDO, such as sighting an indicator. Sighting is a separate SRO because it contains additional properties such as count that are only applicable to Sighting relationships. Other SROs may be defined in future versions of STIX if new relationships are identified that also require additional properties not present on the generic Relationship object.
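To show the difference between the two SROs described above, here is a small sketch that builds both as plain dictionaries with the standard library only. The helper names are hypothetical. In practice a STIX library would probably be used instead, and the referenced IDs would come from real objects.

```python
import uuid
from datetime import datetime, timezone


def _timestamp() -> str:
    """UTC timestamp with millisecond precision, e.g. 2024-01-01T12:00:00.000Z."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"


def _common(stix_type: str) -> dict:
    """Properties shared by both SROs."""
    ts = _timestamp()
    return {
        "type": stix_type,
        "spec_version": "2.1",
        "id": f"{stix_type}--{uuid.uuid4()}",
        "created": ts,
        "modified": ts,
    }


def make_relationship(source_ref: str, relationship_type: str, target_ref: str) -> dict:
    """Generic SRO: relationship_type says what the link means."""
    obj = _common("relationship")
    obj.update(
        relationship_type=relationship_type,
        source_ref=source_ref,
        target_ref=target_ref,
    )
    return obj


def make_sighting(sighting_of_ref: str, count: int, first_seen=None, last_seen=None) -> dict:
    """Sighting SRO: carries extra properties such as count."""
    obj = _common("sighting")
    obj["sighting_of_ref"] = sighting_of_ref
    obj["count"] = count
    if first_seen is not None:
        obj["first_seen"] = first_seen
    if last_seen is not None:
        obj["last_seen"] = last_seen
    return obj
```

For example, `make_relationship(indicator_id, "indicates", malware_id)` corresponds to the Indicator-to-Malware case in the quoted text, while `make_sighting(indicator_id, count=3)` records that the indicator was seen three times.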