OpenCTI-Platform / connectors

OpenCTI Connectors
https://www.opencti.io
Apache License 2.0

[ImportFilePDFStix] Create the connector #304

Closed by SamuelHassine 1 year ago

SamuelHassine commented 3 years ago

Problem to Solve

When a PDF file corresponding to a report is uploaded, this connector should be able to extract STIX knowledge from it using NLP.
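To make the goal concrete, here is a minimal sketch of the target shape: report text in, STIX 2.1 bundle out. It assumes the PDF text has already been extracted, and it uses plain regular expressions as a stand-in for the NLP step, so it only finds IPv4 addresses and SHA-256 hashes. The function names `extract_observables` and `to_stix_bundle` are hypothetical and are not part of any existing connector.

```python
import json
import re
import uuid
from datetime import datetime, timezone

# Illustrative patterns only; a real extractor would cover many more types
# and would use NLP for entities and relationships, not just regexes.
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")


def extract_observables(text):
    """Return de-duplicated (kind, value) pairs in first-seen order."""
    found = [("ipv4", value) for value in IPV4_RE.findall(text)]
    found += [("sha256", value.lower()) for value in SHA256_RE.findall(text)]
    return list(dict.fromkeys(found))


def to_stix_bundle(observables):
    """Wrap each observable in a STIX 2.1 indicator and return a bundle dict."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    objects = []
    for kind, value in observables:
        if kind == "ipv4":
            pattern = f"[ipv4-addr:value = '{value}']"
        else:
            pattern = f"[file:hashes.'SHA-256' = '{value}']"
        objects.append({
            "type": "indicator",
            "spec_version": "2.1",
            "id": f"indicator--{uuid.uuid4()}",
            "created": now,
            "modified": now,
            "pattern": pattern,
            "pattern_type": "stix",
            "valid_from": now,
        })
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objects}


if __name__ == "__main__":
    sample = "The implant beacons to 203.0.113.7. Dropper hash: " + "ab" * 32
    print(json.dumps(to_stix_bundle(extract_observables(sample)), indent=2))
```

Anything beyond simple indicators (threat actors, malware, relationships between them) is where the NLP part would actually be needed; this sketch does not attempt it.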

Current Workaround

None.

Proposed Solution

Create a connector using NLP.

Additional Information

None.

amr-cossi commented 3 years ago

We do have some R&D being tested internally for this, but we are far from an open-source release, and we are not sure it will be possible at all, as some commercial partners are involved in the project.

2xyo commented 3 years ago

For reference https://trial.elemendar.com/

Our free to use trial AI engine translates your CTI uploads and Threat Intel from the Web from their human authored content into machine readable and actionable data in STIX 2.0 now incorporating MITRE ATT&CK™.

(Disappointing result after a rapid test)

Lee-Elemendar commented 3 years ago

Hi there, we noticed your comment about our AI for CTI READ application at trial.elemendar.com. We are sorry to hear you experienced a disappointing result. Our accuracy is always improving. Please do try more tests/documents and let us know if you have specific comments/errors. We will shortly be releasing an OpenCTI connector, so all feedback is really important to us.

Thank you, Lee - Elemendar Open CTI Project Admin

2xyo commented 3 years ago

On this topic, TRAM v1.0.0 was released a few days ago at https://github.com/center-for-threat-informed-defense/tram

TRAM enables researchers to test and refine Machine Learning (ML) models for identifying ATT&CK techniques in prose-based cyber threat intel reports and allows threat intel analysts to train ML models and validate ML results.

Right now, it is only possible to train a model to recognize ATT&CK techniques. Not sure that entity/relationship extraction is on the roadmap.

2xyo commented 3 years ago

Also on this topic: "Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph" https://www.sciencedirect.com/science/article/pii/S0950705121007863

Open-CyKG: an Open Cyber Threat Intelligence (CTI) Knowledge Graph (KG) framework that is constructed using an attention-based neural Open Information Extraction (OIE) model to extract valuable cyber threat information from unstructured Advanced Persistent Threat (APT) reports. More specifically, we first identify relevant entities by developing a neural cybersecurity Named Entity Recognizer (NER) that aids in labeling relation triples generated by the OIE model. Afterwards, the extracted structured data is canonicalized to build the KG by employing fusion techniques using word embeddings.

Notebook: https://github.com/IS5882/Open-CyKG

2xyo commented 2 years ago

And some recent work of @fkie:

nor3th commented 2 years ago

Is there a point in developing this connector? Extracting a STIX bundle from a PDF file is a pain in the ass. Wouldn't a feasible alternative be to simply ask the creator of the PDF file to ship the STIX bundle as JSON?

2xyo commented 2 years ago

@nor3th : Agree, it's 100% a pain in the ass.

As I'm no longer a full-time CTI analyst who has to work with PDF or HTML files, it's no longer my problem. :grin:

However, I have a thought for all analysts who have to deal with unstructured documents (txt/pdf/html). :zipper_mouth_face:

And asking authors to provide STIX packages is, IMHO, a nice dream. The simplest use case is manual ingestion of public data from CTI companies' blog posts. They provide STIX2 only to paying customers.

SamuelHassine commented 1 year ago

Covered by the import-report connector, and will be part of ongoing work on full-text indexing and NLP in the core platform.