OpenCTI-Platform / connectors

OpenCTI Connectors
https://www.opencti.io
Apache License 2.0

Report Import Connector does not import Domains with certain top level domains #430

Closed: securitiz closed this issue 3 years ago

securitiz commented 3 years ago

Problem to Solve

The Report Import connector does not process domains with certain top-level domains (TLDs). I attempted to import a list of thousands of domains, and the only ones that weren't processed had the following TLDs: online, net, dog, fun, team, live, site, salon, tech, group, app.

It appears that these top level domains are referenced in data.py: https://github.com/fhightower/ioc-finder/blob/main/ioc_finder/data.py

Current Workaround

Input domains manually.

Proposed Solution

Update the Report Import connector to parse domains with the aforementioned TLDs. Or, let me know if this is an issue with my instance :)

Additional Information

A list of domains that I was unable to parse with the Report Import connector is attached.

richard-julien commented 3 years ago

Looks like ioc-finder is used directly to find the domains:

"Domain-Name.value": ioc_finder.parse_domain_names,

Can you check whether running ioc-finder directly on your report gives the expected domains in the output?

nor3th commented 3 years ago

Hmm, this is odd. Those TLDs are present in the data.py file you mentioned. I just gave it a try, and those domains should be parseable:

>>> import ioc_finder
>>> ioc_finder.parse_domain_names(' this is a http://googaa.net domain')
['googaa.net']
>>> ioc_finder.parse_domain_names(' this is a googaa.net domain')
['googaa.net']
>>> ioc_finder.parse_domain_names(' this is a googaa.fun domain')
['googaa.fun']
>>> ioc_finder.parse_domain_names(' this is a googaa.live domain')
['googaa.live']
>>> ioc_finder.parse_domain_names(' this is a googaa.live')
['googaa.live']
>>> ioc_finder.parse_domain_names(' this is a googaa.site')
['googaa.site']
>>> ioc_finder.parse_domain_names(' this is a googaa.salon')
['googaa.salon']
>>> ioc_finder.parse_domain_names(' this is a googaa.san')
[]
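To check a whole list rather than one domain at a time, a small helper along these lines could group the misses by TLD. This is only a sketch: `missed_tlds` and the `domains.txt` file name are hypothetical, and the parser is passed in as an argument (for example `ioc_finder.parse_domain_names`) so the helper itself needs nothing beyond the standard library.

```python
from collections import Counter


def missed_tlds(domains, parse):
    """Count, per TLD, the domains that `parse` fails to return.

    `parse` is any callable that takes a text and returns a list of
    domain names, e.g. ioc_finder.parse_domain_names.
    """
    missed = Counter()
    for domain in domains:
        # Embed the domain in a sentence, as in the manual test above.
        found = {d.lower() for d in parse(f" this is a {domain} domain")}
        if domain.lower() not in found:
            tld = domain.rsplit(".", 1)[-1].lower()
            missed[tld] += 1
    return missed


# Hypothetical usage (requires ioc_finder and a file with one domain per line):
#   import ioc_finder
#   with open("domains.txt") as fh:
#       print(missed_tlds(fh.read().split(), ioc_finder.parse_domain_names))
```

An empty counter would suggest that the parser handles every TLD in the list and that the problem lies further down the import pipeline.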

Can you restart the connector with debug logging activated (CONNECTOR_LOG_LEVEL=debug in docker compose)? You should then see the report connector print a line like this for every line it parses:

"Text: {} -> extracts {}"

Can you please post those log entries for the TLDs which didn't work?

nor3th commented 3 years ago

Hey @securitiz

I think I found the reason for your issue. After parsing the text, the report import connector also checks whether the extracted text matches any existing entity. I was able to reproduce the issue with ".net" TLDs, as you can see here:

DEBUG:root:Value googa.net is whitelisted with re.compile('\\bNet\\b|\\bnet.exe\\b', re.IGNORECASE)
DEBUG:root:Value {'type': 'observable', 'category': 'Domain-Name.value', 'match': 'googa.net'} is also matched by entity tool

This causes the Net tool to be added to the report instead of the domain. The regex word-boundary anchor \b treats the dot as a boundary, so the "net" in "googa.net" matches the pattern for the Net tool. I will work on a better implementation to avoid this issue in the future.
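The \b behaviour can be reproduced with Python's re module alone. In this minimal sketch, `loose` is the pattern copied from the debug log above, while `strict` is a hypothetical tighter variant written purely for illustration; it is not necessarily what the connector will end up using.

```python
import re

# Pattern copied from the debug log: \b sees a word boundary between "." and "n".
loose = re.compile(r"\bNet\b|\bnet.exe\b", re.IGNORECASE)

# Hypothetical stricter variant (an assumption, not the connector's code):
# refuse a match that is preceded by a word character, dot, or hyphen,
# or followed by a word character, hyphen, or ".<word character>".
strict = re.compile(
    r"(?<![\w.-])(?:net\.exe|net)(?![\w-]|\.\w)",
    re.IGNORECASE,
)


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


print(first_match(loose, "googa.net"))               # net  (false positive)
print(first_match(strict, "googa.net"))              # None
print(first_match(strict, "The attacker ran net.exe to list users"))  # net.exe
print(first_match(strict, "They used Net."))         # Net
```

The lookarounds still let a sentence-ending "Net." through, because the lookahead only rejects a dot that is followed by another word character.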

Can you check which entities were added to the report when you imported that PDF file? I was not able to reproduce the issue with "dog, live, site, salon" or a few other TLDs.

securitiz commented 3 years ago

Hey @nor3th,

Great catch - you're right, it did create a Net Tool SDO. It's the only one that was created though.

nor3th commented 3 years ago

OK, I am still unsure what the issue with the other TLDs is. The other possible reason they were not imported might be the same as #433. I am now working on migrating everything back to STIX objects anyway, and then the API execution error will be gone.

Could you maybe share the PDF with the domains or let the connector run with the debug output?

securitiz commented 3 years ago

It won't be until next week that I'll be able to modify and redeploy the compose, so here is the full list of domains in the meantime. all_kaseya_domains_github.txt

nor3th commented 3 years ago

Thanks for the list. After improving the whitelist and converting the whole data ingestion to STIX objects, all 1221 domains were imported successfully :) I wasn't able to reproduce the issue with any of the other TLDs, but since the current official connector executes 2442 consecutive API requests for this list (one to create each observable and another to create the relationship between the report and that observable), I wouldn't be surprised if #433 caused the import issue for a handful of domains. I'll push an update for the connector in a few days.

securitiz commented 3 years ago

thanks @nor3th!