docintelapp / DocIntel

Open Source Platform for storing, organizing, and searching documents related to cyber threats
https://docintel.org

synapse ingest failure for some URLs #37

Open brennane opened 1 year ago

brennane commented 1 year ago

This is for the synapse-cortex component. Some URLs are not valid Storm syntax when written unquoted, for example where a "," appears in the URI:

[ inet:url=https://www.example.com/hello,world.html ] 

needs to be

[ inet:url="https://www.example.com/hello,world.html" ]

ancailliau commented 1 year ago

The extraction is incomplete. These errors will have no impact; the URL will just be ignored.

brennane commented 1 year ago

It may affect all the URLs scraped from the document, not only the one that fails to parse. It will be confusing to an analyst why the URLs from some source document don't get indexed, since these errors are hidden from the web application. The error log showed something like 20 URLs being attempted in one large node-add operation.

storm> [  inet:url=https://www.example.com/hello ]
...........................
inet:url=https://www.example.com/hello
        :base = https://www.example.com/hello
        :fqdn = www.example.com
        :params = 
        :path = /hello
        :port = 443
        :proto = https
        .created = 2022/12/16 21:34:39.438
complete. 1 nodes in 59 ms (16/sec).

storm> [  inet:url=https://www.example.com/hello2,  inet:url=https://www.example.com/hello3,world  ]
...https://www.example.com/hello2,  inet:url=https://www.exampl...
                                 ^
Syntax Error: Unexpected token ',' at line 1, column 43, expecting one of: (, ), *, +, +(, -, -(, ., :$, <(, ], absolute property name, relative property name, universal property
complete. 0 nodes in 8 ms (0/sec).

storm> inet:url
inet:url=https://www.example.com/hello
        :base = https://www.example.com/hello
        :fqdn = www.example.com
        :params = 
        :path = /hello
        :port = 443
        :proto = https
        .created = 2022/12/16 21:34:39.438
complete. 1 nodes in 1 ms (1000/sec).
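
One possible way to avoid the syntax error above is to quote every URL value when the edit query is assembled. The following is only a sketch in Python, not DocIntel's actual code: the helper names `quote_storm_value` and `build_url_edit` are hypothetical, and the escaping assumes that Storm double-quoted strings treat backslash as the escape character.

```python
def quote_storm_value(value: str) -> str:
    """Wrap a value in double quotes for use in a Storm query.

    Assumption: backslash is the escape character inside Storm
    double-quoted strings, so backslashes and quotes are escaped.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_url_edit(urls: list) -> str:
    """Build a single Storm edit block adding each URL as a quoted inet:url.

    Nodes inside the edit brackets are separated by whitespace, not commas.
    """
    parts = " ".join(f"inet:url={quote_storm_value(u)}" for u in urls)
    return f"[ {parts} ]"


if __name__ == "__main__":
    query = build_url_edit([
        "https://www.example.com/hello2",
        "https://www.example.com/hello3,world",
    ])
    print(query)
```

Because a single parse failure rejects the whole edit block, quoting every value (rather than only those known to contain a comma) keeps one unusual URL from dropping the rest of the batch.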