Remove the current multi-process scan code (that nobody uses anyway) and replace it with generic, all-in-one worker processes.
## Architecture

Architecture should be similar to the `BrowserCluster`, but with processes instead of threads.

- [ ] Use `method(:my_handler)` callbacks rather than `proc {}`s to help out the GC.
    - `proc` closures retain their environment and we'll need to store a lot of callbacks.
- [ ] Take advantage of copy-on-write, so preload as much data as possible prior to forking.
- [ ] Use `Arachni::RPC` for communication.
    - Use UNIX sockets when available, otherwise TCP/IP.
    - Disable SSL.
    - Disable compression.
- [ ] Maybe have Dispatchers expose workers.
    - This will allow multiple machines to share one scan's workload when set up in Grid mode.
    - Similar to the existing multi-process system, but much more efficient.
- [ ] Should auto-scale by using #695.
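The first two points above can be sketched together. This is a hypothetical illustration, not Arachni API, and it assumes a platform with `fork` (e.g. Linux): a `Method` object carries no captured environment, so storing many of them doesn't pin surrounding objects against the GC the way `proc {}` closures would, and data preloaded before `fork` is shared with the children copy-on-write.

```ruby
# A Method reference, unlike a proc {} closure, retains no environment.
def handle_result(result)
  result.to_s.upcase
end

CALLBACKS = Array.new(1_000) { method(:handle_result) } # cheap, no captured env

# Preload as much read-only data as possible in the parent process...
PRELOADED = 'x' * 1024 # stand-in for e.g. signatures, fingerprints, wordlists

# ...then fork; the child shares the parent's heap pages copy-on-write and
# only pays for pages it actually writes to.
reader, writer = IO.pipe
pid = fork do
  reader.close
  writer.puts PRELOADED.bytesize # child reads the shared data without copying it
  writer.close
end
writer.close
child_report = reader.read.to_i
Process.wait(pid)
```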
## Responsibilities

The workers should perform actions like:

- [ ] HTML/XML parsing.
    - Can cause 100% CPU usage when parsing very large documents, thus blocking the scan.
    - The `Trainer` will benefit massively from this, since it does a lot of parsing during page audits.
    - Should also perform the subsequent handling of the parsed document and send back the result, instead of sending back the parsed document -- otherwise there's no point to it.
- [ ] `Arachni::Support::Signature` processing.
    - Signature generation, refinement and matching can cause 100% CPU usage when dealing with very large data sets, thus blocking the scan.
- [ ] Manage browser processes.
    - The system already launches Ruby life-line processes to ensure that PhantomJS processes don't zombie-out if the parent process disappears for whatever reason.
    - Since we're going to have the workers anyway, let them deal with that as well, to keep the overall number of processes to a minimum.
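The parsing hand-off above can be sketched as follows. The framing is ad hoc for illustration (the real thing would speak `Arachni::RPC`, and Arachni parses with Nokogiri rather than the stdlib's REXML used here to stay self-contained): the scan process ships the raw document to a forked worker over a UNIX socket and gets back only the handled result, never the parsed document itself.

```ruby
require 'socket'
require 'rexml/document'

parent_sock, child_sock = UNIXSocket.pair

pid = fork do
  parent_sock.close
  xml = child_sock.read # the expensive parse happens in the worker...
  doc = REXML::Document.new(xml)
  # ...as does the subsequent handling; only the result (here, a link count)
  # crosses back over the socket, not the parsed document.
  child_sock.write(REXML::XPath.match(doc, '//a').size.to_s)
  child_sock.close
  exit!(0)
end

child_sock.close
parent_sock.write('<html><body><a href="/1"/><a href="/2"/></body></html>')
parent_sock.close_write # signal EOF so the worker's read returns
link_count = parent_sock.read.to_i
Process.wait(pid)
```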
Another, less time-consuming approach would be to re-purpose the multi-process code and change it so that:

- The master Instance will only:
    - Crawl.
    - Distribute the workload (i.e. discovered pages, element deduplication, etc.).
    - Store issues and other report data as provided by slave Instances.
- The slave Instances will:
    - Perform page audits.
    - Be completely dispensable -- not used to persist any data of any value.
    - Have a TTL counted in page audits.

This will put an end to memory-leak issues once and for all. It will also take care of non-leak high RAM usage when processing large pages, since process memory will never have the chance to grow too large due to the GC holding on to memory as an optimization instead of freeing it.

- The `BrowserCluster` will be a separate service.
    - Completely dispensable.
    - Job callbacks should include the Instance's address and auth token.
    - TTL counted in jobs.
    - No-downtime TTL handling:
        - Have the dying service spawn its own replacement.
        - Notify clients of the switch of address.
        - Hand off the existing state and workload to the spawn.
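The no-downtime TTL hand-off can be simulated in-process as a toy sketch. Every class and method name here is hypothetical (real services would do this over RPC, in separate OS processes): once a worker exhausts its TTL, it spawns a successor, hands over its state and remaining workload, and the client is pointed at the successor's address.

```ruby
class TTLWorker
  attr_reader :address, :state, :pending

  def initialize(address, ttl:, state: {}, pending: [])
    @address, @ttl, @state, @pending = address, ttl, state, pending
  end

  def submit(job)
    @pending << job
  end

  def run_one
    @state[@pending.shift] = :done
    @ttl -= 1
  end

  def expired?
    @ttl <= 0
  end

  # The dying worker spawns its own replacement and hands off state and
  # workload; clients get notified of the successor's address.
  def retire(new_ttl: 3)
    TTLWorker.new(@address.succ, ttl: new_ttl, state: @state, pending: @pending)
  end
end

worker = TTLWorker.new('worker-1', ttl: 2)
5.times { |i| worker.submit("job-#{i}") }

until worker.pending.empty?
  worker.run_one
  # TTL exhausted mid-workload: swap in the replacement with no downtime.
  worker = worker.retire if worker.expired? && worker.pending.any?
end
```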