Before running, make sure to remove the testing files from the DomainOutput and TwitterOutput directories:
cd Post-Processor
python3 processor.py
The post-processor also supports multiprocessing for better performance. To use this feature, run python3 processor.py -num_procs=x -limit=y, where x is the number of processes to use and y is the memory limit (in bytes) on the local data, after which the data is written to disk. Increasing -limit will prevent memory errors but may reduce performance. Recommended usage: python3 processor.py -num_procs=10 -limit=5000000
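To make the two flags concrete, here is a minimal sketch of how a script could parse -num_procs and -limit and write buffered data to disk once it passes the byte limit. It is an illustration only, not the actual code in processor.py: the names parse_args and SpillBuffer, the chunk_N.json file naming, the default values, and the way size is measured (UTF-8 length of each row) are all assumptions.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv):
    """Parse the two single-dash flags described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Flag parsing sketch")
    parser.add_argument("-num_procs", type=int, default=1,
                        help="number of worker processes")
    parser.add_argument("-limit", type=int, default=5_000_000,
                        help="bytes of local data to hold before writing to disk")
    return parser.parse_args(argv)


class SpillBuffer:
    """Hold rows in memory and write them to disk once they exceed `limit` bytes.

    Hypothetical class: the real processor may track and store data differently.
    """

    def __init__(self, limit, out_dir):
        self.limit = limit
        self.out_dir = Path(out_dir)
        self.rows = []
        self.size = 0
        self.chunks_written = 0

    def add(self, row):
        """Append one string row and flush if the byte limit is exceeded."""
        self.rows.append(row)
        self.size += len(row.encode("utf-8"))
        if self.size > self.limit:
            self.flush()

    def flush(self):
        """Write the buffered rows to a new chunk file and reset the buffer."""
        if not self.rows:
            return
        path = self.out_dir / f"chunk_{self.chunks_written}.json"
        path.write_text(json.dumps(self.rows))
        self.chunks_written += 1
        self.rows = []
        self.size = 0
```

In a sketch like this, each worker process would own one buffer, so the total memory held at any moment is roughly num_procs times the limit, plus whatever the workers need for their own processing.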
Required files and folder structure within Post-Processor directory:
output.xlsx will include a row for URL x from DomainOutput iff:
Note: the read_from_memory flag can be turned on and off manually in the main section of processor.py. When resuming the processor after an interruption, run the program with read_from_memory set to True.
Test was a branch that was archived. It can be restored with the following command: git checkout -b Test archive/Test
It was archived like this:
Another great resource