This implements Channels for processing lines concurrently. It uses a single-writer (the file reader), multiple-reader (tasks that run each line through the regex) layout.
You can customize the Channel size with --threads #.
By default it uses 4 tasks, which results in roughly the following performance:
Extraction time: 59s
Addresses extracted: 9,451,035
Read lines total: 9,451,035
Read lines rate: 160,331/s
- Run regex x10,000,001 | Took 2.9m (at ~17μs per)
- Generate capture x10,000,000 | Took 35s (at ~3μs per)
- Check length x10,000,000 | Took 6s (at ~1μs per)
- Strict Match x10,000,000 | Took 21s (at ~2μs per)
- Domain filter x10,000,000 | Took 16s (at ~2μs per)
- Filter invalids x9,628,689 | Took 15s (at ~2μs per)
- Filter quotes x9,451,035 | Took 10s (at ~1μs per)
- TLD Filter x9,451,035 | Took 4s (at ~0μs per)
- Read file x1 | Took 59s (at ~59s per)
- Read line x10,000,001 | Took 58s (at ~6μs per)
Extraction time: 54s
Addresses extracted: 9,451,035
Read lines total: 9,451,035
Read lines rate: 175,389/s
- Run regex x10,000,001 | Took 2.6m (at ~16μs per)
- Generate capture x10,000,000 | Took 32s (at ~3μs per)
- Check length x10,000,000 | Took 6s (at ~1μs per)
- Strict Match x10,000,000 | Took 18s (at ~2μs per)
- Domain filter x10,000,000 | Took 16s (at ~2μs per)
- Filter invalids x9,628,689 | Took 14s (at ~1μs per)
- Filter quotes x9,451,035 | Took 8s (at ~1μs per)
- TLD Filter x9,451,035 | Took 3s (at ~0μs per)
- Read file x1 | Took 54s (at ~54s per)
- Read line x10,000,001 | Took 53s (at ~5μs per)
I tacked on a 0 to get --threads 40 and reached about 180k lines/s, but the returns diminish at some point, so I'm not sure what the "optimal" channel size is. YMMV with the --threads option.
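For readers unfamiliar with the pattern, here is a minimal sketch of a single-writer, multiple-reader pipeline built on System.Threading.Channels. It is not the code from this PR: the `ChannelSketch` class, the `CountMatchesAsync` method, the simplified regex, and the `readerTasks` and `capacity` parameters are all made up for illustration, and how `--threads` maps onto them in the real implementation is not shown here.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class ChannelSketch
{
    // Hypothetical pattern, far simpler than the project's real regex.
    private static readonly Regex Pattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    // readerTasks and capacity are illustrative parameters, not the PR's options.
    public static async Task<int> CountMatchesAsync(
        TextReader input, int readerTasks = 4, int capacity = 4096)
    {
        // Bounded, so the file reader cannot run arbitrarily far ahead of the regex tasks.
        var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            SingleWriter = true,
            SingleReader = false
        });

        int matches = 0;

        // Multiple readers: each task pulls lines until the writer completes the channel.
        Task[] workers = Enumerable.Range(0, readerTasks).Select(_ => Task.Run(async () =>
        {
            await foreach (string line in channel.Reader.ReadAllAsync())
            {
                if (Pattern.IsMatch(line))
                    Interlocked.Increment(ref matches);
            }
        })).ToArray();

        // Single writer: read the input line by line and hand each line to the channel.
        string? next;
        while ((next = await input.ReadLineAsync()) != null)
            await channel.Writer.WriteAsync(next);

        // Signal that no more lines are coming so the readers can finish.
        channel.Writer.Complete();

        await Task.WhenAll(workers);
        return matches;
    }
}
```

With a bounded channel the writer waits whenever the readers fall behind, which is one plausible reason why adding more tasks stops helping once the file read itself becomes the bottleneck.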
I also cleaned things up so that testing against the 1.9 GB test dump no longer writes to disk afterwards: -o and -r now accept "" from the command line, and nothing is saved when they are empty. Previously arg[0] was used, so providing string.Empty would throw an out-of-bounds exception.
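A rough illustration of the empty-argument case, assuming that `arg[0]` refers to indexing the first character of the argument string (the PR text does not say so explicitly). `OutputPathSketch`, `ShouldSave`, and `IndexingEmptyStringThrows` are hypothetical names, not identifiers from the project.

```csharp
using System;

public static class OutputPathSketch
{
    // True when the value passed to -o / -r should result in a file being written.
    // An empty string (or null) means "do not save".
    public static bool ShouldSave(string? path) => !string.IsNullOrEmpty(path);

    // Demonstrates the old failure mode under the assumption above:
    // reading index 0 of an empty string throws IndexOutOfRangeException.
    public static bool IndexingEmptyStringThrows()
    {
        try
        {
            _ = string.Empty[0];
            return false;
        }
        catch (IndexOutOfRangeException)
        {
            return true;
        }
    }
}
```

Checking with `string.IsNullOrEmpty` before touching the value avoids the exception and doubles as the "skip saving" switch.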
I cleaned up DebugPerformanceStack considerably so that it properly handles being called from multiple threads at the same time.
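One way to make per-step timing counters safe to update from several tasks at once is sketched below. This is not the actual DebugPerformanceStack code; `PerfCountersSketch`, `Record`, and `Get` are invented names, and the real class may use a different approach entirely (locks, per-thread buffers, and so on).

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class PerfCountersSketch
{
    // Mutable counters for one named step; fields are updated atomically via Interlocked.
    private sealed class Entry
    {
        public long Count;
        public long Ticks;
    }

    private readonly ConcurrentDictionary<string, Entry> _entries =
        new ConcurrentDictionary<string, Entry>();

    // Safe to call concurrently: GetOrAdd returns one shared Entry per step name,
    // and the two counters are incremented without locks.
    public void Record(string step, TimeSpan elapsed)
    {
        Entry entry = _entries.GetOrAdd(step, _ => new Entry());
        Interlocked.Increment(ref entry.Count);
        Interlocked.Add(ref entry.Ticks, elapsed.Ticks);
    }

    // Returns how often a step ran and its accumulated time; zeros if it never ran.
    public (long Count, TimeSpan Total) Get(string step)
    {
        if (!_entries.TryGetValue(step, out var entry))
            return (0, TimeSpan.Zero);

        return (Interlocked.Read(ref entry.Count),
                TimeSpan.FromTicks(Interlocked.Read(ref entry.Ticks)));
    }
}
```

Each worker would wrap a step in a `Stopwatch` and call something like `Record("Run regex", stopwatch.Elapsed)`; the totals can then be printed once all tasks have finished.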
I also changed --debug so that it prints a full stack trace when an exception occurs, which should make #54 easier to pinpoint.
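The difference between the two output modes can be sketched as follows. `ErrorReportSketch` and `Describe` are hypothetical names and the `debug` flag stands in for however the tool tracks `--debug`; the only .NET behavior relied on is that `Exception.ToString()` includes the exception type, message, and stack trace, while `Exception.Message` carries only the message.

```csharp
using System;

public static class ErrorReportSketch
{
    // With debug enabled, return the full exception text (type, message, stack trace);
    // otherwise return just the short message.
    public static string Describe(Exception ex, bool debug) =>
        debug ? ex.ToString() : ex.Message;
}
```

A top-level `catch (Exception ex)` could then write `ErrorReportSketch.Describe(ex, debug)` to standard error.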