HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

Implemented Channels #55

Closed by GStefanowich 1 year ago

GStefanowich commented 1 year ago

This implements Channels to run through lines concurrently. It uses a single-writer format (the file reader) with multiple readers (which run the lines through the Regex).

You can customize the Channel size with --threads #
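The single-writer, multiple-reader layout described above can be sketched roughly as follows. This is a Python analogue built on `queue.Queue` and threads, not the project's C# Channels code; the function name, the simplified regex, and the `workers` and `capacity` parameters are all assumptions made for illustration.

```python
import queue
import re
import threading

# Deliberately simple pattern for illustration; NOT the project's regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_DONE = object()  # sentinel telling a worker that no more lines are coming


def extract_addresses(lines, workers=4, capacity=1024):
    """One producer feeds a bounded queue; `workers` consumers run the regex."""
    channel = queue.Queue(maxsize=capacity)
    found = set()
    found_lock = threading.Lock()

    def consume():
        while True:
            line = channel.get()
            if line is _DONE:
                break
            matches = EMAIL_RE.findall(line)
            if matches:
                with found_lock:  # the result set is shared by all workers
                    found.update(m.lower() for m in matches)

    threads = [threading.Thread(target=consume) for _ in range(workers)]
    for t in threads:
        t.start()

    # Single writer: in the real tool this role is played by the file reader.
    for line in lines:
        channel.put(line)  # blocks while the queue is full (back-pressure)
    for _ in threads:
        channel.put(_DONE)  # one sentinel per worker so each one exits
    for t in threads:
        t.join()
    return found
```

Calling `extract_addresses(open(path, encoding="utf-8"), workers=4)` would stream a file through four consumers. Python threads share one interpreter lock, so this shows the structure only and will not reproduce the throughput figures from the C# implementation.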

By default it will use 4 tasks, which will result in roughly the following performance:

Extraction time: 59s
Addresses extracted: 9,451,035
Read lines total: 9,451,035
Read lines rate: 160,331/s

 - Run regex x10,000,001 | Took 2.9m (at ~17μs per)
   - Generate capture x10,000,000 | Took 35s (at ~3μs per)
   - Check length     x10,000,000 | Took 6s (at ~1μs per)
   - Strict Match     x10,000,000 | Took 21s (at ~2μs per)
   - Domain filter    x10,000,000 | Took 16s (at ~2μs per)
   - Filter invalids  x9,628,689 | Took 15s (at ~2μs per)
   - Filter quotes    x9,451,035 | Took 10s (at ~1μs per)
   - TLD Filter       x9,451,035 | Took 4s (at ~0μs per)
 - Read file x1 | Took 59s (at ~59s per)
   - Read line x10,000,001 | Took 58s (at ~6μs per)

Extraction time: 54s
Addresses extracted: 9,451,035
Read lines total: 9,451,035
Read lines rate: 175,389/s

 - Run regex x10,000,001 | Took 2.6m (at ~16μs per)
   - Generate capture x10,000,000 | Took 32s (at ~3μs per)
   - Check length     x10,000,000 | Took 6s (at ~1μs per)
   - Strict Match     x10,000,000 | Took 18s (at ~2μs per)
   - Domain filter    x10,000,000 | Took 16s (at ~2μs per)
   - Filter invalids  x9,628,689 | Took 14s (at ~1μs per)
   - Filter quotes    x9,451,035 | Took 8s (at ~1μs per)
   - TLD Filter       x9,451,035 | Took 3s (at ~0μs per)
 - Read file x1 | Took 54s (at ~54s per)
   - Read line x10,000,001 | Took 53s (at ~5μs per)

I tacked on a 0 to get --threads 40 and pushed 180k/s, but at some point there are diminishing returns, so I'm not sure what the "optimal" channel size is. YMMV using the --threads option.
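As a rough illustration of those diminishing returns, the arithmetic below compares the quoted rates. The 180k/s figure is approximate as stated, and reading the reader share as a hint that the single file reader is the limiting factor is an assumption drawn from the timings above, not something measured separately.

```python
def gain(before, after):
    """Relative throughput gain, as a fraction of the starting rate."""
    return after / before - 1


rate_4_tasks = 175_389   # lines/s, the faster of the two default runs
rate_40_tasks = 180_000  # "180k/s" as quoted, so only approximate

print(f"4 -> 40 tasks: {gain(rate_4_tasks, rate_40_tasks):.1%}")  # 2.6%

# Share of the 54s extraction that the single reader spent in "Read line".
reader_share = 53 / 54
print(f"reader busy: {reader_share:.1%}")  # 98.1%
```

Ten times as many tasks bought about 2.6% more throughput, while the reader was already busy for nearly the whole run.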


I also cleaned up a few things so that testing with the 1.9 GB test dump no longer keeps writing to disk afterwards: -o and -r now accept "" from the command line and skip saving when given it. Previously arg[0] was used, so providing string.Empty would throw an out-of-bounds exception.
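The empty-string case can be sketched like this. It is a Python analogue, not the project's C# argument parser; the function name, the default file name, and the guess that the first character was inspected to detect a following flag are all hypothetical.

```python
def parse_output_option(value, default="addresses_output.txt"):
    """Return the path to save to, or None when saving is disabled.

    `value` is the raw string following a flag such as -o; None means the
    flag was not supplied. The default file name is a made-up placeholder.
    """
    if value is None:
        return default
    # Test for emptiness BEFORE touching value[0]: indexing "" raises
    # IndexError, the Python counterpart of the out-of-bounds exception.
    if value == "":
        return None
    # Assumption: the first character is checked to catch a missing value,
    # e.g. "-o -r", where the next flag would be mistaken for a path.
    if value[0] == "-":
        raise ValueError(f"expected a path, got flag {value!r}")
    return value
```

With this ordering, `parse_output_option("")` returns `None` and the caller can skip writing entirely.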


I cleaned up DebugPerformanceStack quite a bit so that it properly handles being called from multiple threads at the same time.
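A minimal sketch of a timing recorder that tolerates concurrent callers is shown below. It is not the actual DebugPerformanceStack; the class name `PerfRecorder` and its methods are hypothetical, and a plain lock is just one way to protect the shared totals.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class PerfRecorder:
    """Accumulates call counts and elapsed time per step name, thread-safely."""

    def __init__(self):
        self._lock = threading.Lock()
        # step name -> [call count, total seconds]
        self._totals = defaultdict(lambda: [0, 0.0])

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:  # only one thread updates the totals at a time
                entry = self._totals[name]
                entry[0] += 1
                entry[1] += elapsed

    def report(self):
        """Return a snapshot: {name: (count, seconds)}."""
        with self._lock:
            return {n: (c, s) for n, (c, s) in self._totals.items()}
```

Each worker wraps its work in `with recorder.step("Run regex"):`. Summing per-thread time this way would also be consistent with the regex total (2.6m) exceeding the wall-clock extraction time (54s) in the output above, though that reading is an inference.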


I also changed it so that --debug prints a full stack trace when an exception occurs, so #54 should be easier to pinpoint.
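The debug behaviour can be pictured along these lines. This is a Python sketch under assumptions, not the project's C# code: the `run` wrapper, its `debug` flag, and the short non-debug message format are invented for illustration.

```python
import sys
import traceback


def run(action, debug=False, stream=sys.stderr):
    """Run `action`; on failure print a short message, or a full trace if debug."""
    try:
        action()
        return 0
    except Exception as exc:
        if debug:
            # Full stack trace, including the frames that led to the error.
            stream.write(traceback.format_exc())
        else:
            stream.write(f"Error: {exc}\n")
        return 1
```

Without the flag the user sees one line; with it, the trace names the failing function and line, which is what makes an issue like #54 easier to locate.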