HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License

JSON exception on run #69

Closed by troyhunt 11 months ago

troyhunt commented 11 months ago

I'm seeing a JSON exception that appears to be raised when parsing the list of allowable TLDs. It happens on multiple different machines, with a clean build, etc. Can anyone see what's going on here?

[28/12/2023 10:02:12] Found 1 files:
[28/12/2023 10:02:12]   .csv   1 files: 0.4 Gb
[28/12/2023 10:02:12] Output will be saved to "D:\Temp\File.txt".
[28/12/2023 10:02:12] Report will be saved to "report.txt".
Press ANY KEY to continue ['Q' to Quit; 'I' for info]:
[28/12/2023 10:02:13] Extracting...
[28/12/2023 10:02:13] Reading "D:\Temp\file.csv" [0.4 Gb]
[28/12/2023 10:02:13] An error occurred while parsing 'D:\Temp\file.csv'L3: 'N' is invalid after a single JSON value. Expected end of data. Path: $ | LineNumber: 0 | BytePositionInLine: 12623.
[28/12/2023 10:02:13] An error occurred while parsing 'D:\Temp\file.csv'L2: 'N' is invalid after a single JSON value. Expected end of data. Path: $ | LineNumber: 0 | BytePositionInLine: 12623.
[28/12/2023 10:02:13] An error occurred while parsing 'D:\Temp\file.csv'L4: 'N' is invalid after a single JSON value. Expected end of data. Path: $ | LineNumber: 0 | BytePositionInLine: 12623.
[28/12/2023 10:02:13] An error occurred while parsing 'D:\Temp\file.csv'L5: 'N' is invalid after a single JSON value. Expected end of data. Path: $ | LineNumber: 0 | BytePositionInLine: 12623.
Continue? [y/n]:
alirobe commented 11 months ago

Using the JsonSerializer to read csv? TldFilter.cs:106

GStefanowich commented 11 months ago

@troyhunt

Can you run this again with the --debug flag so that a stacktrace is printed?

Not sure exactly where JSON would be failing here. The only place it's used is to cache the IANA TLD list so that it's not hit every time the program is run.


@alirobe

JsonSerializer isn't used to read the .csv itself; the exception is bubbling up while the .csv parser is running.

GStefanowich commented 11 months ago

It may also be that the cached tld.json is simply corrupted somehow. Could you check yours?

I'll see about adding some more checks to it
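
For anyone following along, here is a minimal sketch of the kind of check being discussed. It is written in Python purely for illustration; the project itself is C#, and this is not the project's code. The names `load_tld_cache`, `write_atomically`, and `fetch` are hypothetical. The "invalid after a single JSON value" message typically means extra content follows an otherwise complete JSON document, so the sketch treats any unparsable or oddly shaped cache as disposable and rebuilds it. It also writes the cache through a temporary file and a rename, on the assumption (not confirmed in this thread) that a partial or overlapping write could leave trailing data behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List


def load_tld_cache(cache_path: Path, fetch: Callable[[], List[str]]) -> List[str]:
    """Return the cached TLD list, rebuilding the cache if it is missing or unparsable."""
    try:
        data = json.loads(cache_path.read_text(encoding="utf-8"))
        # Accept only the shape we expect: a non-empty list of strings.
        if isinstance(data, list) and data and all(isinstance(t, str) for t in data):
            return data
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON (including trailing data).
        pass

    # Cache was unusable: fetch a fresh list and replace the file.
    tlds = fetch()
    write_atomically(cache_path, json.dumps(tlds))
    return tlds


def write_atomically(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Here `fetch` stands in for whatever downloads the IANA list. With this shape, a corrupted cache costs one extra download instead of an exception on every parsed line.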

steves-bits commented 11 months ago

I must be doing it wrong as I do not see an error.

[28/12/2023 8:24:02 PM] Read 9,925,000 lines from "TestBreachDataV2.txt"
[28/12/2023 8:24:02 PM] Read 9,950,000 lines from "TestBreachDataV2.txt"
[28/12/2023 8:24:02 PM] Read 9,975,000 lines from "TestBreachDataV2.txt"
[28/12/2023 8:24:03 PM] Read 10,000,000 lines from "TestBreachDataV2.txt"
[28/12/2023 8:24:03 PM] Finished reading files
[28/12/2023 8:24:03 PM] Extraction time: 1.2m
[28/12/2023 8:24:03 PM] Addresses extracted: 10,000,000
[28/12/2023 8:24:03 PM] Read lines total: 10,000,000
[28/12/2023 8:24:03 PM] Read lines rate: 134,473/s

[28/12/2023 8:24:03 PM] Saving to disk..
[28/12/2023 8:24:28 PM] Addresses saved to addresses_output.txt
[28/12/2023 8:24:28 PM] Report saved to report.txt

C:\Users\steph\Downloads\EmailAddressExtractor-main\src\bin\Debug\net8.0\AddressExtractor.exe (process 81092) exited with code 0.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

troyhunt commented 11 months ago

Thanks folks, it seems more stable after the latest PR. I'm travelling at the moment, so it's hard to devote time to debugging this properly. I should have added that it was running OK in debug mode from VS but failing when run directly via the exe. Anyway, let's see how this goes now, thanks all 😊