cramppet / regulator

Automated learning of regexes for DNS discovery

"IndexError: list index out of range" with large file #3

Closed: hisxo closed this issue 2 years ago

hisxo commented 2 years ago

Hello,

First of all, thanks for your tool. I really like this slightly different approach of generating subdomain lists in a contextualized way.

I tried it with some of my recon data and unfortunately, with a large file (~15K subdomains), the tool doesn't work as expected.

I can't see anything specific in the log:

2022-10-17 05:12:18,032 - root - INFO - REGULATOR starting: MAX_RATIO=25.0, THRESHOLD=500
2022-10-17 05:12:18,129 - root - INFO - Loaded 15752 observations
2022-10-17 05:12:18,129 - root - INFO - Building table of all pairwise distances...
2022-10-17 05:18:20,965 - root - INFO - Table building complete
2022-10-17 05:18:20,966 - root - INFO - k=2
2022-10-17 05:22:04,814 - root - INFO - k=3
2022-10-17 05:26:04,571 - root - INFO - k=4
2022-10-17 05:30:39,711 - root - INFO - k=5
2022-10-17 05:35:37,222 - root - INFO - k=6
2022-10-17 05:41:41,556 - root - INFO - k=7
2022-10-17 05:49:15,489 - root - INFO - k=8
2022-10-17 05:57:37,004 - root - INFO - k=9
2022-10-17 06:07:44,429 - root - INFO - Prefix=0
[...]

After some minutes (~30 min):

Traceback (most recent call last):
  File "main.py", line 285, in <module>
    main()
  File "main.py", line 251, in main
    last, prefixes = None, sorted(list(set([first_token(k) for k in trie.keys(ngram)])))
  File "main.py", line 251, in <listcomp>
    last, prefixes = None, sorted(list(set([first_token(k) for k in trie.keys(ngram)])))
  File "main.py", line 205, in first_token
    return tokens[0][0][0]
IndexError: list index out of range

Thanks!

Regards.

cramppet commented 2 years ago

Hi there, thanks for submitting this issue!

From what you've shown, it looks like some of your input data might be malformed.

You can reproduce this error by supplying an input list that contains .example.com for the example.com domain. Running the tool with this input will cause the error you're seeing.

I've pushed a change to the main branch that performs some input validation before using the supplied data. You should be able to quickly tell if things are being filtered as the log file will now contain any invalid hostnames detected before starting to build the memoization table.
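For readers who want to pre-clean their own lists, the kind of validation described above can be sketched as follows. This is a minimal illustration, not the code that was pushed to the repository: the names `LABEL_RE`, `is_valid_hostname`, and `filter_hostnames` are hypothetical, and the label rules (1 to 63 characters of letters, digits, hyphens, or underscores, with no leading or trailing hyphen) are an assumption. The warning text mirrors the message quoted later in this thread.

```python
import logging
import re

# Assumed label rule: 1-63 chars, letters/digits/hyphen/underscore,
# no leading or trailing hyphen. The real tool may use different rules.
LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9_-]{1,63}(?<!-)")


def is_valid_hostname(hostname: str, domain: str) -> bool:
    """Return True if hostname is a well-formed subdomain of domain."""
    hostname = hostname.strip().lower().rstrip(".")
    suffix = "." + domain.lower()
    if not hostname.endswith(suffix):
        return False
    # Everything left of the domain; empty for an input like ".example.com".
    sub = hostname[: -len(suffix)]
    # An empty label (leading dot or a doubled dot) fails the 1-char minimum.
    return all(LABEL_RE.fullmatch(label) for label in sub.split("."))


def filter_hostnames(hostnames, domain):
    """Split hostnames into (valid, rejected), logging each rejected entry."""
    valid, rejected = [], []
    for host in hostnames:
        if is_valid_hostname(host, domain):
            valid.append(host)
        else:
            logging.warning("Rejecting malformed input: %s", host)
            rejected.append(host)
    return valid, rejected
```

Calling `filter_hostnames(lines, "example.com")` on the raw input before running the tool would separate entries such as .example.com from usable ones, so the rejected list can be inspected by hand.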

Please let me know if this resolves the problem for you!

hisxo commented 2 years ago

Hi @cramppet

Thanks a lot, I confirm! I was sure my file was clean before I started generating subdomains, but I was wrong 😅

So I confirm it works perfectly; the logs now display a warning with "Rejecting malformed input".

Cheers!