SAP / credential-digger

A GitHub scanning tool that identifies hardcoded credentials and filters out false positives using machine learning models
Apache License 2.0

PasswordModel tokenizer error #214

Open marcorosa opened 3 years ago

marcorosa commented 3 years ago

Sometimes the scan fails due to a tokenizer error raised by the PasswordModel.

For example, when scanning the repo https://github.com/wuest-amiconsult/BTP-Day2-Bookshop-Exercise:

Exception in thread credentialdigger@https://github.com/wuest-amiconsult/BTP-Day2-Bookshop-Exercise:                                                              
Traceback (most recent call last):                                                                                                                                
  File "/usr/local/Cellar/python@3.9/3.9.7_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner                  
    self.run()                                                                                                                                                    
  File "/usr/local/Cellar/python@3.9/3.9.7_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 910, in run                               
    self._target(*self._args, **self._kwargs)                                                                                                                     
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/credentialdigger-4.5.0-py3.9.egg/credentialdigger/client.py", line 793, in scan    
    return self._scan(                                                                                                                                            
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/credentialdigger-4.5.0-py3.9.egg/credentialdigger/client.py", line 1142, in _scan  
    self._analyze_discoveries(mm, password_discoveries, debug)                                                                                                    
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/credentialdigger-4.5.0-py3.9.egg/credentialdigger/client.py", line 1225, in _analyze_discoveries
    model_manager.launch_model_batch(discoveries)
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/credentialdigger-4.5.0-py3.9.egg/credentialdigger/models/model_manager.py", line 66, in launch_model_batch
    return self.model.analyze_batch(discoveries)
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/credentialdigger-4.5.0-py3.9.egg/credentialdigger/models/password_model.py", line 50, in analyze_batch
    data = self._pre_process([d['snippet'] for d in new_discoveries])
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/credentialdigger-4.5.0-py3.9.egg/credentialdigger/models/password_model.py", line 105, in _pre_process
    encodings = self.tokenizer(snippet,
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2404, in __call__
    return self.batch_encode_plus(
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2589, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 720, in _batch_encode_plus
    batch_outputs = self._batch_prepare_for_model(
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 792, in _batch_prepare_for_model
    batch_outputs = self.pad(
  File "/Users/marco/git/credential-digger/venv3/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2714, in pad
    raise ValueError(
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []
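
The message "but you provided []" suggests the tokenizer received an empty batch, which would happen if no discoveries were left by the time `_pre_process` was called. That is an inference from the traceback, not a confirmed root cause. A minimal sketch of a guard, assuming that cause, is shown below. `pre_process_or_skip` and `fake_tokenizer` are hypothetical names used only for illustration and are not part of credentialdigger or transformers. The `truncation` and `padding` arguments are likewise assumptions, since the real call is truncated in the traceback.

```python
from typing import Callable, Optional, Sequence


def pre_process_or_skip(
    snippets: Sequence[str],
    tokenizer: Callable[..., dict],
) -> Optional[dict]:
    """Tokenize snippets, or return None when the batch is empty."""
    # Assumption: an empty batch is what triggers the ValueError in pad(),
    # so the tokenizer is never called when there is nothing to encode.
    if not snippets:
        return None
    return tokenizer(list(snippets), truncation=True, padding=True)


def fake_tokenizer(batch, **kwargs):
    """Hypothetical stand-in for a tokenizer; fails on an empty batch."""
    if not batch:
        raise ValueError("includes input_ids, but you provided []")
    # One dummy token id per snippet, just to have a non-empty encoding.
    return {"input_ids": [[len(text)] for text in batch]}


empty_result = pre_process_or_skip([], fake_tokenizer)
filled_result = pre_process_or_skip(["password = 'hunter2'"], fake_tokenizer)
```

With a guard of this kind, the caller would also need to handle the `None` case, for example by skipping the model for that batch.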

marcorosa commented 2 years ago

Fix released in #228

marcorosa commented 2 years ago

This error was raised again, so it was not properly fixed.