beta6 / PassGAN

Generative Adversarial Network Password Generator. Updated, improved, and working version
MIT License

Dictionary size #5

Open kart61 opened 1 year ago

kart61 commented 1 year ago

Could you please tell me which dictionary is best to use for training your model? For example, if I set the "--seq-length" parameter to 12, should I filter the dictionary and keep only words exactly 12 characters long? Is it better to keep words of mixed lengths, or, as a middle ground, only words from 8 to 12 characters? And is there any recommendation on the number of words in the dictionary?
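(For what it's worth, trimming a wordlist to a length range is easy enough; here is a minimal Python sketch of what I had in mind, with file names that are just placeholders:)

```python
# Minimal sketch: keep only words of 8-12 characters.
# "rockyou.txt" / "train_8_12.txt" are placeholder file names.
min_len, max_len = 8, 12

with open("rockyou.txt", encoding="latin-1", errors="ignore") as src, \
     open("train_8_12.txt", "w", encoding="latin-1") as dst:
    for line in src:
        word = line.rstrip("\n")
        if min_len <= len(word) <= max_len:
            dst.write(word + "\n")
```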

beta6 commented 1 year ago

I recommend that you don't train your own dictionary, because in general the three already-trained dictionaries are enough. If you want a seq length >= 11 and a big dictionary, you can train one, but it will require a lot of RAM. Note that if you set the seq length to 12, for example, it will train on lengths of 12 characters or less (1-12). For more info, read the following article about dictionary training with PassGAN.

https://www.tuxrincon.com/es/blog/entrena-passgan-con-tu-propio-diccionario/

Sorry for the late response. Let me know if you need more help, and gl hf. ;)

kart61 commented 1 year ago

Thank you very much for your answer. You really do need a lot of RAM.
The article also helped a lot in understanding the settings, but I still don't really understand how far it is safe to deviate from the default values. The two models I trained did not perform very well; I attribute this to the mediocre quality of the dictionaries I used, and perhaps to inflated expectations :-) Hence the idea of testing, for example, a dictionary with word lengths from 8 to 12. If you can suggest parameters that would significantly improve the quality of the model, I would be grateful.

beta6 commented 1 year ago

Okay, I'm glad it helped, even if just a little. Generally, it's safe to play with the values of these parameters: [--save-every SAVE_EVERY] [--iters ITERS] [--batch-size BATCH_SIZE] [--seq-length SEQ_LENGTH]

You can also experiment with the other parameters, but be cautious: an improper adjustment might have unintended consequences and could either ruin your model's training for a particular case or enhance its performance: [--layer-dim LAYER_DIM] [--critic-iters CRITIC_ITERS] [--lambda LAMB]. The default values are a safe bet. I haven't experimented with these values, so I can't tell you whether one value of --layer-dim works better than another; trial and error should provide some insight.
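For example, a training run that touches only the "safe" parameters could look something like this. It's just a sketch: the four flags are the ones from train.py's usage string above, while --training-data and --output-dir are assumed from the upstream PassGAN CLI, and the path and values are placeholders, not tuned recommendations:

```python
# Hedged sketch of a training run using only the "safe" flags.
# --training-data / --output-dir assumed from the upstream PassGAN CLI;
# the data path and numeric values are placeholders.
import subprocess

subprocess.run(
    [
        "python", "train.py",
        "--training-data", "data/train_8_12.txt",  # placeholder path
        "--output-dir", "output",                  # assumed flag
        "--seq-length", "12",    # trains lengths 1-12, as noted above
        "--batch-size", "64",
        "--save-every", "5000",  # checkpoint interval (iterations)
        "--iters", "200000",
    ],
    check=True,
)
```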

An important note: you should generate a dictionary from the resulting model (inference) and then combine it with the training dictionary. The recommended size of the generated dictionary for the pretrained models ranges from hundreds of millions to thousands of millions of words, or more. You can refer to the article at https://www.tuxrincon.com/es/blog/passgan-como-generar-archivos-de-passwords-con-samplepy/ to learn how to generate dictionaries using sample.py from PassGAN; a rough sketch also follows below.

For faster password generation, you can raise the batch-size parameter; it's generally safe to set a high value here.
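Putting both steps together, a rough sketch (the sample.py flag names are assumed from the upstream PassGAN CLI, and the checkpoint and file paths are placeholders):

```python
# Hedged sketch: generate candidates with sample.py using a large batch
# size, then merge the output into the training dictionary.
# Flag names assumed from the upstream PassGAN CLI; paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "sample.py",
        "--input-dir", "output",                           # assumed flag
        "--checkpoint", "output/checkpoints/195000.ckpt",  # placeholder
        "--output", "generated.txt",
        "--num-samples", "10000000",
        "--batch-size", "8192",  # high values are generally safe here
    ],
    check=True,
)

# Combine generated passwords with the training dictionary, dropping
# duplicates. An in-memory set is fine for modest sizes; for hundreds of
# millions of words, prefer external tools such as `sort -u`.
seen = set()
with open("combined.txt", "w", encoding="latin-1") as dst:
    for path in ("data/train_8_12.txt", "generated.txt"):
        with open(path, encoding="latin-1", errors="ignore") as src:
            for line in src:
                word = line.rstrip("\n")
                if word and word not in seen:
                    seen.add(word)
                    dst.write(word + "\n")
```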

Think of the passwords in the dictionary as patterns for your model to learn and reproduce. The larger the dictionary, the better, especially when patterns repeat many times with different content.

I hope this resolves your doubts; if not, let me know. Thanks for using this software.