Open armijnhemel opened 7 months ago
After having thought a bit more: you might want to make removing non-printable characters optional (but enabled by default), in case you would like to use symbols for source-to-source matching, where they could be relevant.
I have some experience with building large databases with strings extracted from source code. Some of my findings:
libbb/lineedit.c
from a recent version of BusyBox. The whole list of characters that I am currently removing:

Currently you are not doing those cleanups. On the other hand, you are stripping regular strings, where (I think) whitespace could be relevant. If you want to clean up, then you should at least be consistent :-)
My advice: do not strip strings, ignore empty or whitespace-only strings, and remove non-printable characters.
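
To make the advice concrete, here is a minimal sketch in Python. The function name `clean_string` and the `remove_non_printable` flag are hypothetical, not part of any existing API. The definition of "non-printable" is also an assumption: it uses `str.isprintable()` and keeps tabs and newlines, which is not necessarily the same as the list of characters I remove myself.

```python
def clean_string(s, remove_non_printable=True):
    """Return the cleaned string, or None if the string should be ignored.

    Hypothetical helper: the name, the flag, and the definition of
    "non-printable" used here are assumptions made for illustration.
    """
    if remove_non_printable:
        # Assumption: keep anything str.isprintable() accepts, plus
        # ordinary whitespace (tab, newline, carriage return).
        s = "".join(ch for ch in s if ch.isprintable() or ch in "\t\n\r")

    # Ignore empty strings and whitespace-only strings.
    if not s or s.isspace():
        return None

    # Deliberately no s.strip(): leading and trailing whitespace
    # could be relevant for matching.
    return s
```

With the flag turned off, the string is only checked for being empty or whitespace-only and is otherwise returned untouched, which is what you would want for source-to-source matching.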