xgettext: don't strip by default, ignore empty strings and all whitespace strings, process special characters such as the "bell character"

I have some experience with building large databases with strings extracted from source code. Some of my findings:

- Ignore empty strings: many extracted strings are empty, and these are useless for anything related to matching.
- Ignore whitespace-only strings: some strings consist only of whitespace (before stripping). These tend to be useless as well.
- Remove non-printable characters: quite a few characters cannot be printed, such as the ASCII bell character. You might want to remove these. A test example would be the file libbb/lineedit.c from a recent version of BusyBox. The whole list of characters that I am currently removing:

 ['\a', '\b', '\v', '\f', '\x01', '\x02', '\x03', '\x04',
  '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12',
  '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19',
  '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', '\x7f']

Currently you are not doing those cleanups. On the other hand, you are stripping regular strings, where (I think) whitespace could be relevant. If you want to clean up, then at least you should be consistent :-)

My advice: do not strip strings, ignore empty or whitespace-only strings, and remove non-printable characters.
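To make the suggestion concrete, here is a minimal sketch of the cleanup in Python. It is not source-inspector's code: the names `NON_PRINTABLE` and `clean_strings` are hypothetical, and the choice to also drop strings that become empty once the non-printable characters are removed is my assumption.

```python
# Hypothetical constant: the characters listed above.
# Note that tab, newline and carriage return are deliberately kept.
NON_PRINTABLE = [
    '\a', '\b', '\v', '\f', '\x01', '\x02', '\x03', '\x04',
    '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12',
    '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19',
    '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', '\x7f',
]

# Translation table that deletes every character in NON_PRINTABLE.
_DELETE_TABLE = str.maketrans('', '', ''.join(NON_PRINTABLE))


def clean_strings(strings):
    """Yield cleaned strings, skipping empty and whitespace-only ones.

    Surrounding whitespace is preserved: no strip() is applied.
    """
    for s in strings:
        # Remove non-printable characters first.
        cleaned = s.translate(_DELETE_TABLE)
        # Assumption: a string that is empty or whitespace-only
        # after the removal is as useless as one that started so.
        if not cleaned or cleaned.isspace():
            continue
        yield cleaned
```

For example, `list(clean_strings(["", "   ", " hello ", "beep\a", "\x1b"]))` gives `[" hello ", "beep"]`: the leading and trailing spaces of `" hello "` survive, while the empty, whitespace-only and escape-only strings are dropped.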

aboutcode-org / source-inspector #11