aboutcode-org / source-inspector

Tools to inspect source code and code symbols

xgettext: don't strip by default, ignore empty strings and all whitespace strings, process special characters such as the "bell character" #11

Open armijnhemel opened 7 months ago

armijnhemel commented 7 months ago

I have some experience with building large databases with strings extracted from source code. Some of my findings:

 ['\a', '\b', '\v', '\f', '\x01', '\x02', '\x03', '\x04',
  '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12',
  '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19',
  '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', '\x7f']

Currently you are not doing those cleanups. On the other hand, you are stripping regular strings, where (I think) whitespace could be relevant. If you want to clean up, then at least you should be consistent :-)

My advice: do not strip strings, ignore empty or whitespace-only strings, and remove non-printable characters.
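
A minimal Python sketch of that advice, not the project's actual code: `clean_strings`, `NON_PRINTABLE`, and `_DELETE_TABLE` are hypothetical names, and the character list is the one given in this issue. It assumes the extracted strings arrive as an iterable of Python `str` values.

```python
# Characters listed in this issue; tab, newline and carriage return are kept.
NON_PRINTABLE = [
    '\a', '\b', '\v', '\f', '\x01', '\x02', '\x03', '\x04',
    '\x05', '\x06', '\x0e', '\x0f', '\x10', '\x11', '\x12',
    '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19',
    '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', '\x7f',
]

# str.translate table mapping each listed character to None (deletion).
_DELETE_TABLE = {ord(c): None for c in NON_PRINTABLE}


def clean_strings(strings):
    """Hypothetical helper: yield cleaned strings, skipping empty ones."""
    for s in strings:
        # Remove the non-printable characters wherever they occur.
        cleaned = s.translate(_DELETE_TABLE)
        # Ignore empty and whitespace-only strings.
        if not cleaned.strip():
            continue
        # Yield without stripping, so leading/trailing whitespace survives.
        yield cleaned
```

For example, `list(clean_strings(["  hello ", "", "ding\a!"]))` keeps the padding around `hello`, drops the empty string, and removes the bell character from the last entry.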

armijnhemel commented 7 months ago

After having thought a bit more, you might want to make removing non-printable characters optional (but enabled by default) in case you would like to use symbols for source-to-source matching, as there they could be relevant.
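
One way the optional behaviour could look, as a sketch only: `normalize_string` and its `remove_non_printable` parameter are hypothetical names, not part of source-inspector. The regex covers the same characters as the list in the first comment (`\x07`, `\x08`, `\x0b`, `\x0c` are `\a`, `\b`, `\v`, `\f`).

```python
import re

# Same character set as the list in this issue, written as regex ranges.
NON_PRINTABLE_RE = re.compile(r'[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]')


def normalize_string(s, remove_non_printable=True):
    """Hypothetical helper: return the cleaned string, or None to skip it."""
    if remove_non_printable:
        # Enabled by default; pass False for source-to-source matching.
        s = NON_PRINTABLE_RE.sub('', s)
    # Skip empty and whitespace-only strings, but never strip kept ones.
    return s if s.strip() else None
```

With the default, `normalize_string("a\ab")` drops the bell character; with `remove_non_printable=False` the string is returned unchanged.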