Broken regexp in is_relative_path()

aboutcode-org / scancode-toolkit

:mag: ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and others generous sponsors!

https://aboutcode.org/scancode/

2.12k stars 547 forks source link

Broken regexp in is_relative_path() #531

Open jwilk opened 7 years ago

jwilk commented 7 years ago

textcode.strings.is_relative_path() is completely broken:

>>> from textcode.strings import is_relative_path
>>> is_relative_path("/usr")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jwilk/.local/lib/python2.7/site-packages/textcode/strings.py", line 200, in
    relative = re.compile('^(?:([^/]|\.\.)[\w_\-]+/.*$', re.IGNORECASE).match
  File "/usr/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

Tested against git HEAD (bca0af4a6ff10f4ef0bef6459f2f06a17d21f9f6).

Found using pydiatra.

pombredanne commented 7 years ago

Thanks! @jwilk I like how you use your tools. e.g. https://github.com/jwilk/pydiatra/ Very smart and duly noticed! would you care to submit a patch? Note that the code you find this bug in is actually not used and never was, though it could and should!

jwilk commented 7 years ago

I don't quite understand the purpose of this function, so I'll somebody else patch it.

pombredanne commented 7 years ago

@jwilk sure. I will handle this and thanks again for the report. I am impressed by your tool! FWIW, the purpose is/was to detect if a string extracted from a binary may be a relative file path of sorts

pombredanne commented 7 years ago

In the end using a naive bayesian classifier may be more efficient. TBD.

tina1998612 commented 7 years ago

@pombredanne Will the problem be solved if I change the re.compile to this?: relative = re.compile(r'^(?:[^/]|\.+\/|\/)(([\w_\-]+\/)+|)[\w_\-]+\.[\w]+$', re.IGNORECASE).match

the parenthesis problem is fixed and I made some changes to make it more robust (I think xD have tested at regexr.com and it works fine ;)

but I'm not sure if it requires suffix such as.html and .png in every relative path we want to find here

Thanks!

pombredanne commented 7 years ago

@tina1998612 may be. The only way to know is to write unit tests. Note though that this code is NOT used for now so working on fixing this is very low priority.

armijnhemel commented 7 years ago

Why not use Python's built-in function?

https://docs.python.org/2/library/os.path.html#os.path.isabs

pombredanne commented 7 years ago

@armijnhemel this definitely is a possibility, but you might need to call both posixpath and ntpath rather than the higher level os.path wrapper as the paths in some arbitrary binary may be either posix or Windows irrespective of the os you are running. e.g. os.path would resolve to ntpath on Windows and therefore a function using os.path would likely fail to handle posix paths conventions when analysing an Elf on Windows

armijnhemel commented 7 years ago

Right, I hadn't thought of that. I guess you should also throw splitdrive() and splitunc() in the mix to be complete.

pombredanne commented 7 years ago

@armijnhemel yep, though that should not apply for detecting relative paths, but applies otherwise. Ah! things would be less fun without the joy of supporting non-posix platforms, would they?

armijnhemel commented 7 years ago

So what is the use case of this? In build artefacts I encounter all sorts of paths.

pombredanne commented 7 years ago

@armijnhemel the use case there (which has not yet been implemented) is that given a bunch of strings from a random binary, we should be able to classify each string so extract some meaning. Is this a path? a relative path? some symbol? some literal? such that with that you could get better clues towards some origin detection and/or matching. At second thoughts, this not-implemented-yet attempt may not be needed or may not be the right approach. I had some experiments on using a naive bayes classifier, but I did not pursue as getting a proper large enough training data set was too complicated.