DjangoAdminHackers / django-linkcheck

An app that will analyze and report on links in any model that you register with it. Links can be bare (urls or image and file fields) or embedded in HTML (linkcheck handles the parsing). It's fairly easy to override methods of the Linkcheck object should you need to do anything more complicated (like generate URLs from slug fields etc).
BSD 3-Clause "New" or "Revised" License
75 stars 26 forks source link

Fix error when checking video links with hash anchors #151

Closed timobrembeck closed 1 year ago

timobrembeck commented 1 year ago

Fix NotImplementedError/AssertionError when checking video links with hash anchors

I struggled to find/generate a smaller video file for this test case, I ended up with a small excerpt from this stock video, do you think it's already sufficient to not include this file in the MANIFEST.in, or should I explicitly exclude it? Or do you think the file should be shipped with the package so the tests are complete?

Fixes: #150

claudep commented 1 year ago

Did you try with the .mp4 as an empty file? Does it change anything to testing?

timobrembeck commented 1 year ago

Did you try with the .mp4 as an empty file? Does it change anything to testing?

Yes, then the parsing works just fine and it does not throw the exception, but only shows 200 OK, broken external hash anchor, because it thinks the document is empty and the anchor target is missing...

claudep commented 1 year ago

I wonder if we may check the response content type and not try to run _check_anchor on non-text content. We can still keep the new exception catching you introduced (as content-type can be faked), but that would avoid most of the time trying to parse big binary content.

claudep commented 1 year ago

Did you try with the .mp4 as an empty file? Does it change anything to testing?

Yes, then the parsing works just fine and it does not throw the exception, but only shows 200 OK, broken external hash anchor, because it thinks the document is empty and the anchor target is missing...

Then maybe a minimal binary file, even not looking like a real mp4?

timobrembeck commented 1 year ago

I wonder if we may check the response content type and not try to run _check_anchor on non-text content. We can still keep the new exception catching you introduced (as content-type can be faked), but that would avoid most of the time trying to parse big binary content.

Yeah, good idea, I will adapt the code. :+1:

Then maybe a minimal binary file, even not looking like a real mp4?

I tried to generate different random binary files of varying sizes, with no luck. So it seems to be something format-specific about some mp4 files. Would have to dig deeper into the traceback to understand which byte sequence exactly causes the error, but don't think it's worth the effort though. If you want, I can remove the test case again.

claudep commented 1 year ago

Then maybe a minimal binary file, even not looking like a real mp4?

I tried to generate different random binary files of varying sizes, with no luck. So it seems to be something format-specific about some mp4 files. Would have to dig deeper into the traceback to understand which byte sequence exactly causes the error, but don't think it's worth the effort though. If you want, I can remove the test case again.

Found that starting the binary string with <![ (e.g. something like b'<![x02\x00\xa0\xcc') is reproducing the crash. I'd rather use that than a 3Mb file.

timobrembeck commented 1 year ago

Found that starting the binary string with <![ (e.g. something like b'<![x02\x00\xa0\xcc') is reproducing the crash. I'd rather use that than a 3Mb file.

Awesome, thanks for figuring that out! :+1: