ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

NUL byte in <link> href confuses libxml2-lxml parser #459

Open JustAnotherArchivist opened 3 years ago

JustAnotherArchivist commented 3 years ago

Via ArchiveBot job b4cobsdfap6j2kzjo3i4jwnsx:

wpull --recursive --no-verbose --no-parent --html-parser libxml2-lxml https://www.e-gov.am/gov-decrees/item/23174/

This recurses to wonderful URLs such as https://www.e-gov.am/gov-decrees/item/23174/1clip_themedata.thmx%22%20rel=%22themeData%22%20/%3E (and it only gets worse from there).

The page contains these three <link> tags with NUL bytes (^@):

<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1^@1clip_filelist.xml" rel="File-List" />
<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1^@1clip_themedata.thmx" rel="themeData" />
<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1^@1clip_colorschememapping.xml" rel="colorSchemeMapping" />

This only happens with the libxml2-lxml parser; the html5lib parser handles it correctly, i.e. does not extract any extra URLs.
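As a possible stopgap until the parser behaviour is understood, the input or the extracted links could be sanitised. The following is a minimal sketch using only the standard library, not wpull's actual code: `strip_nul_bytes` and `is_crawlable` are hypothetical helper names, and the sketch assumes that removing NUL bytes before parsing, or discarding candidate URLs with control characters or non-HTTP schemes, is acceptable for the crawl.

```python
from urllib.parse import urlsplit


def strip_nul_bytes(html: bytes) -> bytes:
    """Hypothetical pre-parse step: drop NUL bytes from the raw document."""
    return html.replace(b"\x00", b"")


def is_crawlable(url: str) -> bool:
    """Hypothetical post-extraction filter for candidate URLs.

    Rejects anything containing ASCII control characters (including NUL)
    and anything whose scheme is not http or https.
    """
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in url):
        return False
    try:
        scheme = urlsplit(url).scheme
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs.
        return False
    return scheme in ("http", "https")


# One of the <link> tags from the page, with the NUL byte written as \x00.
sample = (
    b'<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1'
    b'\x001clip_filelist.xml" rel="File-List" />'
)
cleaned = strip_nul_bytes(sample)
```

With the NUL bytes removed, the `href` values are plain `file:///` URLs, which the scheme check rejects. Whether this belongs in the parser wrapper or in the link filter is a design question this sketch does not settle.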

Tested on two machines, both running Python 3.6.10. One has lxml 4.4.2, libxml2 2.9.4, and wpull 2.0.3; the other has lxml 4.6.2, libxml2 2.9.10, and wpull The Blocking PR 393.