ArchiveTeam / wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
https://www.archiveteam.org/
GNU General Public License v3.0

Segmentation fault (null pointer read and/or write) when reading CDX files #23

Open the-blank-x opened 8 months ago

the-blank-x commented 8 months ago

Steps to reproduce:

  1. Save the following to crashpoc.cdx:
    CDX a b a m s k r M V g u
    http://tillystranstuesdays.com/ 20240113012144 http://tillystranstuesdays.com/ text/html 200 AFQB6VVCWSKWEIAEJADJZAFMXOEGHO57 - - 1358 tillystranstuesdays.warc.gz <urn:uuid:4489ae0e-2e7d-482d-bff6-e86b02a3d719>
  2. Run wget-at --warc-dedup=crashpoc.cdx --warc-file=test https://example.com

Expected behavior: wget-at downloads https://example.com into test.warc.gz

Actual behavior:

> ~/gits/wget-lua/src/wget --warc-dedup=crashpoc.cdx --warc-file=test https://example.com
zsh: segmentation fault (core dumped)  ~/gits/wget-lua/src/wget --warc-dedup=crashpoc.cdx --warc-file=test

Additional information:

> ~/gits/wget-lua/src/wget -V
GNU Wget 1.21.3-at.20231215.01 built on linux-gnu.

-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls 
+ntlm +opie +psl +ssl/gnutls 

Wgetrc: 
    /usr/local/etc/wgetrc (system)
Locale: 
    /usr/local/share/locale 
Compile: 
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/usr/local/etc/wgetrc" 
    -DLOCALEDIR="/usr/local/share/locale" -I. -I../lib -I../lib 
    -I/usr/include/luajit-2.1 -I/usr/include/p11-kit-1 -DHAVE_LIBGNUTLS 
    -DNDEBUG -ggdb -O0 
Link: 
    gcc -I/usr/include/p11-kit-1 -DHAVE_LIBGNUTLS -DNDEBUG -ggdb -O0 
    -lpcre2-8 -luuid -lidn2 -lnettle -lgnutls -lzstd -lz -lpsl -lm -ldl 
    -lluajit-5.1 ../lib/libgnu.a -lunistring 

Backtrace from GDB:

#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:76
#1  0x00005555555c0c1a in xstrdup (string=0x0) at xmalloc.c:338
#2  0x00005555555a270b in store_warc_record (uri=0x5555556126b0 "http://tillystranstuesdays.com/", date=0x0, uuid=0x555555612710 "<urn:uuid:4489ae0e-2e7d-482d-bff6-e86b02a3d719>", 
    digest=0x7fffffffe4b0 "\001`\037V\242\264\225b \004H\006\234\200\254\273\210c\273\277\377\177") at warc.c:1415
#3  0x00005555555a2a7c in warc_process_cdx_line (lineptr=0x5555556125b0 "http://tillystranstuesdays.com/", field_num_original_url=0x2, field_num_checksum=0x5, field_num_record_id=0xa) at warc.c:1520
#4  0x00005555555a2c9e in warc_load_cdx_dedup_file () at warc.c:1591
#5  0x00005555555a2e70 in warc_init () at warc.c:1658
#6  0x0000555555591d4d in main (argc=0x4, argv=0x7fffffffe488) at main.c:2088

store_warc_record is called with a null pointer as its second parameter:

https://github.com/ArchiveTeam/wget-lua/blob/c1fe6093eda544fc7a933f7646225bec1ff4bd8d/src/warc.c#L1520

store_warc_record doesn't check for null pointers, so the null date reaches xstrdup and segfaults:

https://github.com/ArchiveTeam/wget-lua/blob/c1fe6093eda544fc7a933f7646225bec1ff4bd8d/src/warc.c#L1405-L1422
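One possible shape for a guard is sketched below. This is not the code in warc.c: the struct dedup_record type and the dup_or_null and fill_dedup_record helpers are hypothetical names made up for illustration, and treating the date as optional is an assumption. The point is only that a missing field is checked before it can reach strlen.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; the real struct in warc.c may differ. */
struct dedup_record
{
  char *uri;
  char *date;
  char *uuid;
};

/* Hypothetical helper: copy a string, mapping NULL to NULL instead of
   handing it to strlen () the way xstrdup (NULL) does. */
static char *
dup_or_null (const char *s)
{
  if (s == NULL)
    return NULL;
  size_t n = strlen (s) + 1;
  char *copy = malloc (n);
  if (copy != NULL)
    memcpy (copy, s, n);
  return copy;
}

/* Fill REC from the parsed CDX fields.  URI and UUID are required;
   DATE is assumed to be optional.  Returns false and leaves every
   member of REC set to NULL when a required field is missing or an
   allocation fails. */
static bool
fill_dedup_record (struct dedup_record *rec, const char *uri,
                   const char *date, const char *uuid)
{
  rec->uri = NULL;
  rec->date = NULL;
  rec->uuid = NULL;

  if (uri == NULL || uuid == NULL)
    return false;

  rec->uri = dup_or_null (uri);
  rec->date = dup_or_null (date);
  rec->uuid = dup_or_null (uuid);

  if (rec->uri == NULL || rec->uuid == NULL
      || (date != NULL && rec->date == NULL))
    {
      /* Allocation failure: release whatever was copied. */
      free (rec->uri);
      free (rec->date);
      free (rec->uuid);
      rec->uri = rec->date = rec->uuid = NULL;
      return false;
    }
  return true;
}
```

With the values from the backtrace (a valid uri and uuid, date=0x0), this sketch returns true and leaves rec->date as NULL rather than crashing. Whether a line without a date should instead be rejected is a design decision for the maintainers.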

When I was initially diagnosing this issue, I got a segfault from another area:

https://github.com/ArchiveTeam/wget-lua/blob/c1fe6093eda544fc7a933f7646225bec1ff4bd8d/src/warc.c#L1511-L1519

digest is uninitialised when it is written to, which can cause a segfault or memory corruption. (In my case, digest was 0x0, but after recompiling with -ggdb -O0 it became some random writable pointer.)
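A minimal sketch of the safer pattern follows, assuming the digest is the 20-byte SHA-1 obtained by base32-decoding the checksum field of the CDX line. The decode_sha1_base32 and checksum_matches helpers are hypothetical and written from scratch here; they are not wget-at's decoder. What matters is that the caller passes real storage (a fixed-size array) instead of a pointer that was never assigned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SHA1_DIGEST_SIZE 20  /* a SHA-1 digest is 160 bits */
#define SHA1_BASE32_LEN 32   /* 160 bits / 5 bits per base32 character */

/* Map one RFC 4648 base32 character to its 5-bit value, or -1.
   Assumes an ASCII-compatible character set. */
static int
base32_value (char c)
{
  if (c >= 'A' && c <= 'Z')
    return c - 'A';
  if (c >= '2' && c <= '7')
    return c - '2' + 26;
  return -1;
}

/* Hypothetical helper: decode a 32-character base32 checksum into OUT.
   OUT must point to at least SHA1_DIGEST_SIZE writable bytes owned by
   the caller.  Returns false on malformed input, in which case the
   contents of OUT are unspecified. */
static bool
decode_sha1_base32 (const char *checksum, unsigned char *out)
{
  if (checksum == NULL || out == NULL
      || strlen (checksum) != SHA1_BASE32_LEN)
    return false;

  unsigned int acc = 0;  /* bit accumulator; only the low bits matter */
  int bits = 0;          /* number of undecoded bits held in acc */
  size_t n = 0;          /* bytes written so far */
  for (size_t i = 0; i < SHA1_BASE32_LEN; i++)
    {
      int v = base32_value (checksum[i]);
      if (v < 0)
        return false;
      acc = (acc << 5) | (unsigned int) v;
      bits += 5;
      if (bits >= 8)
        {
          bits -= 8;
          out[n++] = (unsigned char) ((acc >> bits) & 0xFF);
        }
    }
  return n == SHA1_DIGEST_SIZE;
}

/* Caller-side pattern: DIGEST is an array with its own storage, so the
   decoder never writes through an uninitialised pointer. */
static bool
checksum_matches (const char *checksum, const unsigned char *expected)
{
  unsigned char digest[SHA1_DIGEST_SIZE];
  if (!decode_sha1_base32 (checksum, digest))
    return false;
  return memcmp (digest, expected, SHA1_DIGEST_SIZE) == 0;
}
```

Decoding the checksum from crashpoc.cdx, AFQB6VVCWSKWEIAEJADJZAFMXOEGHO57, with this sketch yields bytes beginning 0x01 0x60 0x1F 0x56 0xA2, which agrees with the digest shown in frame #2 of the backtrace.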

Arkiver2 commented 8 months ago

Thank you for looking into this and for the detailed report! I'll push a fix for this today or tomorrow and check whether there are other cases like this.