N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License

httpHeaders Set-Cookie is single string #33

Open Austinb opened 4 years ago

Austinb commented 4 years ago

When a server response in a WARC record contains multiple Set-Cookie headers, the value of httpHeaders['Set-Cookie'] is always the last one in the list. It should be returned as an array of all the Set-Cookie values, if that change doesn't break other things; otherwise there should be another method to get all of the cookies from the header block. Another option would be to join the values with line endings (\n), so the result is still a string but can be split when needed.

Example WARC records (request and response, minus the content block):

WARC/1.0
WARC-Type: request
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
Content-Length: 296
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4

GET /index.php?route=product/search&search=Sarj&page=4 HTTP/1.1
User-Agent: CCBot/2.0 (https://commoncrawl.org/faq/)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Host: www.bpazar.com
Connection: Keep-Alive
Accept-Encoding: gzip

WARC/1.0
WARC-Type: response
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:3f3d6e43-9e5d-42ba-a111-43fcd90dd633>
Content-Length: 1043231
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-Concurrent-To: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4
WARC-Payload-Digest: sha1:N2WQFAUYKKXT6MRWSCXCQC7FOZRQCLTI
WARC-Block-Digest: sha1:S3FKWWFJ7LCYFOHUZ4RBPFAMYNQSVQMH
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Server: nginx
Date: Sat, 15 Jun 2019 21:54:44 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/
Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/
Set-Cookie: language=tr-tr; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=tr-tr
Set-Cookie: currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Nginx-Cache-Status: BYPASS
X-Server-Powered-By: Engintron
X-Crawler-Content-Encoding: gzip

Output of console.log(record.httpHeaders) when called in the record callback:

{ Server: 'nginx',
  Date: 'Sat, 15 Jun 2019 21:54:44 GMT',
  'Content-Type': 'text/html; charset=UTF-8',
  'X-Crawler-Transfer-Encoding': 'chunked',
  Connection: 'keep-alive',
  Vary: 'Accept-Encoding',
  'Set-Cookie':
   'currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com',
  'X-XSS-Protection': '1; mode=block',
  'X-Content-Type-Options': 'nosniff',
  'X-Nginx-Cache-Status': 'BYPASS',
  'X-Server-Powered-By': 'Engintron',
  'X-Crawler-Content-Encoding': 'gzip' }
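
To make the array suggestion concrete, here is a minimal sketch of a header parser that keeps every Set-Cookie value. It is not node-warc's code: parseHttpHeaders is a hypothetical standalone helper, and it assumes the raw header block is available as a string with the status line first. Folded (multi-line) headers are not handled.

```javascript
// Hypothetical helper, NOT part of node-warc: parse a raw HTTP header block
// into an object, keeping every Set-Cookie value instead of only the last.
function parseHttpHeaders (rawHeaderBlock) {
  const headers = {}
  // Split on CRLF or bare LF and skip the status line (first line).
  const lines = rawHeaderBlock.split(/\r?\n/).slice(1)
  for (const line of lines) {
    const sep = line.indexOf(':')
    if (sep === -1) continue // blank or malformed line
    const name = line.slice(0, sep).trim()
    const value = line.slice(sep + 1).trim()
    if (name.toLowerCase() === 'set-cookie') {
      // Accumulate repeated Set-Cookie headers under one canonical key.
      if (!headers['Set-Cookie']) headers['Set-Cookie'] = []
      headers['Set-Cookie'].push(value)
    } else {
      // Other headers keep the existing behaviour: last value wins.
      headers[name] = value
    }
  }
  return headers
}

// Usage with a shortened version of the response headers above.
const raw = [
  'HTTP/1.1 200 OK',
  'Server: nginx',
  'Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/',
  'Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/',
  'X-XSS-Protection: 1; mode=block'
].join('\r\n')

const parsed = parseHttpHeaders(raw)
console.log(parsed['Set-Cookie'])
```

If a string is preferred, parsed['Set-Cookie'].join('\n') gives the newline-separated form. Joining with a comma would not be safe to split again, because cookie expires dates contain commas.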

BubuAnabelas commented 4 years ago

I think that happens in the following lines (specifically line 236): https://github.com/N0taN3rd/node-warc/blob/be3897198847fa49023ca4d09f09c0010dd98540/lib/warcRecord/warcContentParsers.js#L220-L252

Maybe you could fix it and open a PR.