N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
https://n0tan3rd.github.io/Squidwarc/
Apache License 2.0
166 stars 26 forks source link

Generated WARC files fail indexing with OpenWayback cdx-indexer #32

Closed peterk closed 5 years ago

peterk commented 5 years ago

Are you submitting a bug report or a feature request?

bug report

What is the current behavior?

  1. Set up an instance of Openwayback following their Docker tutorial.
  2. Got the latest version of Squidwarc installed and ran a crawl with the sample conf.json config.
  3. Moved the resulting WARC-file to the Openwayback files directory.
  4. Ran the cdx-index command to index the WARC from inside the openwayback Docker container. cdx-indexer /data/files1/instagram.com\!visit_berlin-10-21-2018_1540123477677.warc > /data/index1.cdx
  5. Received error as per below:
Oct 21, 2018 12:08:23 PM org.archive.io.ArchiveReader$ArchiveRecordIterator hasNext
WARNING: Trying skip of failed record cleanup of {reader-identifier=/data/files1/instagram.com!visit_berlin-10-21-2018_1540123477677.warc, absolute-offset=0, WARC-Filename=instagram.com!visit_berlin-10-21-2018_1540123477677.warc, WARC-Date=2018-10-21T12:04:37Z, Content-Length=379, WARC-Record-ID=<urn:uuid:7b817fe0-d529-11e8-8ec6-391ee1fffcc8>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: **Unexpected character 57(Expecting d)**
java.io.IOException: Unexpected character 57(Expecting d)
    at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
    at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:70)
    at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
    at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.hasNext(ArchiveReaderCloseableIterator.java:37)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.main(IndexWorker.java:216)

What is the expected behavior?

The captured WARC file should be properly indexed and replayable.

What's your environment?

Squidwarc: 945e4de / 1.0 Openwayback: c49f8e720 / openwayback-core-2.4.0-SNAPSHOT

peterk commented 5 years ago

Attaching the WARC file used above. instagram.com!visit_berlin-10-21-2018_1540123477677.warc.zip

anjackson commented 5 years ago

I ran that WARC through JWATTools, and it reported these 'errors':

#
# Summary of '/Users/andy/Downloads/instagram.com!visit_berlin-10-21-2018_1540123477677.warc'
#
    Warc.isValid: false
    Warc.Records: 648
     Warc.Errors: 325
   Warc.Warnings: 0
#
# Detailed diagnosis report for '/Users/andy/Downloads/instagram.com!visit_berlin-10-21-2018_1540123477677.warc'
#
Error in '/Users/andy/Downloads/instagram.com!visit_berlin-10-21-2018_1540123477677.warc'
       Offset: 0 (0x0)
  Record Type: 'warcinfo'
         Type: INVALID_EXPECTED
       Entity: Trailing newlines
        Value: 1
     Expected: 2
Error in '/Users/andy/Downloads/instagram.com!visit_berlin-10-21-2018_1540123477677.warc'
       Offset: 645 (0x285)
  Record Type: 'metadata'
         Type: INVALID_EXPECTED
       Entity: Trailing newlines
        Value: 1
     Expected: 2
Error in '/Users/andy/Downloads/instagram.com!visit_berlin-10-21-2018_1540123477677.warc'
       Offset: 26084 (0x65e4)
  Record Type: 'request'
         Type: INVALID_EXPECTED
       Entity: Trailing newlines
        Value: 1
     Expected: 2
Error in '/Users/andy/Downloads/instagram.com!visit_berlin-10-21-2018_1540123477677.warc'
       Offset: 88776 (0x15ac8)
  Record Type: 'request'
         Type: INVALID_EXPECTED
       Entity: Trailing newlines
        Value: 1
     Expected: 2
...continues for some time...

Looking at the first record in an editor that shows the newlines (including the first line of the next):

WARC/1.0^M
WARC-Type: warcinfo^M
WARC-Date: 2018-10-21T12:04:37Z^M
WARC-Filename: instagram.com!visit_berlin-10-21-2018_1540123477677.warc^M
WARC-Record-ID: <urn:uuid:7b817fe0-d529-11e8-8ec6-391ee1fffcc8>^M
Content-Type: application/warc-fields^M
Content-Length: 379^M
^M
software: node-warc/2.1.0^M
format: WARC File Format 1.0^M
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf^M
isPartOf: undefined^M
description: High fidelity preservation using Squidwarc^M
robots: ignore^M
http-header-user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3563.0 Safari/537.36^M
^M
^M
^M
WARC/1.0^M

Versus this from a known-good WARC:

WARC/1.0^M
WARC-Type: warcinfo^M
WARC-Record-ID: <urn:uuid:fbd6cf0a-6160-4550-b343-12188dc05234>^M
WARC-Date: 2014-01-03T03:03:22Z^M
Content-Length: 196^M
Content-Type: application/warc-fields^M
WARC-Filename: live-20140103030321-wwwb-app5.us.archive.org.warc.gz^M
^M
software: LiveWeb Warc Writer 1.0^M
host: wwwb-app5.us.archive.org^M
isPartOf: liveweb^M
format: WARC file version 1.0^M
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf^M
^M
^M
WARC/1.0^M

I'm not 100% sure what's going on, but there appears to be an extra newline at the end of all records except the response records. But it looks like all records end with 3 repeated \r\n (CRLF) newlines, so I suspect while the WARC record length is being recorded correctly for response records, the payload may have a \r\n appended.

The WARCs have no WARC-Payload-Digest so it's hard to tell if the payload matches the original response.

ibnesayeed commented 5 years ago

I am wondering if these issues are also present in WARCs created by WARCreate? /CC @machawk1 @N0taN3rd

N0taN3rd commented 5 years ago

WARCreate is probably also affected by this, but the warcinfo and metadata records looks like a node-warc bug whoops....

Per the extra CRLF in the request records and probably a response record. This is due to how both WARCreate and node-warc serialize HTTP messages which is to keep the "trailing" CRLF after the header fields per rfc7230 section 3 shown below

HTTP-message   = start-line
                *( header-field CRLF )
                CRLF
                [ message-body ]

If no message body (post/put payload or response body) is present then the CRLF shown above would appear to be a third CRLF of the WARC record separator (CRLF x2) but it is not. WARC record Content-Length values are calculated using the above assumption and do not include the record separator.

If the message body (post/put payload or response body) includes a trailing CRLF

HTTP-message   = start-line
                *( header-field CRLF )
                CRLF
                message-body
                CRLF

we run into the same problem as before, the trailing CRLF appears to be apart of the record separator (CRLF x2) because both node-warc and WARCreate do not strip the trailing CRLF if they are present before inserting the record separator.

OpenWayback and JWATTools take issue with the fact that node-warc and WARCreates block fields (section 4 of the warc 1.0 and 1.1 spec) shown below

warc-file    = 1*warc-record
warc-record  = header CRLF
               block CRLF CRLF
header       = version warc-fields
version      = "WARC/1.1" CRLF
warc-fields  = *named-field CRLF
block        = *OCTET
OCTET        = <any 8-bit sequence of data>

may include trailing CRLFs.

IMO biggest issue is that OpenWayback and JWATTools expect sanitized block values not the raw data as seen by a browser. But if this node-warcs and WARCreates handling of this completely wrong then let me know and I'll fix it.

machawk1 commented 5 years ago

@anjackson Which switches are you sending to get the detailed error output in jwattools? In v0.6.6:

$ ./jwattools.sh test -e instagram.com\!visit_berlin-10-21-2018_1540123477677.warc
Showing errors: true
Validate digest: true
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.                                                
Output Thread stopped.
#
# Job summary
#
GZip files: 0
  +  Arc: 0
  + Warc: 0
 Arc files: 0
Warc files: 1
    Errors: 325
  Warnings: 0
RuntimeErr: 0
   Skipped: 0
      Time: 00:00:02 (2054 ms.)
TotalBytes: 13.4 mb
  AvgBytes: 6.7 mb/s
INVALID_EXPECTED: 325
Trailing newlines: 325
$
N0taN3rd commented 5 years ago

@machawk1 I think -e is what your looking for

jwattools test [-beilx] [-w THREADS] [-a<yyyyMMddHHmmss>] <filepattern>...

test one or more ARC/WARC/GZip files

options:

 -a<yyyyMMddHHmmss>  only test files with last-modified after <yyyyMMddHHmmss>
 -b                  tag/rename files with errors/warnings (*.bad)
 -e                  show errors
 -i --ignore-digest  skip digest calculation and validation
 -l                  relaxed URL URI validation
 -x                  to validate text/xml payload (eg. mets)
 -w<x>               set the amount of worker thread(s) (defaults to 1)
anjackson commented 5 years ago

@machawk1 It's rather clumsy: you need the -e flag but the details are placed in the i.out file.

@N0taN3rd I'm not sure what's going on. It seems the responses you are getting are appending newlines after the Content-length, and then OWB relies on the Content-length to consume the record rather than skimming to the next record header?

(FWIW I can't see any trailing newline coming from the server for e.g. https://www.instagram.com/static/bundles/base/sprite_core.png/fa9f36710aa4.png. )

peterk commented 5 years ago

Sorry for the late reply but I am still getting the same error with what I think is a newer version of node-warc, 3.0.0, (rebuilt everything today). Attaching a new file I used for the test.

instagram.com_visit_berlin-11-19-2018_1542663033873.warc.zip

WARNING: Trying skip of failed record cleanup of {reader-identifier=/data/files1/instagram_old.warc, absolute-offset=0, WARC-Filename=instagram.com!visit_berlin-10-27-2018_1540650893910.warc, WARC-Date=2018-10-27T14:34:53Z, Content-Length=363, WARC-Record-ID=<urn:uuid:78138440-d9f5-11e8-a860-ffc684cdf206>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: Unexpected character 57(Expecting d)
java.io.IOException: Unexpected character 57(Expecting d)
    at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
    at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:70)
    at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
    at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
    at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.hasNext(ArchiveReaderCloseableIterator.java:37)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
    at org.archive.wayback.resourceindex.updater.IndexClient.addSearchResults(IndexClient.java:158)
    at org.archive.wayback.resourcestore.indexer.IndexWorker.doWork(IndexWorker.java:111)
    at org.archive.wayback.resourcestore.indexer.IndexWorker$WorkerThread.run(IndexWorker.java:244)
N0taN3rd commented 5 years ago

@peterk thanks for reporting back. I have resolved the issue with openwayback and WARCs created using Squidwarc (via node-warc) Attached below are the steps I used to verify the issue is no longer present (using jwat-tools and openwayback) jwat-good owb-good

peterk commented 5 years ago

That sounds excellent! Thank you for all your work in this.

peterk commented 5 years ago

Tested on a couple of files and this works as expected now. Again, thank you!