WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

V3.1.3: update cdx format #83

Closed hannakoppelaar closed 1 year ago

hannakoppelaar commented 1 year ago

Fixes issue #70 and allows the user to specify the desired cdx output format.

hannakoppelaar commented 1 year ago

Forgot to add documentation for the cdxIndexer.format option, but now this PR is really ready for review :)

leefrank9527 commented 1 year ago

Hi @hannakoppelaar, the "String urlkey" parameter of a CDX line is not passed from CDXIndexer explicitly. An excerpt of the CDX file looks like this:

com,google-analytics)/analytics.js 20221106221337 https://www.google-analytics.com/analytics.js text/javascript 200 VZD42NGL3YEBUSGX7EX4QCVPA2QTQEMT - - 51199 249944 IAH-20221106221320901-00000-12216~dev~8443.warc
com,imrworldwide,secure-nz)/v60.js 20221106221338 https://secure-nz.imrworldwide.com/v60.js text/html 301 DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2 https://cdn-gl.imrworldwide.com:443/v60.js - 700 303332 IAH-20221106221320901-00000-12216~dev~8443.warc

Would you mind taking a look to check whether these are the expected contents?

hannakoppelaar commented 1 year ago

Was this file webcurator-store/src/test/java/org/webcurator/core/store/warc/14055/1/_resource/00000000.jdb intended to be part of the commits?

No, it was generated by a test. I've removed it and added it to .gitignore.

hannakoppelaar commented 1 year ago

Hi @leefrank9527, the default CDX format that we generate is N b a m s k r M S V g, which contains this sequence of fields:

N: massaged url
b: date
a: original url
m: mime type of original document
s: response code
k: new style checksum
r: redirect
M: meta tags
S: compressed record size
V: compressed arc file offset
g: file name

(See https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/)

It seems to me that those two lines conform to this format or am I missing something? Note that the redirect field is missing in the first line, but that's not unusual.
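
To make the field mapping concrete, here is a minimal standalone Python sketch (not code from this PR or from CDXIndexer) that splits a space-separated CDX line into the eleven fields of the N b a m s k r M S V g format. The function name parse_cdx_line is hypothetical, and treating "-" as an absent value is an assumption based on the sample lines quoted in this thread.

```python
# Field letters of the default format mentioned above, in order.
DEFAULT_FORMAT = "N b a m s k r M S V g"


def parse_cdx_line(line, cdx_format=DEFAULT_FORMAT):
    """Map each whitespace-separated value of a CDX line to its field letter.

    Hypothetical helper for illustration only. A value of "-" is
    assumed to mean "field not present" and is returned as None.
    """
    letters = cdx_format.split()
    values = line.split()
    if len(values) != len(letters):
        raise ValueError(
            "expected %d fields, got %d" % (len(letters), len(values))
        )
    return {
        letter: (None if value == "-" else value)
        for letter, value in zip(letters, values)
    }


# The two sample lines quoted earlier in this thread.
line1 = (
    "com,google-analytics)/analytics.js 20221106221337 "
    "https://www.google-analytics.com/analytics.js text/javascript 200 "
    "VZD42NGL3YEBUSGX7EX4QCVPA2QTQEMT - - 51199 249944 "
    "IAH-20221106221320901-00000-12216~dev~8443.warc"
)
line2 = (
    "com,imrworldwide,secure-nz)/v60.js 20221106221338 "
    "https://secure-nz.imrworldwide.com/v60.js text/html 301 "
    "DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2 "
    "https://cdn-gl.imrworldwide.com:443/v60.js - 700 303332 "
    "IAH-20221106221320901-00000-12216~dev~8443.warc"
)

record1 = parse_cdx_line(line1)
record2 = parse_cdx_line(line2)
```

Under these assumptions, record1["N"] holds the massaged url key and record1["r"] is None, while record2["r"] holds the redirect target, which matches the reading that both lines follow the format.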

leefrank9527 commented 1 year ago

Hi @hannakoppelaar, thank you for the response. It looks good to me now.