Closed dportabella closed 6 years ago
Thanks @dportabella for the issue. I will take a look at this tomorrow.
I've been looking a little into this, and it appears that we are using the iipc libraries to get the Url. I cannot see anything that would remove diamond brackets in the header.getUrl functions, but it seems strange that it would not exist if that is what the specifications require.
Do you have an example warc I can use to see what happens when warc url data is encased in diamond brackets?
You can produce an archive test.warc.gz as follows:
$ wget --warc-file=test.warc.gz "http://www.example.com"
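To check what a given archive actually contains, a rough sketch like the following lists the raw WARC-Target-URI values. The function name `target_uris` is hypothetical and not part of AUT or the IIPC libraries. It relies only on Python's standard `gzip` module, which reads the concatenated gzip members that WARC writers typically produce. It is a naive line scan, so a payload line that happens to start with the same header name would also be reported.

```python
import gzip


def target_uris(path):
    """Yield the raw value of every WARC-Target-URI header line in a .warc.gz file.

    Hypothetical helper for inspection only: it scans every line of the
    decompressed stream, so it does not distinguish headers from payload.
    """
    prefix = b"warc-target-uri:"
    with gzip.open(path, "rb") as stream:
        for line in stream:
            # Header names are matched case-insensitively.
            if line.lower().startswith(prefix):
                yield line[len(prefix):].strip().decode("utf-8", "replace")
```

Calling `list(target_uris(path))` on the file that wget wrote should show whether the values come back wrapped in `<` and `>`. Depending on the wget version, the file on disk may carry an extra extension, so pass whatever path was actually created.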
Here it explains that most libraries do not follow the specification: https://github.com/iipc/warc-specifications/issues/23
Since we're drawing on IIPC libraries, I'm going to suggest that at this time we wait for the libraries to catch up rather than baking this into AUT.
Thanks for sharing that issue, @dportabella. Sounds like it was an error (i.e. here as well https://github.com/iipc/warc-specifications/pull/24) - in practice apart from the example you've provided I haven't seen angled brackets.
Is this causing substantial issues on your end?
No; I remove the angled brackets on ArchiveRecord.getUrl when needed, as a workaround
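For anyone who needs the same workaround, here is a minimal sketch of the idea, written in Python for illustration only. The name `strip_angle_brackets` is hypothetical. It is not the AUT or IIPC implementation, and it assumes the URL has already been read from the record as a string.

```python
def strip_angle_brackets(url):
    """Return url without one pair of enclosing angle brackets, if present.

    Hypothetical helper: values without a matching '<' ... '>' pair are
    returned unchanged apart from surrounding whitespace being trimmed.
    """
    if url is None:
        return None
    url = url.strip()
    # Only strip when both brackets are present, so both styles are accepted.
    if len(url) >= 2 and url.startswith("<") and url.endswith(">"):
        return url[1:-1]
    return url


print(strip_angle_brackets("<http://www.example.com/>"))  # http://www.example.com/
print(strip_angle_brackets("http://www.example.com/"))    # http://www.example.com/
```

Because the check is two string comparisons per record, the cost should be small, though that has not been measured against AUT.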
OK. I think given that this is an IIPC library issue rather than an AUT issue, I'm going to close for now (my gut tells me that given our limited resources, any fix might end up hitting performance in all cases while just fixing the small number of angle-bracketed WARCs in the wild, if that makes sense).
According to https://github.com/iipc/warc-specifications/issues/23, the standard says that WARC-Target-URI should be surrounded by <>, such as in:
WARC-Target-URI: <http://www.archive.org/images/logoc.jpg>
And this is the result produced by wget:
$ wget --warc-file=test.warc.gz "http://www.example.com"
However, some tools and datasets, such as the CommonCrawl dataset, omit the angle brackets. The aut library also does not expect the angle brackets, which is incorrect. To accept both cases, would it be possible to remove the angle brackets in ArchiveRecord.getUrl when they are present?