archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

remove angle brackets from ArchiveRecord.getUrl #157

Closed dportabella closed 6 years ago

dportabella commented 6 years ago

According to https://github.com/iipc/warc-specifications/issues/23, the standard says that WARC-Target-URI should be surrounded by <>, such as in: WARC-Target-URI: <http://www.archive.org/images/logoc.jpg>

wget produces this bracketed form as well, for example when run as: $ wget --warc-file=test.warc.gz "http://www.example.com"

However, some tools and datasets, such as the CommonCrawl dataset, omit the angle brackets. The aut library also does not expect the angle brackets, which does not match the specification. In order to accept both cases, would it be possible to remove the angle brackets in ArchiveRecord.getUrl when they are present?
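As a rough illustration of the normalization being requested, here is a minimal sketch in Python. It is not aut code: `strip_angle_brackets` is a hypothetical helper name, and the sketch only assumes that the raw WARC-Target-URI value is available as a string.

```python
def strip_angle_brackets(uri: str) -> str:
    """Return the URI without one enclosing pair of angle brackets.

    Hypothetical helper for illustration only. Surrounding whitespace is
    trimmed first; a value without both brackets is returned unchanged
    (apart from that trimming).
    """
    trimmed = uri.strip()
    if len(trimmed) >= 2 and trimmed.startswith("<") and trimmed.endswith(">"):
        return trimmed[1:-1]
    return trimmed


# Both header styles normalize to the same URL.
print(strip_angle_brackets("<http://www.example.com/>"))  # http://www.example.com/
print(strip_angle_brackets("http://www.example.com/"))    # http://www.example.com/
```

Because the brackets are removed only when both are present, records written without them pass through unchanged, so both styles could be accepted.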

greebie commented 6 years ago

Thanks @dportabella for the issue. I will take a look at this tomorrow.

greebie commented 6 years ago

I've been looking into this a little, and it appears that we use the IIPC libraries to get the URL. I cannot see anything in the header.getUrl functions that would remove angle brackets, which seems strange if that is what the specification requires.

Do you have an example WARC I can use to see what happens when the URL is enclosed in angle brackets?

dportabella commented 6 years ago

You can produce an archive test.warc.gz as follows: $ wget --warc-file=test.warc.gz "http://www.example.com"
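To check which form a given archive actually uses, the WARC-Target-URI header lines can be listed directly. The following is a minimal sketch, not a real WARC parser: `target_uris` is a hypothetical name, the scan is a naive line match that could also pick up a payload line that happens to start with the same text, and it assumes a gzip-compressed file such as the test.warc.gz produced above.

```python
import gzip
from typing import Iterator


def target_uris(path: str) -> Iterator[str]:
    """Yield the raw WARC-Target-URI values found in a .warc.gz file.

    Hypothetical helper for illustration: a naive line scan, not a WARC
    parser. Values are yielded exactly as written, brackets included.
    """
    prefix = b"warc-target-uri:"
    # gzip.open reads through concatenated gzip members, which is how
    # per-record-compressed WARC files are laid out.
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if line.lower().startswith(prefix):
                value = line[len(prefix):].strip()
                yield value.decode("utf-8", errors="replace")
```

Calling `list(target_uris("test.warc.gz"))` on the wget-produced file should show whether each value starts with `<` and ends with `>`.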

This issue explains that most libraries do not follow the specification: https://github.com/iipc/warc-specifications/issues/23

ianmilligan1 commented 6 years ago

Since we're drawing on IIPC libraries, I'm going to suggest that at this time we wait for the libraries to catch up rather than baking this into AUT.

Thanks for sharing that issue, @dportabella. It sounds like it was an error (see also https://github.com/iipc/warc-specifications/pull/24). In practice, apart from the example you've provided, I haven't seen angle brackets.

Is this causing substantial issues on your end?

dportabella commented 6 years ago

> Is this causing substantial issues on your end?

No. As a workaround, I strip the angle brackets from the result of ArchiveRecord.getUrl when needed.

ianmilligan1 commented 6 years ago

OK. Given that this is an IIPC library issue rather than an AUT issue, I'm going to close this for now. My gut tells me that, given our limited resources, any fix might end up hurting performance in all cases while only fixing the small number of angle-bracketed WARCs in the wild, if that makes sense.