lintool closed this 4 years ago
There are some differences between ClueWeb12's WARCs and CommonCrawl's WARCs.
ClueWeb's WARC example:
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-10T21:51:20Z
WARC-TREC-ID: clueweb12-0000tw-00-00013
WARC-Target-URI: http://cheapcosthealthinsurance.com/2012/01/25/what-is-hiv-aids/
WARC-Payload-Digest: sha1:YZUOJNSUMFG3JVUKM6LBHMRMMHWLVNQ4
WARC-IP-Address: 100.42.59.15
WARC-Record-ID: <urn:uuid:74edc71e-a881-4942-81fc-a40db4bf1fb9>
Content-Type: application/http; msgtype=response
Content-Length: 71726
HTTP/1.1 200 OK
Date: Fri, 10 Feb 2012 21:51:22 GMT
Server: Apache/2.2.21 (Unix) mod_ssl/2.2.21 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_jk/1.2.32
X-Powered-By: PHP/5.2.17
X-Pingback: http://cheapcosthealthinsurance.com/xmlrpc.php
Link: <http://cheapcosthealthinsurance.com/?p=711>; rel=shortlink
Connection: close
Content-Type: text/html; charset=UTF-8
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
Health and Insurance uses HeatMap Ads Theme Pro v5.0 (http://heatmaptheme.com)
-->
CommonCrawl's WARC example:
WARC/1.0
WARC-Type: request
WARC-IP-Address: 145.239.193.102
WARC-Record-ID: <urn:uuid:5f5acc82-6690-4d55-b28f-04c8c5fc197d>
Content-Length: 369
WARC-Date: 2020-01-01T02:39:37Z
WARC-Target-URI: https://www.telez.fr/actus-tv/demain-nous-appartient-en-avance-resume-de-lepisode-629-de-mercredi-1er-janvier/
Content-Type: application/http; msgtype=request
WARC-Block-Digest: sha1:YANR5P3U526KQCWHA7DCDXZPGOWNGYQC
GET /actus-tv/demain-nous-appartient-en-avance-resume-de-lepisode-629-de-mercredi-1er-janvier/ HTTP/1.1
User-Agent: CCBot/3.0 (http://commoncrawl.org/faq/; info@commoncrawl.org)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Host: www.telez.fr
Connection: Keep-Alive
Accept-Encoding: gzip
WARC/1.0
WARC-Record-ID: <urn:uuid:e8461a92-520d-4aeb-9a69-f127b2f90d9d>
Content-Length: 186029
WARC-Date: 2020-01-01T02:39:37Z
WARC-Type: response
WARC-IP-Address: 145.239.193.102
WARC-Target-URI: https://www.telez.fr/actus-tv/demain-nous-appartient-en-avance-resume-de-lepisode-629-de-mercredi-1er-janvier/
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:ETUGLKV76EZAANXPJE7JREOBURSQ2KAT
WARC-Block-Digest: sha1:MKV4YOIII6TECXFHICCHUW5WWFHJO75T
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 01 Jan 2020 02:39:37 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Content-Length: 38517
Content-Length: 185409
Connection: keep-alive
Link: <https://www.telez.fr/wp-json/>; rel="https://api.w.org/"
Link: <https://www.telez.fr/?p=4923535>; rel=shortlink
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Crawler-Content-Encoding: gzip
X-UA-Device:
X-Proxy: front2
X-Cacheable: YES:Forced
X-Varnish: 119030879 132402322
Age: 28706
Via: 1.1 varnish (Varnish/5.0)
Vary: Accept-Encoding
Accept-Ranges: bytes
<!doctype html>
<!--[if lt IE 7 ]>
<html class="no-js ie lte-ie9 lte-ie8 lte-ie7 ie6" lang="fr-FR"> <![endif]-->
<!--[if IE 7 ]>
<html class="no-js ie lte-ie9 lte-ie8 lte-ie7 ie7" lang="fr-FR"> <![endif]-->
<!--[if IE 8 ]>
<html class="no-js ie lte-ie9 lte-ie8 ie8" lang="fr-FR"> <![endif]-->
<!--[if IE 9 ]>
<html class="no-js ie lte-ie9 ie9" lang="fr-FR"> <![endif]-->
<!--[if !(IE)]><! -->
<html class="fonts-loading no-js" lang="fr-FR"><!--<![endif]-->
<head>
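One visible difference in the headers above is that the ClueWeb12 response record carries a WARC-TREC-ID, while the CommonCrawl records carry only a WARC-Record-ID and come as a request/response pair. Below is a minimal Python sketch of how a reader could pick a document id under those assumptions. `parse_warc_headers` and `pick_doc_id` are hypothetical helpers written for illustration, not Anserini code.

```python
def parse_warc_headers(block: str) -> dict:
    """Parse the 'Name: value' lines of one WARC header block into a dict."""
    headers = {}
    for line in block.strip().splitlines():
        if line.startswith("WARC/"):
            # First line of a record, e.g. "WARC/1.0"
            headers["version"] = line.strip()
            continue
        # Split on the first colon only, so values such as
        # "<urn:uuid:...>" or timestamps keep their own colons.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


def pick_doc_id(headers: dict):
    """Return a doc id for response records, or None for other record types.

    Assumption: prefer WARC-TREC-ID (ClueWeb12) and fall back to
    WARC-Record-ID (CommonCrawl has no TREC id).
    """
    if headers.get("WARC-Type") != "response":
        return None  # e.g. CommonCrawl "request" records are skipped
    return headers.get("WARC-TREC-ID") or headers.get("WARC-Record-ID")


clueweb = """WARC/1.0
WARC-Type: response
WARC-TREC-ID: clueweb12-0000tw-00-00013
WARC-Record-ID: <urn:uuid:74edc71e-a881-4942-81fc-a40db4bf1fb9>"""

cc_request = """WARC/1.0
WARC-Type: request
WARC-Record-ID: <urn:uuid:5f5acc82-6690-4d55-b28f-04c8c5fc197d>"""

cc_response = """WARC/1.0
WARC-Record-ID: <urn:uuid:e8461a92-520d-4aeb-9a69-f127b2f90d9d>
WARC-Type: response"""

for sample in (clueweb, cc_request, cc_response):
    print(pick_doc_id(parse_warc_headers(sample)))
```

Falling back to WARC-Record-ID is consistent with the `<urn:uuid:...>` docids that show up in the search output later in this thread, but the exact logic used by the collection class may differ.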
Hey @MXueguang I just merged https://github.com/castorini/anserini/pull/1260
We should compare WET (from that PR) w/ WARC.
Hi @MXueguang are we ready to close this?
@MXueguang shared some preliminary results with me; I'll let him share them here. I feel like we are done with this bit, since indexing/searching seemingly works.
A search example using the example topic from https://trec-health-misinfo.github.io:
>>> from pyserini.search import SimpleSearcher
>>> searcher = SimpleSearcher('/store/collections/trec-misinfo/warc_index/')
>>> hits = searcher.search('ibuprofen COVID-19')
>>> for i in range(0, 10):
...     print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
...
1 <urn:uuid:a9a35d9b-989e-4a02-b933-70ad24b9e712> 9.51370
2 <urn:uuid:d98bdbaa-b5fa-4e39-b3a7-e4d6269219f4> 9.51370
3 <urn:uuid:c17ed5a6-d96a-471b-b6c7-226192058b8f> 9.47510
4 <urn:uuid:0b395925-b0b4-45c7-9158-34e91b9e9c64> 9.47280
5 <urn:uuid:f14fae55-1054-4d61-bda2-adca5ef7b43d> 9.45360
6 <urn:uuid:edff9bcf-2130-4c96-8aed-fba1c64ffc17> 9.44440
7 <urn:uuid:998efb65-ca44-4c26-bec5-c0d2c962ca75> 9.41200
8 <urn:uuid:c620dfe6-c9b8-45a2-bfee-969f5b1151f4> 9.40180
9 <urn:uuid:9e829fd6-712f-4249-b08c-ae3225e52647> 9.39590
10 <urn:uuid:e8e5c59b-8709-473a-ab0e-e06461b6c77f> 9.38050
Part of hits[0].contents:
'No concrete evidence ibuprofen makes COVID-19 worse: health experts | TheSpec.com No concrete evidence ibuprofen makes COVID-19 worse: health experts Living 09:22 AM by Cassandra Szklarski The Canadian Press TORONTO — Canadian health officials are trying to calm fears that anti-inflammatory drugs such as ibuprofen can worsen COVID-19 symptoms by stressing the lack of concrete evidence. Debate over whether ibuprofen products such as Advil should be bypassed for acetaminophen medications including Tylenol continues to rage among many people confused by conflicting reports spreading online. Alberta\'s medical health office offered assurances Thursday on Twitter, stating "there is no strong evidence to indicate that ibuprofen could worsen COVID-19 symptoms beyond the usual known side effects." "Until more information is available, people may wish to take paracetamol/acetaminophen to treat COVID-19 symptoms, unless advised otherwise by their doctor," said the account, run by public health staff on behalf of chief medical officer Dr. Deena Hinshaw. The executive vice president and chief pharmacy officer of Ontario Pharmacists Association also said Thursday there was not enough evidence to avoid the common painkiller but nevertheless suggested concerned patients use acetaminophen instead. "It\'s sometimes good to err on the side of caution because we can\'t disapprove what that statement was," said Allan Malek, referring to a weekend tweet from France\'s health ministry that sparked the controversy. "Because there is another alternative — acetaminophen, which are the Tylenol-based products — that would be a good alternative in terms of treating fever and pain that may come along with positive symptoms of COVID-19.
I compared the contents of about 40 docs in the WET index vs. the WARC index. There is no significant difference between the two indexes built from the two formats.
Some minor differences observed:
- \xa0 was removed in the WET index (the WARC index kept it).
- \n was kept in the WET index (the WARC index removed it).
- <img> was kept in the WARC index, and the WET index removed it.
Hard to tell which one is better. We will move forward with the WARC index for now to keep the process more straightforward.
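A comparison like this can be made less sensitive to the whitespace differences listed above by normalizing both texts first. The following is a minimal Python sketch, not the script actually used for the comparison; `normalize` and `similarity` are hypothetical helper names, and leftover `<img>` markup is deliberately not handled.

```python
import difflib


def normalize(text: str) -> str:
    """Turn non-breaking spaces into plain spaces and collapse every
    whitespace run (including newlines) into a single space."""
    return " ".join(text.replace("\xa0", " ").split())


def similarity(warc_text: str, wet_text: str) -> float:
    """Similarity ratio in [0, 1] between the two normalized texts."""
    matcher = difflib.SequenceMatcher(None, normalize(warc_text), normalize(wet_text))
    return matcher.ratio()


# Two strings that differ only in \xa0 vs \n compare as identical.
print(similarity("No concrete\xa0evidence", "No concrete\nevidence"))
```

A ratio close to 1.0 after normalization would suggest that the remaining differences between the two indexes are only whitespace-level.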
Sounds good @MXueguang - thanks for your contribution!
Hi! I am sorry to reopen this issue, but I have a question about CommonCrawlWarcCollection. I have just indexed the TREC 2020 Health Misinformation Track dataset using the following command:
nohup sh target/appassembler/bin/IndexCollection -collection CommonCrawlWarcCollection -input ~/TREC_files -index indexes/lucene-index.ccwc_full.pos+docvectors+raw -generator DefaultLuceneDocumentGenerator -threads 44 -storePositions -storeDocvectors -storeRaw >& logs/log.ccwc_full.pos+docvectors+rawdocs &
Is there a way to recover the WARC-Target-URI field using a doc id after this indexing process? I would appreciate any help.
Marcos
Hi @MarcosFP977,
I am afraid that we don't keep the WARC-Target-URI during indexing in the current version, but we can certainly add this. I will try to add it and get back to you in one or two days.
Xueguang
Hi! I think it could be a helpful field to retrieve the original source of crawled news. If you need any help, do not hesitate to ask. Marcos
Hi Marcos,
Sorry for the slightly late reply.
We have updated the newest master branch: warc_date and warc_url are now added to the fields of the documents.
You can now get the date and url by:
hits[0].lucene_document.get("url")
(I am assuming you are using pyserini to retrieve.)
No worries :) You are right I am using Pyserini. Thanks for your response and for helping with this.
Hi,
I am sorry for bothering you again, but I was not able to index warc_date and warc_url. This is the command that I was using for the indexing process:
nohup sh target/appassembler/bin/IndexCollection -collection CommonCrawlWarcCollection -input ~/TREC_files -index indexes/TREC_BM25 -generator DefaultLuceneDocumentGenerator -threads 44 -storePositions -storeDocvectors -storeRaw >& logs/log.TREC.BM25 &
I guess that I might be using the wrong collection or missing some parameter, since when I search with Pyserini, doc.lucene_document().get("url") does not return anything.
Thanks,
Marcos
Please use WarcGenerator instead of DefaultLuceneDocumentGenerator.
I forgot to mention that. Apologies!
i.e.
nohup sh target/appassembler/bin/IndexCollection -collection CommonCrawlWarcCollection -input ~/TREC_files -index indexes/TREC_BM25 -generator WarcGenerator -threads 44 -storePositions -storeDocvectors -storeRaw >& logs/log.TREC.BM25 &
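Once the index has been rebuilt with WarcGenerator, looking up the URL for a given doc id from Pyserini could look roughly like the sketch below. `url_for_docid` is a hypothetical helper. It assumes `searcher.doc(docid)` returns `None` for an unknown doc id and that the stored field is named "url", as in the snippets earlier in this thread.

```python
def url_for_docid(searcher, docid, field="url"):
    """Return the stored URL field for docid, or None if it is unavailable.

    Assumptions: `searcher` behaves like pyserini's SimpleSearcher, and the
    index was built with -generator WarcGenerator so the field is stored.
    """
    doc = searcher.doc(docid)
    if doc is None:
        return None  # unknown doc id
    # get() returns None when the field was not stored, e.g. for an index
    # built with DefaultLuceneDocumentGenerator.
    return doc.lucene_document().get(field)


# Usage against a real index (not executed here):
# from pyserini.search import SimpleSearcher
# searcher = SimpleSearcher('indexes/TREC_BM25')
# hits = searcher.search('ibuprofen COVID-19')
# print(url_for_docid(searcher, hits[0].docid))
```

A `None` result therefore means either that the doc id is unknown or that the index does not store the field, which matches the symptom described above.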
No worries! Thank you
The collection for the TREC 2020 Health Misinformation Track appears to be Common Crawl WARCs.
ClueWeb12 is also distributed as WARCs: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/ClueWeb12Collection.java
See if we can adapt the code there? Or maybe Common Crawl has its own APIs?
Call this CommonCrawlCollection.