Write ingester for TREC 2020 Health Misinformation Track

lintool commented 4 years ago

The collection for the TREC 2020 Health Misinformation Track appears to be Common Crawl WARCs.

ClueWeb12 is also distributed as WARCs: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/ClueWeb12Collection.java

Can we adapt the code there? Or maybe Common Crawl has its own APIs?

Call this CommonCrawlCollection.

MXueguang commented 4 years ago

There are some differences between ClueWeb12's WARCs and CommonCrawl's WARCs.

The WARC header fields differ. The biggest difference is that ClueWeb12 and ClueWeb09 use "WARC-TREC-ID" as the doc id; here we should use "WARC-Record-ID" instead.
CommonCrawl contains WARC request records, while ClueWeb does not.
- a content given by CW is:
- a content given by CC is:

ClueWeb's WARC example:

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-10T21:51:20Z
WARC-TREC-ID: clueweb12-0000tw-00-00013
WARC-Target-URI: http://cheapcosthealthinsurance.com/2012/01/25/what-is-hiv-aids/
WARC-Payload-Digest: sha1:YZUOJNSUMFG3JVUKM6LBHMRMMHWLVNQ4
WARC-IP-Address: 100.42.59.15
WARC-Record-ID: <urn:uuid:74edc71e-a881-4942-81fc-a40db4bf1fb9>
Content-Type: application/http; msgtype=response
Content-Length: 71726

HTTP/1.1 200 OK
Date: Fri, 10 Feb 2012 21:51:22 GMT
Server: Apache/2.2.21 (Unix) mod_ssl/2.2.21 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_jk/1.2.32
X-Powered-By: PHP/5.2.17
X-Pingback: http://cheapcosthealthinsurance.com/xmlrpc.php
Link: <http://cheapcosthealthinsurance.com/?p=711>; rel=shortlink
Connection: close
Content-Type: text/html; charset=UTF-8

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!--
Health and Insurance uses HeatMap Ads Theme Pro v5.0 (http://heatmaptheme.com)
-->

CommonCrawl's WARC example:

WARC/1.0
WARC-Type: request
WARC-IP-Address: 145.239.193.102
WARC-Record-ID: <urn:uuid:5f5acc82-6690-4d55-b28f-04c8c5fc197d>
Content-Length: 369
WARC-Date: 2020-01-01T02:39:37Z
WARC-Target-URI: https://www.telez.fr/actus-tv/demain-nous-appartient-en-avance-resume-de-lepisode-629-de-mercredi-1er-janvier/
Content-Type: application/http; msgtype=request
WARC-Block-Digest: sha1:YANR5P3U526KQCWHA7DCDXZPGOWNGYQC

GET /actus-tv/demain-nous-appartient-en-avance-resume-de-lepisode-629-de-mercredi-1er-janvier/ HTTP/1.1
User-Agent: CCBot/3.0 (http://commoncrawl.org/faq/; info@commoncrawl.org)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Host: www.telez.fr
Connection: Keep-Alive
Accept-Encoding: gzip

WARC/1.0
WARC-Record-ID: <urn:uuid:e8461a92-520d-4aeb-9a69-f127b2f90d9d>
Content-Length: 186029
WARC-Date: 2020-01-01T02:39:37Z
WARC-Type: response
WARC-IP-Address: 145.239.193.102
WARC-Target-URI: https://www.telez.fr/actus-tv/demain-nous-appartient-en-avance-resume-de-lepisode-629-de-mercredi-1er-janvier/
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:ETUGLKV76EZAANXPJE7JREOBURSQ2KAT
WARC-Block-Digest: sha1:MKV4YOIII6TECXFHICCHUW5WWFHJO75T

HTTP/1.1 200 OK
Server: nginx
Date: Wed, 01 Jan 2020 02:39:37 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Content-Length: 38517
Content-Length: 185409
Connection: keep-alive
Link: <https://www.telez.fr/wp-json/>; rel="https://api.w.org/"
Link: <https://www.telez.fr/?p=4923535>; rel=shortlink
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Crawler-Content-Encoding: gzip
X-UA-Device: 
X-Proxy: front2
X-Cacheable: YES:Forced
X-Varnish: 119030879 132402322
Age: 28706
Via: 1.1 varnish (Varnish/5.0)
Vary: Accept-Encoding
Accept-Ranges: bytes

<!doctype html>
<!--[if lt IE 7 ]>
<html class="no-js ie lte-ie9 lte-ie8 lte-ie7 ie6" lang="fr-FR"> <![endif]-->
<!--[if IE 7 ]>
<html class="no-js ie lte-ie9 lte-ie8 lte-ie7 ie7" lang="fr-FR"> <![endif]-->
<!--[if IE 8 ]>
<html class="no-js ie lte-ie9 lte-ie8 ie8" lang="fr-FR"> <![endif]-->
<!--[if IE 9 ]>
<html class="no-js ie lte-ie9 ie9" lang="fr-FR"> <![endif]-->
<!--[if !(IE)]><! -->
<html class="fonts-loading no-js" lang="fr-FR"><!--<![endif]-->
<head>
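
To make the two differences above concrete, here is a minimal Python sketch. It is not Anserini's Java implementation; the helper names `parse_warc_headers` and `doc_id` are hypothetical, and the header blocks are trimmed copies of the examples above. It assumes that only response records should be indexed and that WARC-Record-ID is the fallback id when WARC-TREC-ID is absent.

```python
def parse_warc_headers(header_block: str) -> dict:
    """Parse the 'Name: value' lines of one WARC record header into a dict."""
    headers = {}
    for line in header_block.strip().splitlines():
        if line.startswith("WARC/"):  # version line, e.g. "WARC/1.0"
            continue
        # Split on the first colon only, so values such as
        # "<urn:uuid:...>" keep their own colons.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


def doc_id(headers: dict):
    """Return a doc id for response records, or None for any other record type."""
    if headers.get("WARC-Type") != "response":
        return None  # assumption: CommonCrawl request records are skipped
    # ClueWeb records carry WARC-TREC-ID; the CommonCrawl records shown
    # above only have WARC-Record-ID, so fall back to that.
    return headers.get("WARC-TREC-ID") or headers.get("WARC-Record-ID")


# Trimmed header blocks taken from the examples in this thread.
CLUEWEB_RESPONSE = """WARC/1.0
WARC-Type: response
WARC-TREC-ID: clueweb12-0000tw-00-00013
WARC-Record-ID: <urn:uuid:74edc71e-a881-4942-81fc-a40db4bf1fb9>"""

CC_REQUEST = """WARC/1.0
WARC-Type: request
WARC-Record-ID: <urn:uuid:5f5acc82-6690-4d55-b28f-04c8c5fc197d>"""

CC_RESPONSE = """WARC/1.0
WARC-Record-ID: <urn:uuid:e8461a92-520d-4aeb-9a69-f127b2f90d9d>
WARC-Type: response"""

for block in (CLUEWEB_RESPONSE, CC_REQUEST, CC_RESPONSE):
    print(doc_id(parse_warc_headers(block)))
```

Filtering on WARC-Type before picking an id means request records never produce a document, which is the simplest way to handle the extra records in CommonCrawl.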

lintool commented 4 years ago

Hey @MXueguang I just merged https://github.com/castorini/anserini/pull/1260

We should compare WET (from that PR) w/ WARC.

lintool commented 4 years ago

Hi @MXueguang are we ready to close this?

ronakice commented 4 years ago

@MXueguang shared some preliminary results with me; I'll let him share them here. I feel like we are done with this bit, since indexing/searching seemingly works.

MXueguang commented 4 years ago

A search example using the example topic from https://trec-health-misinfo.github.io:

>>> from pyserini.search import SimpleSearcher
>>> searcher = SimpleSearcher('/store/collections/trec-misinfo/warc_index/')
>>> hits = searcher.search('ibuprofen COVID-19')
>>> for i in range(0, 10):
...     print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
...
 1 <urn:uuid:a9a35d9b-989e-4a02-b933-70ad24b9e712> 9.51370
 2 <urn:uuid:d98bdbaa-b5fa-4e39-b3a7-e4d6269219f4> 9.51370
 3 <urn:uuid:c17ed5a6-d96a-471b-b6c7-226192058b8f> 9.47510
 4 <urn:uuid:0b395925-b0b4-45c7-9158-34e91b9e9c64> 9.47280
 5 <urn:uuid:f14fae55-1054-4d61-bda2-adca5ef7b43d> 9.45360
 6 <urn:uuid:edff9bcf-2130-4c96-8aed-fba1c64ffc17> 9.44440
 7 <urn:uuid:998efb65-ca44-4c26-bec5-c0d2c962ca75> 9.41200
 8 <urn:uuid:c620dfe6-c9b8-45a2-bfee-969f5b1151f4> 9.40180
 9 <urn:uuid:9e829fd6-712f-4249-b08c-ae3225e52647> 9.39590
10 <urn:uuid:e8e5c59b-8709-473a-ab0e-e06461b6c77f> 9.38050

Part of hits[0].contents:

'No concrete evidence ibuprofen makes COVID-19 worse: health experts | TheSpec.com No concrete evidence ibuprofen makes COVID-19 worse: health experts Living 09:22 AM by Cassandra Szklarski The Canadian Press TORONTO — Canadian health officials are trying to calm fears that anti-inflammatory drugs such as ibuprofen can worsen COVID-19 symptoms by stressing the lack of concrete evidence. Debate over whether ibuprofen products such as Advil should be bypassed for acetaminophen medications including Tylenol continues to rage among many people confused by conflicting reports spreading online. Alberta\'s medical health office offered assurances Thursday on Twitter, stating "there is no strong evidence to indicate that ibuprofen could worsen COVID-19 symptoms beyond the usual known side effects." "Until more information is available, people may wish to take paracetamol/acetaminophen to treat COVID-19 symptoms, unless advised otherwise by their doctor," said the account, run by public health staff on behalf of chief medical officer Dr. Deena Hinshaw. The executive vice president and chief pharmacy officer of Ontario Pharmacists Association also said Thursday there was not enough evidence to avoid the common painkiller but nevertheless suggested concerned patients use acetaminophen instead. "It\'s sometimes good to err on the side of caution because we can\'t disapprove what that statement was," said Allan Malek, referring to a weekend tweet from France\'s health ministry that sparked the controversy. "Because there is another alternative — acetaminophen, which are the Tylenol-based products — that would be a good alternative in terms of treating fever and pain that may come along with positive symptoms of COVID-19.

MXueguang commented 4 years ago

I compared the contents of about 40 docs in the WET index vs. the WARC index. There is no significant difference between the indexes built from the two formats.

Some minor differences observed:

The WARC index keeps \xa0, while the WET index removes it.
The WARC index removes \n, while the WET index keeps it.
Some <img> tags are kept in the WARC index, while the WET index removes them.

Hard to tell which one is better. We will move forward with the WARC index for now to keep the process more straightforward.
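
For anyone repeating this comparison, here is a rough Python sketch of how the three differences listed above could be factored out before comparing two versions of a document. This is not the procedure used for the comparison above; the `normalize` helper and the sample strings are made up for illustration.

```python
import re


def normalize(text: str) -> str:
    """Remove the formatting differences seen between the WARC and WET contents."""
    # Treat non-breaking spaces (\xa0) as ordinary spaces.
    text = text.replace("\xa0", " ")
    # Drop leftover <img ...> tags.
    text = re.sub(r"<img\b[^>]*>", " ", text)
    # Collapse newlines and runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical snippets standing in for the same document in each index.
warc_like = "No concrete evidence\xa0ibuprofen <img src='x.png'> makes COVID-19 worse"
wet_like = "No concrete evidence ibuprofen\nmakes COVID-19 worse"

print(normalize(warc_like) == normalize(wet_like))
```

If the normalized strings still differ, the remaining difference is in the extracted text itself rather than in whitespace or tag handling.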

lintool commented 4 years ago

Sounds good @MXueguang - thanks for your contribution!

MarcosFP97 commented 4 years ago

Hi! Sorry to reopen this issue, but I have a question about CommonCrawlWarcCollection. I have just indexed the TREC 2020 Health Misinformation Track dataset using the following command:

nohup sh target/appassembler/bin/IndexCollection -collection CommonCrawlWarcCollection -input ~/TREC_files -index indexes/lucene-index.ccwc_full.pos+docvectors+raw -generator DefaultLuceneDocumentGenerator -threads 44 -storePositions -storeDocvectors -storeRaw >& logs/log.ccwc_full.pos+docvectors+rawdocs &

Is there a way to recover the WARC-Target-URI field using a doc id after this indexing process? I would appreciate any help. Marcos

MXueguang commented 4 years ago

Hi @MarcosFP97,

I am afraid we don't keep the WARC-Target-URI during indexing in the current version, but we can certainly add it. I will try to add it and get back to you in a day or two.

Xueguang

MarcosFP97 commented 4 years ago

Hi! I think it could be a helpful field to retrieve the original source of crawled news. If you need any help, do not hesitate to ask. Marcos

MXueguang commented 4 years ago

Hi Marcos,

Sorry for the slightly late reply. We have updated the newest master branch: warc_date and warc_url are now added to the document fields. You can now get the date and URL with hits[0].lucene_document.get("url") (I am assuming you are using Pyserini to retrieve).

MarcosFP97 commented 4 years ago

No worries :) You are right I am using Pyserini. Thanks for your response and for helping with this.

MarcosFP97 commented 4 years ago

Hi, I am sorry for bothering you again, but I was not able to index warc_date and warc_url. This is the command that I was using for the indexing process:

nohup sh target/appassembler/bin/IndexCollection -collection CommonCrawlWarcCollection -input ~/TREC_files -index indexes/TREC_BM25 -generator DefaultLuceneDocumentGenerator -threads 44 -storePositions -storeDocvectors -storeRaw >& logs/log.TREC.BM25 &

I guess I might be using the wrong collection or missing some parameter, since when I search with Pyserini, doc.lucene_document().get("url") does not return anything. Thanks, Marcos

MXueguang commented 4 years ago

Please use WarcGenerator instead of DefaultLuceneDocumentGenerator. I forgot to mention that. Apologies!

i.e. nohup sh target/appassembler/bin/IndexCollection -collection CommonCrawlWarcCollection -input ~/TREC_files -index indexes/TREC_BM25 -generator WarcGenerator -threads 44 -storePositions -storeDocvectors -storeRaw >& logs/log.TREC.BM25 &
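
Since the only change between the two commands is the -generator value, a small Python helper can make that switch explicit. This is just a sketch: `build_index_command` is a hypothetical helper, not part of Anserini or Pyserini, and it only reuses the flags and paths already shown in this thread.

```python
def build_index_command(input_dir, index_dir, generator="WarcGenerator", threads=44):
    """Assemble the IndexCollection argument list used in this thread.

    The generator defaults to WarcGenerator; per the comment above, use it
    instead of DefaultLuceneDocumentGenerator.
    """
    return [
        "sh", "target/appassembler/bin/IndexCollection",
        "-collection", "CommonCrawlWarcCollection",
        "-input", input_dir,
        "-index", index_dir,
        "-generator", generator,
        "-threads", str(threads),
        "-storePositions", "-storeDocvectors", "-storeRaw",
    ]


cmd = build_index_command("~/TREC_files", "indexes/TREC_BM25")
# Print a line that can be pasted into a shell (where "~" gets expanded).
print(" ".join(cmd))
```

The nohup wrapper and log redirection from the original command are left out here and can be added back when running the printed line in a shell.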

MarcosFP97 commented 4 years ago

No worries! Thank you

castorini / anserini

Write ingester for TREC 2020 Health Misinformation Track #1259