coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Migrate document selection from ES Queries to CoherenceBot #13

Closed PeterCiuffetti closed 3 years ago

PeterCiuffetti commented 3 years ago

In CoherenceBot V0, we wrote all documents (including HTML) to Elasticsearch and then selected certain PDFs for publishing using ES queries developed by Carolina.

This task is to migrate that selection to a configurable keyword match list applied by an index filter inside CoherenceBot.

Also explore whether this implementation can accept an update to the selection criteria without restarting the bot.

Log the selections made, so that we can report on how many documents are being filtered by the selection criteria.

PeterCiuffetti commented 3 years ago

This is implemented in the plugin index-criteria.

Its config accepts a list of key phrases to reject and the field name(s) to check. It runs as the last index filter, and it does its checking after title and title_english have been selected, so there is no need to check heading and anchor (although these fields can be searched if you need to).

I obtained all the "must not" matches from the queries used in the first two crawls for Africa and DTT.

Initial filter criteria

This resulted in the following filter criteria.

title,title_english=acceptedcomputerprogram
title,title_english=accreditation
title,title_english=advisory board
title,title_english=appel à candidature
title,title_english=assembly
title,title_english=banner
title,title_english=call for
title,title_english=communiqué de presse
title,title_english=curriculum vitae
title,title_english=decision
title,title_english=demande de bourse
title,title_english=draft
title,title_english=executive board
title,title_english=expression of interest
title,title_english=financial statements
title,title_english=guide des services
title,title_english=in the news
title,title_english=job description
title,title_english=junta ejecutiva
title,title_english=list of faculty contacts
title,title_english=listofdocuments
title,title_english=newsletter
title,title_english=obituary
title,title_english=preparations
title,title_english=press release
title,title_english=recent happenings
title,title_english=seminar
title,title_english=status list
title,title_english=summer session
title,title_english=talking points
title,title_english=taller
title,title_english=tax return
title,title_english=technical evaluation
title,title_english=terms of reference
title,title_english=web policy
title,title_english=weekly digest
title,title_english=workshop

Format

field=phrase to check in one field
field1,field2=phrase to check in these two fields

Any number of comma-separated fields to check can be provided. Only string fields can be searched. Numeric and Date fields are not supported. Boolean operations are not supported. Some of these operations can be considered for future enhancements, but this should get us started.
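
To make the format concrete, here is a minimal sketch of how such lines could be parsed. It is written in Python purely for illustration and is not the plugin's code; the names `Rule`, `parse_rule` and `parse_rules` are hypothetical, and splitting on the first `=` and skipping blank lines are assumptions rather than documented behaviour.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    fields: List[str]  # field names to search
    phrase: str        # key phrase that causes a rejection


def parse_rule(line: str) -> Rule:
    """Parse one 'field1,field2=phrase' line into a Rule."""
    # Assumption: split on the first '=' only, so a phrase may contain '='.
    field_part, sep, phrase = line.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in rule: {line!r}")
    fields = [f.strip() for f in field_part.split(",") if f.strip()]
    phrase = phrase.strip()
    if not fields or not phrase:
        raise ValueError(f"rule needs a field and a phrase: {line!r}")
    return Rule(fields, phrase)


def parse_rules(text: str) -> List[Rule]:
    """Parse a block of rule lines, skipping blank lines (an assumption)."""
    return [parse_rule(ln) for ln in text.splitlines() if ln.strip()]
```

For example, `parse_rule("title,title_english=press release")` yields a rule with the two field names and the phrase `press release`.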

Field names you can check

anchor
content 
domain
heading
host
lang
organization.city
organization.country
organization.country_code
organization.id
organization.name 
organization.region
organization.type
outlinks 
summary
thumbnail
thumbnail.url_archive
title
title_algorithm 
title_english
type
url

Matching Algorithm

Here is the matching algorithm. Because we are not exploiting the full power of Elasticsearch, I needed to implement basic text analysis and normalization on both the fields we are testing and the phrases we are looking for. The following steps are applied to both.

  1. The content is converted to lower case (in a Unicode-aware fashion).
  2. Punctuation is removed.
  3. Whitespace is normalized (repeated spaces and/or newlines are collapsed into a single space).
  4. If a field is multi-valued, each value is checked independently.

If any one of the provided key phrases matches the content of any one of the fields it searches, the document is rejected.
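
The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the plugin's implementation: `normalize` and `is_rejected` are hypothetical names, "punctuation" is taken to mean the Unicode punctuation categories, leading and trailing whitespace is trimmed, and a "match" is assumed to be substring containment after normalization. None of these details are confirmed above.

```python
import re
import unicodedata
from typing import Dict, Iterable, List, Tuple, Union

FieldValue = Union[str, List[str]]


def normalize(text: str) -> str:
    # Step 1: Unicode-aware lower-casing.
    text = text.lower()
    # Step 2: drop characters in any Unicode punctuation category (P*).
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # Step 3: collapse runs of whitespace into one space (trim is assumed).
    return re.sub(r"\s+", " ", text).strip()


def is_rejected(
    doc: Dict[str, FieldValue],
    rules: Iterable[Tuple[List[str], str]],
) -> bool:
    """Return True if any phrase matches any value of any listed field."""
    for fields, phrase in rules:
        needle = normalize(phrase)
        if not needle:
            continue  # an empty phrase would otherwise match everything
        for name in fields:
            value = doc.get(name)
            if value is None:
                continue
            # Step 4: check each value of a multi-valued field on its own.
            values = [value] if isinstance(value, str) else value
            # Assumption: a match means substring containment.
            if any(needle in normalize(v) for v in values):
                return True
    return False
```

With the rule `(["title", "title_english"], "press release")`, a document titled `PRESS  RELEASE: Annual Report` would be rejected under these assumptions, while one titled `Annual Report` would pass.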