coherentdigital / coherencebot

Apache Nutch is an extensible and scalable web crawler
https://nutch.apache.org/
Apache License 2.0

Remove file extension and common prefixes from titles #1

Open avorio opened 3 years ago

avorio commented 3 years ago

Examples:

AED_L.Amer_COVER 5.0.qxd
Microsoft Word - Childcare_Ghana_Aug02.doc

I suggest that we remove any trailing three- or four-letter file extension, i.e. anything ending in ".abc" or ".abcd".

I'd also remove prefixes like:

Microsoft Word - 

PeterCiuffetti commented 3 years ago

This has been fixed, though I will check during the first round of CoherenceBox V1 crawling whether it needs to recognize other prefixes.

As for suffix removal, I cover every practical text input extension type. I also handle cases where a query string follows the extension, something the earlier code did not consider.
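To make the suffix handling concrete, here is a minimal sketch in Python (not the plugin's actual code). The extension list and the name `strip_extension` are assumptions for illustration only; the real list used by IndexCriteria is not given in this thread.

```python
import re

# Hypothetical extension list; the plugin's real list is not shown in this issue.
KNOWN_EXTENSIONS = ("pdf", "doc", "docx", "qxd", "rtf", "txt",
                    "ppt", "pptx", "xls", "xlsx")

# Match a known extension at the end of the title, optionally followed
# by a query string such as "?download=1".
_EXTENSION_RE = re.compile(
    r"\.(?:" + "|".join(KNOWN_EXTENSIONS) + r")(?:\?\S*)?\s*$",
    re.IGNORECASE,
)


def strip_extension(title: str) -> str:
    """Remove a trailing known file extension (and any query string) from a title."""
    return _EXTENSION_RE.sub("", title).strip()
```

Under these assumptions, `strip_extension("AED_L.Amer_COVER 5.0.qxd")` gives `"AED_L.Amer_COVER 5.0"`. Because only listed extensions match, a title such as `"Version 2.0"` is left alone, which a blanket "strip anything after the last dot" rule would not guarantee.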

Prefix noise is mainly covered by the existing IndexReplace plugin, which applies a series of find-and-replace regex patterns.
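As a rough illustration of the find-and-replace idea, the sketch below applies an ordered list of regex rules to a title. It is written in Python for brevity and does not reproduce the IndexReplace configuration format; the rule list and the name `remove_prefix_noise` are hypothetical.

```python
import re

# Hypothetical rules: (pattern, replacement) pairs applied in order.
# The real patterns live in the IndexReplace configuration, not shown here.
PREFIX_RULES = [
    (re.compile(r"^\s*Microsoft Word\s*-\s*", re.IGNORECASE), ""),
    (re.compile(r"^\s*Microsoft PowerPoint\s*-\s*", re.IGNORECASE), ""),
]


def remove_prefix_noise(title: str) -> str:
    """Apply each find-and-replace rule to the title in sequence."""
    for pattern, replacement in PREFIX_RULES:
        title = pattern.sub(replacement, title)
    return title
```

With these assumed rules, `remove_prefix_noise("Microsoft Word - Childcare_Ghana_Aug02.doc")` gives `"Childcare_Ghana_Aug02.doc"`; the extension is deliberately left in place at this stage.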

Over time we could also handle additional extension removal using regexes, but because the list of extensions is rather limited and predictable, it was easier to handle this with code in the IndexCriteria plugin than with regexes read from the IndexReplace config.

One implication of using two techniques for title cleanup (IndexReplace and IndexCriteria) is that the scrubbing happens in two different phases of the index pipeline. IndexReplace runs early in the indexing phase; IndexCriteria runs last. (IndexCriteria's main job is deciding whether to keep the PDFs we've found.) By delaying some of the title cleanup to the end, the IndexCriteria plugin could use these extension clues as another factor in deciding whether to keep or reject a document. It doesn't use them in its current logic, but it has the option, since the clues are still there until the end.