VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Question about Page Classifier Threshold #174

Closed by pkoloveas 5 years ago

pkoloveas commented 5 years ago

Is there a way to change the accepted threshold for the page classification on the focused crawl or does it entirely depend on the Page Classifier model?

I use the SMILE classifier, and among my extracted pages I see some with a score of 0.4 classified as relevant. I would like to store only the pages that have a score of at least 0.6-0.65. Is there a way to do this through the ache.yml file?

The following are the only parameters that I've set in the ache.yml file regarding the link storage.


link_storage.max_pages_per_domain: 100
link_storage.link_strategy.use_scope: false
link_storage.link_strategy.outlinks: true
link_storage.link_classifier.type: LinkClassifierBaseline
link_storage.online_learning.enabled: false
link_storage.link_selector: TopkLinkSelector

aecio commented 5 years ago

You can adjust the threshold using the pageclassifier.yml file. You just need to set the parameter relevance_threshold as follows:

type: smile
parameters:
  features_file: pageclassifier.features
  model_file: pageclassifier.model
  relevance_threshold: 0.65

In this example, relevance_threshold indicates that only pages with a score higher than 0.65 are considered relevant.
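
To make the effect of the threshold concrete, here is a minimal Python sketch of the filtering idea. It is not ACHE's code: the function names, the example URLs, and the scores are all hypothetical, and the strict "higher than" comparison is an assumption taken from the wording above rather than from the implementation.

```python
from typing import Dict, List


def is_relevant(score: float, relevance_threshold: float = 0.65) -> bool:
    # Assumption: a strict comparison ("higher than"), following the
    # wording of the explanation above, not ACHE's actual source code.
    return score > relevance_threshold


def filter_relevant(scores: Dict[str, float],
                    relevance_threshold: float = 0.65) -> List[str]:
    # Keep only the URLs whose classifier score passes the threshold.
    return [url for url, score in scores.items()
            if is_relevant(score, relevance_threshold)]


# Hypothetical classifier scores, for illustration only.
example_scores = {
    "http://example.com/a": 0.40,
    "http://example.com/b": 0.62,
    "http://example.com/c": 0.71,
}

kept = filter_relevant(example_scores)
print(kept)
```

With a threshold of 0.65, the page scored 0.40 that prompted the question would be dropped, and so would the one scored 0.62; only the 0.71 page is kept.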

aecio commented 5 years ago

I updated the documentation page to include this information: https://ache.readthedocs.io/en/latest/page-classifiers.html#pageclassifier-smile. I'm closing this issue, but please let us know or re-open it if there is still any issue.

pkoloveas commented 5 years ago

Thank you, that's exactly what I needed. Also, thanks for updating the documentation.