datalib / libextract

Extract data from websites using basic statistical magic
MIT License

Quantifiers #15

Closed (rodricios closed this 9 years ago)

rodricios commented 9 years ago

Moving node_text_length and children_counter to their own submodule will hopefully do two things:

  1. Standardize - protocol is simple:
    • input: node (lxml.html.HtmlElement and sometimes *._ElementTree)
    • output: quantity or quantities (numerical or collections.Counter)
  2. Modularize - the bigger goal of creating a truly functional lib is still pretty unclear, but this branch should make it simpler for users to build custom pipelines, since every quantifier is imported from one place.

Consider:

from libextract.html.tabular import children_counter
from libextract.html.article import node_text_length

vs.

from libextract.quantifiers import children_counter, text_length
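To make the protocol above concrete (node in, quantity out), here is a rough sketch of what two such quantifiers could look like. The function bodies are assumptions written for illustration, not the code in this branch, and xml.etree.ElementTree is used only as a stand-in so the snippet runs without lxml installed; lxml elements expose the same `.text`, `.tag`, and child-iteration interface used here.

```python
from collections import Counter
import xml.etree.ElementTree as ET  # stand-in for lxml in this sketch


def text_length(node):
    """Hypothetical quantifier: length of the node's own text, stripped."""
    # node.text is None when the element has no leading text.
    return len((node.text or "").strip())


def children_counter(node):
    """Hypothetical quantifier: count direct children by tag name."""
    # Iterating an element yields its direct children only.
    return Counter(child.tag for child in node)


root = ET.fromstring(
    "<div><p>Hello world</p><table><tr/><tr/><tr/></table></div>"
)
para, table = root  # the two direct children of <div>

print(text_length(para))        # 11
print(children_counter(table))  # Counter({'tr': 3})
```

Because both functions share the same shape (one node argument, one numeric or Counter result), a pipeline step can accept any of them interchangeably.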

Partially addresses issue #1

eugene-eeo commented 9 years ago

Man, this is good. I'll review the code before merging.