Moving node_text_length and children_counter to their own submodule will hopefully do two things:
Standardize - the protocol is simple:
input: node (lxml.html.HtmlElement and sometimes *._ElementTree)
output: quantity or quantities (numerical or collections.Counter)
Modularize - the bigger goal of making this a truly functional lib is still pretty unclear, but this branch should make it simpler for users to build custom pipelines.
Consider:
from libextract.html.tabular import children_counter
from libextract.html.article import node_text_length
vs.
from libextract.quantifiers import children_counter, text_length
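To make the protocol concrete, here is a rough sketch of what a quantifier could look like. The function bodies and the quantify helper are assumptions for illustration, not the code in this branch, and xml.etree.ElementTree stands in for lxml.html so the snippet runs without extra dependencies (lxml elements support the same itertext() and child iteration).

```python
from collections import Counter
import xml.etree.ElementTree as ET  # stand-in for lxml.html in this sketch


def text_length(node):
    """Quantifier: node -> int (length of whitespace-normalized text)."""
    # Assumed behavior; the real implementation may count differently.
    return len(" ".join("".join(node.itertext()).split()))


def children_counter(node):
    """Quantifier: node -> Counter of direct-child tag names."""
    return Counter(child.tag for child in node)


def quantify(root, quantifier):
    """Hypothetical pipeline step: pair every element with its quantity."""
    return [(node, quantifier(node)) for node in root.iter()]


doc = ET.fromstring(
    "<div><p>Hello world</p><table><tr/><tr/><tr/></table></div>"
)

print(text_length(doc))                     # numerical output
print(children_counter(doc.find("table")))  # collections.Counter output
print(len(quantify(doc, text_length)))      # one pair per element
```

Because every quantifier takes a node and returns a quantity, a step like quantify can accept any of them interchangeably, which is the point of putting them behind one import path.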
Partially addresses issue #1