datalib / libextract

Extract data from websites using basic statistical magic
MIT License
Refactor to 10 liner #33

Closed by rodricios 9 years ago

rodricios commented 9 years ago

Dropped a lot of code. Going towards extracting only tabular (repetitive) data.

eugene-eeo commented 9 years ago

I like that you used partials instead of closures to configure the pipeline. :+1: for that, since it preserves docstrings and doesn't obfuscate the REPL output. As for the "number extraction" question you posed, the best approach seems to be regexes:

>>> import re
>>> u = re.findall(r'\b((\d+)(\.){,1}(\d+))\b', '$30.00')
>>> [k[0] for k in u]
['30.00']
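
One caveat with the pattern above: it requires at least two digits, so a bare `5` is not matched. A minimal sketch of a variant with an optional fractional part follows. The names `NUMBER` and `extract_numbers` are hypothetical and not part of libextract.

```python
import re

# Hypothetical pattern (an assumption, not libextract's own):
# an integer part followed by an optional fractional part.
NUMBER = re.compile(r'\b\d+(?:\.\d+)?\b')


def extract_numbers(text):
    """Return every integer or decimal literal found in text, as strings."""
    # The only group is non-capturing, so findall returns whole matches.
    return NUMBER.findall(text)


print(extract_numbers('$30.00 for 5 items'))  # ['30.00', '5']
```

This still splits numbers written with thousands separators, such as `1,000`, into two matches, so it is only a starting point.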
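
On the partials point, here is a small sketch of why `functools.partial` is friendlier to introspection than a closure. `top_n` and `make_top_n` are made-up stand-ins for a pipeline step, not functions from libextract.

```python
from functools import partial


def top_n(items, n=5):
    """Return the first n items."""
    return items[:n]


def make_top_n(n):
    # Closure version: the configured value is hidden inside `inner`.
    def inner(items):
        return items[:n]
    return inner


via_partial = partial(top_n, n=2)
via_closure = make_top_n(2)

# Both behave the same when called.
print(via_partial([1, 2, 3]), via_closure([1, 2, 3]))  # [1, 2] [1, 2]

# The partial still exposes the wrapped function and its configuration.
print(via_partial.func.__doc__)  # Return the first n items.
print(via_partial.keywords)      # {'n': 2}

# The closure has no docstring unless one is written on `inner`.
print(via_closure.__doc__)       # None
```

In the REPL, the partial's repr names `top_n` and shows `n=2`, while the closure appears only as `make_top_n.<locals>.inner`.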