Schwittleymani / ECO

Electronic Chaos Oracle
https://schwittlick.net/eco
Apache License 2.0

Good recursive html, text parser #175

Open transfluxus opened 7 years ago

transfluxus commented 7 years ago

Started long ago; different repo? Recursive crawling with status JSON files about the recursive processing of a folder, plus HTML download, needs a proper text grab that doesn't create duplicates, filters nicely, and maybe also creates labels from header tags, ...
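A minimal sketch of the "labels from header tags" idea, using only the standard library's html.parser; the class and function names here are my own assumptions, not something from the repo:

```python
from html.parser import HTMLParser


class HeaderLabelExtractor(HTMLParser):
    """Collects the text of <h1>..<h6> tags as candidate labels."""

    HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.labels = []       # collected header texts
        self._in_header = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._in_header = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_header:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADER_TAGS and self._in_header:
            self._in_header = False
            label = "".join(self._buffer).strip()
            if label:
                self.labels.append(label)


def header_labels(html):
    """Returns the header texts of an HTML string, in document order."""
    parser = HeaderLabelExtractor()
    parser.feed(html)
    return parser.labels
```

Something like this could run on each downloaded page and store the labels next to the status JSON files.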

schwittlick commented 7 years ago

This module of the pattern library looks good and does exactly what you are talking about: http://www.clips.ua.ac.be/pages/pattern-web#plaintext

There's even a proper crawler in that module that grabs the text of all linked websites.

Why don't you set up a test to crawl a couple of websites and document the outcome here? I think putting a command-line-usable module for crawling websites in here would be good for now: https://github.com/mrzl/ECO/tree/master/src/python/pdf2text

I could imagine a tool that works like this:

cd src/python/pdf2text/
workon pdf2text # contains pattern.en already and is a python2.7 venv
python html_grab.py --method grab_single/crawl --url url_to_download --output_path path_to_folder
# outputting limited logs about which url is being crawled, where it's saved and how many lines of text that url contained
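A standard-library sketch of what such an html_grab.py could look like. The argparse flags mirror the invocation above; the text extraction is a crude stand-in for pattern's plaintext(), and all helper names (TextExtractor, grab_single, main) are my assumptions:

```python
import argparse
import os
import re
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Crude stand-in for pattern.web's plaintext(): keeps text nodes, skips script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Returns the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)


def grab_single(url, output_path):
    """Downloads one URL, extracts its text and saves it as a .txt file."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    text = extract_text(html)
    name = re.sub(r"\W+", "_", url) + ".txt"
    target = os.path.join(output_path, name)
    with open(target, "w", encoding="utf-8") as f:
        f.write(text)
    # limited log: which url, where it's saved, how many lines it contained
    print("%s -> %s (%d lines)" % (url, target, text.count("\n") + 1))


def main(argv=None):
    # invoke as: main(["--url", "https://example.com", "--output_path", "."])
    ap = argparse.ArgumentParser(description="Grab plain text from a URL.")
    ap.add_argument("--method", choices=["grab_single", "crawl"], default="grab_single")
    ap.add_argument("--url", required=True)
    ap.add_argument("--output_path", default=".")
    args = ap.parse_args(argv)
    grab_single(args.url, args.output_path)  # crawl mode would recurse over links
```

This is only a skeleton; the crawl method and duplicate filtering from the issue description would still need to be added.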

It would be good to use the TextParser I wrote in that same module directly, in order to separate good lines from faulty ones.

Just throw one long string into the parse() function of the TextParser: https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/batch_postprocess_text.py#L70

Afterwards you can access the valid sentences like this: https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/batch_postprocess_text.py#L77 and the invalid sentences like this: https://github.com/mrzl/ECO/blob/master/src/python/pdf2text/batch_postprocess_text.py#L82
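I don't have the TextParser source in front of me, so here is only an illustrative heuristic of the valid/invalid split described above; the sentence splitter, thresholds, and the parse() signature are all made up for the sketch:

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_valid(sentence, min_words=3, min_alpha_ratio=0.7):
    """Heuristic: enough words, and mostly alphabetic characters."""
    words = sentence.split()
    if len(words) < min_words:
        return False
    stripped = sentence.replace(" ", "")
    alpha = sum(c.isalpha() for c in stripped)
    return bool(stripped) and alpha / len(stripped) >= min_alpha_ratio


def parse(text):
    """Splits text into sentences and sorts them into (valid, invalid) lists."""
    valid, invalid = [], []
    for s in split_sentences(text):
        (valid if is_valid(s) else invalid).append(s)
    return valid, invalid
```

The real TextParser presumably applies more filters than this, but the two-list output is the shape the linked lines suggest.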

schwittlick commented 7 years ago

Maybe test with Reddit: #154

transfluxus commented 7 years ago

Was getting into Beautiful Soup till now; gonna check if pattern is smarter. Soup is quite raw/basic. Reddit: OK, good for slang... I was going for sacred-texts.com and the English Gutenberg Project.

skeshavar commented 7 years ago

Is it possible to work on this issue? I am interested in it.

transfluxus commented 7 years ago

@skeshavar Our target was sacred-text.org, which has a strange href format. In general, pandas or https://pypi.python.org/pypi/html2text do an OK job.