TeamHG-Memex/html-text
Extract text from HTML
MIT License
130 stars
24 forks
issues
Deprecate this repository and link to https://github.com/zytedata/html-text
#32
co-odw
opened
3 months ago
0
lxml.html.clean is now a separate project
#31
wRAR
opened
6 months ago
4
Consider switching from lxml's clean_html for enhanced security (and possibly performance)
#30
frenzymadness
opened
1 year ago
3
.extract_text returning incorrect format.
#29
hg0428
closed
3 years ago
0
Preserve space inside <pre> tags
#28
mitar
opened
3 years ago
0
add Python 3.8 to CI
#27
kmike
closed
4 years ago
1
extract_text fails with misleading error message when given bytes instead of unicode [py3]
#26
keturn
opened
4 years ago
2
guess_layout does not work on XHTML elements
#25
keturn
opened
4 years ago
1
extract_text does not work on lxml XHTML element
#24
keturn
opened
4 years ago
1
Blank lines created by <br> cannot be parsed correctly
#23
luyuhuang
opened
4 years ago
3
Run the doctests from README.rst as part of Parsel-enabled tests
#22
Gallaecio
closed
4 years ago
2
Support deploying to PyPI automatically from Travis CI
#21
Gallaecio
opened
4 years ago
1
Allow extracting alternative text
#20
Gallaecio
opened
4 years ago
1
Handle exceptions in Cleaner.clean_html
#19
whalebot-helmsman
closed
5 years ago
3
delete requirements_dev.txt file
#18
kmike
closed
5 years ago
2
Do not add spaces after new lines when not guessing punctuation space
#17
Gallaecio
closed
5 years ago
2
Don't always insert spaces around inline tags?
#16
lopuhin
opened
5 years ago
4
Remove parsel dependency
#15
kmike
closed
5 years ago
1
Fix webpage tests
#14
kmike
closed
5 years ago
1
declare Python 3.7 support
#13
kmike
closed
5 years ago
1
fix extraction from nodes without children
#12
kmike
closed
6 years ago
1
Add an option to guess page layout (try to preserve some of the formatting)
#11
kmike
closed
6 years ago
4
support unicode punctuation better
#10
kmike
opened
6 years ago
0
Add guess page layout
#9
Kebniss
closed
6 years ago
8
Use .//text() to extract text from selector
#8
lopuhin
closed
7 years ago
2
Allow passing a selector and extract text only from given selector
#7
lopuhin
closed
7 years ago
0
Handle non-breaking spaces and other special unicode characters
#6
lopuhin
opened
7 years ago
4
improve newline handling
#5
kmike
closed
6 years ago
1
img alt handling
#4
kmike
opened
7 years ago
0
button values?
#3
kmike
opened
7 years ago
1
Fix unwanted joins for inline tags
#2
lopuhin
closed
7 years ago
14
whitespace issues
#1
codinguncut
closed
7 years ago
4