I couldn't think of a different approach: an `img` isn't really a block and will never have text inside it, so there's no point in generating different HTML in the `get_line_info` functions. What was missing was treating it as a special case. We don't want to slice a line from the HTML by looking only at the plain-text `lines`, since that could slice an `img`; we also need to look at the start/end refs for replaced tags.
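To illustrate the special case, here is a rough sketch, not the actual change in this PR. It uses the stdlib `html.parser` rather than lxml, and `REPLACED_TAGS`, `_FragmentScanner` and `can_drop` are hypothetical names made up for the example. The point is only that "no plain text" is not enough to decide a fragment is empty:

```python
from html.parser import HTMLParser

# Hypothetical list for this sketch; only img is special-cased here.
REPLACED_TAGS = {"img"}


class _FragmentScanner(HTMLParser):
    """Records whether a fragment has visible text or a replaced element."""

    def __init__(self):
        super().__init__()
        self.has_text = False
        self.has_replaced = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, and self-closing tags
        # (<img/>) are routed through this handler as well.
        if tag in REPLACED_TAGS:
            self.has_replaced = True

    def handle_data(self, data):
        if data.strip():
            self.has_text = True


def can_drop(fragment: str) -> bool:
    """True only if the fragment has neither text nor a replaced element."""
    scanner = _FragmentScanner()
    scanner.feed(fragment)
    scanner.close()
    return not (scanner.has_text or scanner.has_replaced)
```

With this, a paragraph containing only an `img` is kept even though its plain text is empty, while a paragraph containing only whitespace can still be dropped.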
See MDN for more about replaced elements (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more tags to the list, e.g. `video`, `embed`, etc. I'm not sure about `iframe` and how lxml parsing would treat it, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it?
The full list would come to 9 replaced elements (or 10 if we also count `input`, although I'm not sure in which cases it would render something even though it apparently has no text in it).
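If the list does get extended, the check could look roughly like the sketch below. The set only contains the tags named in this description, and `ALWAYS_REPLACED` and `is_replaced` are hypothetical names, not existing code. For `input`, the sketch assumes we would only count `type="image"`, which is the case MDN describes as behaving like `img`:

```python
# Assumed candidate list; which tags to include is still an open question.
ALWAYS_REPLACED = {"img", "video", "embed", "iframe"}


def is_replaced(tag, attrs=None):
    """Return True if the tag should be kept even when it has no text."""
    tag = tag.lower()
    if tag in ALWAYS_REPLACED:
        return True
    if tag == "input":
        # Only image inputs render content without any text of their own.
        return (attrs or {}).get("type", "").lower() == "image"
    return False
```

Keeping the decision in one function would make it easy to add or remove tags later without touching the slicing logic.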
fixes #22