Closed GiliardGodoi closed 1 year ago
Thanks for the note! To make sure we understand what's going on: you mean that replacing all kinds of whitespace in `get_html` (and not just `\n`) with a single space improves the rendering, and then you wouldn't need to preprocess your original texts to use them with the streamlit demo?
Currently, in `get_html`, we replace the line break `\n` in the HTML with a space. For short texts that's fine, but in long texts that have paragraphs, the paragraph breaks are removed.
I suggest removing just the extra spaces instead: `html = re.sub("\s{2,}", " ", html)`.
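To make the difference between the two approaches concrete, here is a minimal sketch in plain Python. The helper names `collapse_newlines` and `collapse_extra_whitespace` and the sample HTML are made up for illustration; they are not part of spacy-streamlit, and the first helper only mimics the behaviour described above.

```python
import re


def collapse_newlines(html: str) -> str:
    """Behaviour described above: every line break becomes a space."""
    return html.replace("\n", " ")


def collapse_extra_whitespace(html: str) -> str:
    """Suggested alternative: collapse only runs of 2+ whitespace characters."""
    # A raw string keeps \s intact for the regex engine.
    return re.sub(r"\s{2,}", " ", html)


# Hypothetical snippet with an indented line and two single line breaks.
sample = "<div>\n    <mark>Ada</mark> wrote code.\nShe lived in London.\n</div>"

print(repr(collapse_newlines(sample)))
print(repr(collapse_extra_whitespace(sample)))
```

Note that `\s{2,}` also matches a blank line (`\n\n`), so paragraphs separated by an empty line would still be joined by a single space; only single line breaks survive.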
Why?
I think it's because when we have additional spaces at the beginning of a sentence, markdown turns it into a block, as in the example below.
(image: additional spaces)
So I think the problem is not that "newlines seem to mess with the rendering", but the additional tab or spacing at the beginning of a new line.
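As a rough way to check a text for this, the sketch below flags lines that start with four or more columns of indentation, which is the threshold at which CommonMark-style markdown can treat a line as an indented code block. The helper `indented_code_candidates` is hypothetical, and it only finds candidates: whether a line is actually rendered as a block also depends on what precedes it and on the renderer.

```python
from typing import List


def indented_code_candidates(text: str, width: int = 4) -> List[int]:
    """Return 1-based numbers of non-blank lines indented by `width`+ columns."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Treat a tab as advancing to the next multiple of four columns.
        expanded = line.expandtabs(4)
        if not expanded.strip():
            continue  # blank lines cannot start a block on their own
        indent = len(expanded) - len(expanded.lstrip(" "))
        if indent >= width:
            hits.append(number)
    return hits


sample = (
    "First paragraph.\n"
    "\n"
    "    Second paragraph, indented.\n"
    "\tTabbed line.\n"
    "  Slightly indented."
)

print(indented_code_candidates(sample))
```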
So an initial caveat that I'm not that familiar with streamlit. I checked their bug tracker but only found one issue related to how you'd be required to use trailing whitespace in markdown like `\n` to get a rendered newline.
I think the whitespace would make a difference for `st.markdown`, but not `st.write`? It looks like the parser and textcat visualizers are using `st.markdown`, but not the ner visualizer, which is using `st.write`. And it looks like `get_html` is only used in combination with `st.write`.
I've experimented a bit, but I can't find an example where replacing the whitespace makes a difference, at least not for the ner visualizer? Do you have a concrete example of a text where this is a problem?
In my tests it looks like `\n` is converted into `</br>` (I'm not sure why it's `</br>` instead of `<br/>`, but this part is coming from spacy/displacy directly), and I can add `\n` in various places and it ends up rendered with line breaks like I expect.
I could imagine that the cases using `st.markdown` might be rendered strangely due to whitespace in the original text, though?
I think you are right! We ran a couple of tests and maybe we were mistaken.
Anyway, thank you for your patience and attention!
Recently, I was working on a side project with some colleagues, playing with Named Entity Recognition, Streamlit, and spaCy.
One of my friends managed to use displacy to generate the HTML needed to show the named entities, pretty much like the implementation of `visualize_ner` does. However, this friend removed the line break character `\n` from the text before processing it with spaCy and turning it into a `Doc` object, like line 30 of the `utils.py` file. As you might know, this approach shows the texts all clumped together without the line breaks.
We came up with a better solution by replacing all the extra spaces in the generated HTML like this:
And we were wondering if you guys would like to know that or let the users know it.
That's all, folks!