Removing extra space when visualizing Named Entities

GiliardGodoi commented 1 year ago

Recently, I was working on a side project with some colleagues, playing with Named Entity Recognition, Streamlit, and spaCy.

One of my friends managed to use displacy to generate the HTML needed to show the named entities, much like the implementation of visualize_ner does.

However, this friend removed the line break character \n from the text before processing it with spaCy and turning it into a Doc object, similar to line 30 of the utils.py file.

As you might know, this approach shows the texts all clumped together without the line breaks.

We came up with a better solution by replacing all the extra spaces in the generated HTML like this:

html = re.sub(r"\s{2,}", " ", html)
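To make the behaviour of that substitution concrete, here is a minimal sketch using only the standard library. The helper name collapse_whitespace and the sample string are made up for illustration and are not part of spacy-streamlit. It suggests that the pattern collapses any run of two or more whitespace characters (including a blank line between paragraphs) into one space, while a single \n is left untouched:

```python
import re


def collapse_whitespace(html: str) -> str:
    """Replace every run of two or more whitespace characters with one space.

    Hypothetical helper for illustration; not part of spacy-streamlit.
    """
    return re.sub(r"\s{2,}", " ", html)


# Made-up sample: a blank line plus indentation, then a single line break.
sample = "<div>First paragraph.\n\n    Second paragraph,\nsame paragraph.</div>"

collapsed = collapse_whitespace(sample)
print(repr(collapsed))
```

Note that \s also matches \n, so a paragraph separated by a blank line is merged into the previous one by this pattern.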

We were wondering whether you would like to know about this, or to let users know about it.

That's all, folks 😄

adrianeboyd commented 1 year ago

Thanks for the note! To make sure we understand what's going on: you mean that replacing all kinds of whitespace in get_html with a single space, and not just \n, improves the rendering, and then you wouldn't need to preprocess your original texts to use them with the streamlit demo?

GiliardGodoi commented 1 year ago

Currently, get_html replaces the line break \n in the HTML. For short texts that's fine, but for long texts that might have paragraphs, the paragraph breaks are removed.

I suggest removing only the extra spaces: html = re.sub(r"\s{2,}", " ", html).

Why?

I think it's because when we have additional spaces at the beginning of a sentence, markdown turns it into a block, as in the example below.

additional spaces

I think the problem is not that "newlines seem to mess with the rendering" but the additional tab or spaces at the beginning of a new line.
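If leading indentation really is the culprit, one way to test that hypothesis would be to strip only the whitespace at the start of each line and keep every line break. The sketch below assumes exactly that; strip_leading_indent is a hypothetical helper, not something get_html does today:

```python
import re


def strip_leading_indent(text: str) -> str:
    """Remove spaces and tabs at the start of each line, keeping line breaks.

    Hypothetical helper for illustration; not part of spacy-streamlit.
    """
    # With re.MULTILINE, ^ matches at the start of every line.
    # [ \t]+ never matches \n, so blank lines and paragraph breaks survive.
    return re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)


# Made-up sample with an indented line and a tab-indented line.
sample = "First line\n\n    indented line\n\tTabbed line"

cleaned = strip_leading_indent(sample)
print(repr(cleaned))
```

Unlike the \s{2,} pattern, this leaves blank lines between paragraphs intact, so it isolates the effect of indentation from the effect of newlines.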

😄

adrianeboyd commented 1 year ago

An initial caveat: I'm not that familiar with streamlit. I checked their bug tracker but only found one issue, about how you're required to use trailing whitespace in markdown before \n to get a rendered newline.

I think the whitespace would make a difference for st.markdown, but not st.write? It looks like the parser and textcat visualizers are using st.markdown, but not the ner visualizer, which is using st.write. And it looks like get_html is only used in combination with st.write.

I've experimented a bit, but I can't find an example where replacing the whitespace makes a difference, at least not for the ner visualizer? Do you have a concrete example of a text where this is a problem?

In my tests it looks like \n is converted into </br> (I'm not sure why it's </br> instead of <br/>, but this part is coming from spacy/displacy directly), and I can add \n in various places and it ends up rendered with line breaks like I expect.

I could imagine that the cases using st.markdown might be rendered strangely due to whitespace in the original text, though?

GiliardGodoi commented 1 year ago

I think you are right! We ran a couple of tests and it seems we were mistaken.

Anyway, thank you for your patience and attention! 🥲

explosion / spacy-streamlit

Removing extra space when visualizing Named Entities #50