doccano / doccano

Open source annotation tool for machine learning practitioners.
MIT License

Feature Request: Support HTML syntax in Document #35

Open tilusnet opened 5 years ago

tilusnet commented 5 years ago

Hi,

I would like to be able to format content (e.g. document classification content) by means of HTML syntax. Currently all content gets escaped.

Is there any quick remedy to this?

BrambleXu commented 5 years ago

Could you give us more information about your goal? It would be great if you can provide some samples.

tilusnet commented 5 years ago

Perhaps it's easier with an example. Let's take your demo here: http://doccano.herokuapp.com/demo/text-classification/ All text there is free flowing and unformatted. If I wanted to get some of the text formatted, e.g. make parts of it bold, or break lines, etc., what's the best way of going about it?

I've seen that these pages are generated by some Django templates with a [[ syntax, which escapes HTML. For example, if I want a line break in the text and put <br/>, it gets escaped as &lt;br/&gt;, which is undesirable for my purposes.

BrambleXu commented 5 years ago

If your input file is a CSV file, there is no way to get a formatted view in the doccano annotation page. But we are going to support the JSON input format. This will render line breaks in the document classification task. As for bold or other HTML syntax, these are not in the plan.

tilusnet commented 5 years ago

Thanks for the response. I'd prefer support for generic, user-defined, flexible formatting.

Given that your framework is Django based, all you would need to consider is a way to allow switching autoescape off.

BrambleXu commented 5 years ago

The front end is implemented in Vue, especially the text in the annotation page. It might be more complicated than autoescape. But you can try it locally and give us some feedback. Thanks.

tilusnet commented 5 years ago

I tried to switch autoescape off but it doesn't work. I'd need some guidelines on how the doccano templates work. For instance, while Django's body syntax is in curly brackets {{ body }}, doccano's is in square brackets [[ body ]].

In general, I think it would be good for this project if it documented how to use and create templates.

BrambleXu commented 5 years ago

Thanks for your feedback. Your request about the documentation is very reasonable. But right now we devote most of our time to useful features and fixing bugs. The documentation might come later. As for the doccano templates, I will share whatever I know to help with your implementation.

As for your question about why doccano uses square brackets: the reason is to distinguish between Vue syntax and Django syntax. We use square brackets [[ body ]] for Vue and curly brackets {{ body }} for Django. You can find the delimiters setting in the js files, such as label.js, document_classification.js, seq2seq.js, and so on.
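To illustrate the delimiter split described above, here is a minimal sketch. Vue 2 does accept a `delimiters` option, but the mount point, the data, and the `interpolate` helper below are assumptions made up for this example. They only imitate how interpolated values end up as text rather than as parsed HTML, and are not doccano's actual code.

```javascript
// Vue 2 option object using custom delimiters so Vue does not clash with Django.
// The selector and data are placeholders, not doccano's real values.
const vueOptions = {
  el: '#app',                // hypothetical mount point
  delimiters: ['[[', ']]'],  // Vue reads [[ ... ]]; Django keeps {{ ... }}
  data: { body: 'First line<br/>Second line' },
};

// Escape the characters that would otherwise be parsed as markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Simplified, hypothetical stand-in for text interpolation: values are
// inserted as escaped text, which is why <br/> never becomes a line break.
function interpolate(template, options) {
  const [open, close] = options.delimiters;
  const quote = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`${quote(open)}\\s*(\\w+)\\s*${quote(close)}`, 'g');
  return template.replace(pattern, (match, key) =>
    key in options.data ? escapeHtml(options.data[key]) : match);
}

console.log(interpolate('<p>[[ body ]]</p>', vueOptions));
// Django-style braces are not touched by the Vue-side delimiters.
console.log(interpolate('<p>{{ body }}</p>', vueOptions));
```

Because the escaping happens on the Vue side, switching off Django's autoescape alone would not be expected to change what the annotation page shows.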

In your case, because text is in document_classification.html, you might need to change the document_classification.js for autoescape. I am not familiar with Vue. That's all I can provide.

Hope this could help you.

tilusnet commented 5 years ago

Many thanks for your tips!

malteos commented 5 years ago

@tilusnet Any progress?

I tried to get HTML syntax working, but it seems to be non-trivial, especially serializing the text selection to send data back to Django. But for those who want to try, here are some useful resources:

icoxfog417 commented 5 years ago

Thank you for proposing nice libraries. But as this issue is closed, we don't support HTML syntax in the dataset. If you have to deal with it, you can modify the front-end code to handle HTML in the way described above.

HassanChmsdn commented 4 years ago

I was also trying to add support for HTML syntax in sequence labeling, and I added v-html="textPart(r)" in annotator.vue.

Screenshot 2019-10-31 at 12 30 42

The result I got in doccano is as follows: the labels are applied to chunks, but you cannot add a new label or remove an existing one.

Screenshot 2019-10-31 at 12 36 49

Could anybody point me to what else I might need to change, tell me whether adding v-html there is wrong, or show me where the text is processed in the code?

icoxfog417 commented 4 years ago

Re-opening this issue because of the many requests for this feature. Other annotation tools also tend to support this.

HassanChmsdn commented 4 years ago

Re-opening this issue because of the many requests for this feature. Other annotation tools also tend to support this.

Any updates about this issue?

watsonix commented 4 years ago

Looking for an update here too. This is the most important feature missing for us.

Being able to have at least bold, or to change the size or color of the font, would make our text labeling with Doccano much more functional.

There are many scenarios where the thing to be labeled should have visual primacy and yet it's also important to include textual context before or after.

Also is there a way to vote for the feature request here? Or only commenting / thumbs upping?

ljades commented 3 years ago

Also hoping for an update on this. This is basically the last feature keeping us from moving forward using Doccano as our new data annotation tool.

malteos commented 3 years ago

Not true HTML support, but for Open Redact we've built an annotation tool based on React JS that supports paragraphs, which you could style with CSS. See https://github.com/openredact/openredact-app

Hironsan commented 3 years ago

In v1.3.0, I added a TextFile option as an upload option. It may be useful for this feature.

image

janheinrichmerker commented 3 years ago

@Hironsan Your screenshot looks very nice! :smiley: How can I use that feature? Or is this just a preview, not yet implemented?

MaxKman commented 3 years ago

Same question here. I agree that highlighting text seems to be the one important thing missing in what is otherwise a great tool (thanks so much for developing it)!

For my purpose it wouldn't even have to be html formatting directly within the text. I'd be more than happy to add highlights via the json input similar to what you have already implemented for sequence annotations, e.g.

{"text": "EU rejects German call to boycott British lamb.", "highlight": [ [0, 2, "#ffff00"], [18, 22, "#ffb000"] ]}

to highlight 'EU' in yellow and 'call' in orange in the DocumentClassification mode. Would this be simpler to implement?
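A minimal sketch of how such a `highlight` field could be turned into markup, assuming half-open `[start, end)` character offsets (so 'EU' is 0 to 2 and 'call' is 18 to 22), non-overlapping spans, and six-digit hex colors. The `renderHighlights` function is hypothetical. doccano has no such field or helper, and the `highlight` key is only the proposal made in this comment.

```javascript
// Escape text so that only the <mark> tags added below are real markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical renderer for the proposed "highlight" field.
// Spans are [start, end, color] with half-open offsets into record.text.
function renderHighlights(record) {
  const spans = [...(record.highlight || [])].sort((a, b) => a[0] - b[0]);
  let html = '';
  let cursor = 0;
  for (const [start, end, color] of spans) {
    if (start < cursor || start >= end || end > record.text.length) {
      throw new RangeError(`Invalid or overlapping span: [${start}, ${end}]`);
    }
    // Accept only #rrggbb so the color cannot break out of the style attribute.
    if (!/^#[0-9a-f]{6}$/i.test(color)) {
      throw new TypeError(`Unsupported color: ${color}`);
    }
    html += escapeHtml(record.text.slice(cursor, start));
    html += `<mark style="background-color: ${color}">`;
    html += escapeHtml(record.text.slice(start, end));
    html += '</mark>';
    cursor = end;
  }
  return html + escapeHtml(record.text.slice(cursor));
}

const record = {
  text: 'EU rejects German call to boycott British lamb.',
  highlight: [[0, 2, '#ffff00'], [18, 22, '#ffb000']],
};
console.log(renderHighlights(record));
```

Since the offsets refer to the plain text and the markup is generated at render time, the stored text stays unformatted, which avoids the offset problems that raw HTML input would bring.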

liorshk commented 2 years ago

@Hironsan How can we use it? We are looking to tag HTML text files.

jhdxr commented 2 years ago

It's not hard to render HTML in Vue.js, and I'm happy to contribute a PR for it. But before I start the PR, there are 2 issues I think we have to clear up:

  1. how do we determine if a dataset should be rendered as HTML or plain text?
  2. for tasks involving positions of characters, how do we record the positions?

For the first issue, my suggestion is to mark HTML records in the metadata of each record in the dataset. The flag could be changed in the UI, or set automatically when uploading *.html (or maybe via an option at ingestion time?).

For the second issue, TBH I don't have a good idea. Maybe we only enable it for the classification task?

thoughts?

Hironsan commented 2 years ago

For the first issue, I think one of the options is to create an HTML File import option. In this case, we may need to add the is_html field to the Example model.

The second issue is difficult. One idea is to save the start/end elements together with start/end offsets. The current implementation uses an absolute offset starting from 0 (the starting position of the text). In the case of an HTML file, the offset would be relative to the parent element.

For example:

<span>A cat</span><span> is walking.</span>

Annotation: a cat

Annotation: a cat is

This is an idea. Welcome your opinions.
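A rough sketch of the element-relative idea above, assuming the document has already been split into the text content of each element in order. The `segments` array and the `toRelativeSpan` function are hypothetical names for this illustration, not part of doccano.

```javascript
// Find the element index and the offset inside that element for an
// absolute offset into the concatenated text.
function locate(segments, absoluteOffset, isEnd) {
  if (absoluteOffset < 0) {
    throw new RangeError('Offset must not be negative');
  }
  let consumed = 0;
  for (let i = 0; i < segments.length; i++) {
    const length = segments[i].length;
    const local = absoluteOffset - consumed;
    // A start offset on a boundary belongs to the next element;
    // an end offset on a boundary stays in the current element.
    if (local < length || (isEnd && local === length)) {
      return { element: i, offset: local };
    }
    consumed += length;
  }
  throw new RangeError('Offset is outside the document');
}

// Convert an absolute half-open span into start/end element positions.
function toRelativeSpan(segments, start, end) {
  return {
    start: locate(segments, start, false),
    end: locate(segments, end, true),
  };
}

// Text content of <span>A cat</span><span> is walking.</span>
const segments = ['A cat', ' is walking.'];
console.log(toRelativeSpan(segments, 0, 5)); // "A cat": stays inside element 0
console.log(toRelativeSpan(segments, 0, 8)); // "A cat is": ends inside element 1
```

In a real document the element would need a stable identifier (a path or id) rather than a list index, which is part of what makes this hard for sequence labeling.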

Anyway, the implementation is difficult because the new UI uses SVG. So it might be a good idea to start with a classification task.

jhdxr commented 2 years ago

In this case, we may need to add the is_html field to the Example model.

Why not reuse the existing meta field? We would be able to reuse the UI for it as well. Or maybe I misunderstand this field; can I check with you what its purpose is?

An idea is to save start/end elements with start/end offset

This is of course a solution, and I suggest something like XPath to represent the tag.

Actually, it could be the character position in the raw HTML as well. Taking your example (a cat is), the expected output would be

6-28

and the user would handle the parsing and structure.

However, the real case may be much more complex, e.g. the HTML might be malformed (while the browser is still able to render it). So I guess I will stick with the classification-task-only plan, if you are good with it as well.
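For the raw-position alternative, a small sketch of mapping a selection in the visible text to positions in the HTML string. It assumes well-formed input where every `<` opens a tag closed by the next `>`, and no character entities, so it does not cover the malformed case mentioned above. `toRawRange` is a hypothetical helper. Offsets are half-open, so the end value can differ from counts that use an inclusive end or a slightly different selection.

```javascript
// Map half-open offsets in the visible text to offsets in the raw HTML.
// Assumes well-formed tags and no character entities (see lead-in).
function toRawRange(html, textStart, textEnd) {
  if (textStart < 0 || textStart >= textEnd) {
    throw new RangeError('Expected 0 <= textStart < textEnd');
  }
  let textPos = 0;   // how many visible characters have been seen so far
  let inTag = false; // true while scanning between '<' and '>'
  let rawStart = -1;
  for (let i = 0; i < html.length; i++) {
    const ch = html[i];
    if (inTag) {
      if (ch === '>') inTag = false;
      continue;
    }
    if (ch === '<') {
      inTag = true;
      continue;
    }
    // ch is a visible character at text position textPos.
    if (textPos === textStart) rawStart = i;
    textPos++;
    if (textPos === textEnd) {
      return { start: rawStart, end: i + 1 };
    }
  }
  throw new RangeError('Text range is outside the document');
}

const html = '<span>A cat</span><span> is walking.</span>';
const range = toRawRange(html, 0, 8); // visible text "A cat is"
console.log(range, JSON.stringify(html.slice(range.start, range.end)));
```

The raw slice includes the tags between the two spans, which is why the consumer of such positions has to do its own parsing.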

Hironsan commented 2 years ago

I have no idea how you will implement the feature by using meta field. Could you please explain in detail?

It needs more discussion for the sequence labeling task. The classification task is a good point to start.

jhdxr commented 2 years ago

Well, it's simple. Since meta is a dict, I just add a new pair to it, let's say _is_html => true (the _ prefix indicates a system-generated pair).

However, I dug a bit more into the source code and realized that meta is shown in the sidebar. As I haven't used it before, I might have misunderstood its usage. It would be helpful if you could drop me a link to a brief description of this function.

Hironsan commented 2 years ago

Meta is for storing information about the data. For example, if a review text is considered to be the data, its title, customer id, product title, product category, review date, and so on are examples of meta information.

I understand your method. But I don't think that's a good idea. Actually, we have given special meaning to the meta field before. It was good at first, but people didn't use the feature because it was a misuse of the meta field, and as a result we forgot the feature existed. There has to be something better.

jhdxr commented 2 years ago

I agree that it's not appropriate to use meta for this flag, as I understand its use now. My only remaining concern is whether it's a waste/overkill to create a separate column for this flag instead of using a dict/int (bit-based flags). I'm OK with either solution. As you are the maintainer, I will leave this question to you and work on the rest first.

ljades commented 2 years ago

Hi, chiming in with some ideas.

On my company's fork (I'd link it but it's on our internal network) we made this work.

Three things to note about how we handled it:

ljades commented 2 years ago

Here's how we did it. It's not the cleanest (FE/Vue isn't even close to my SWE specialty), but here's the component I introduced:

<template>
  <div
    v-if="hasHtmlMarkup"
    id="iframe-wrapper"
    class="v-card__text"
  >
    <iframe
      id="iframe-content"
      :width="iframe.width"
      :height="iframe.height"
      :srcdoc="exampleText"
      frameborder="0"
      sandbox="allow-same-origin"
      @load="resizeIframe"
    ><v-card-text class="title text-pre-wrap" v-text="incompatibleMessage" /></iframe>
  </div>
  <v-card-text
    v-else
    class="title highlight text-pre-wrap"
    style="white-space: pre-wrap;"
    v-text="exampleText"
  />
</template>

<script>
/**
 * A combined component that displays example text provided. If there are HTML elements
 * in the string, the text is instead rendered as html within an iframe. Otherwise, the
 * text is rendered in a v-card-text component.
 * @displayName Entity with Optional iframe
 */
export default {
  props: {
    exampleText: {
      type: String,
      required: true
    }
  },
  data() {
    const baseHeight = 50;
    return {
      incompatibleMessage: 'Your browser does not support these iframes',
      iframe: {
        wrapperStyle: null,
        baseHeight,
        width: '100%',
        height: baseHeight.toString(),
      }
    }
  },
  computed: {
    hasHtmlMarkup() {
      // if example text contains html elements, use the iframe rendering option
      // derived the regex from:
      // https://stackoverflow.com/questions/15458876/check-if-a-string-is-html-or-not/15458987
      // it detects open bracket, a case-insensitive character followed by 0
      //    or more whitespaces or non-whitespaces, ending with a closing bracket
      // augmented to include the html character entities as options for brackets because
      // example.text is transformed to swap special characters with these
      return /(<|&lt;)\/?[a-z][\s\S]*(>|&gt;)/i.test(this.exampleText)
    }
  },
  methods: {
    resizeIframe() {
      const exampleIframe = document.getElementById('iframe-content');
      // Scroll height within the iframe for simple text is too small, so words get cut off.
      // Adding a base height adds padding to counteract that
      this.iframe.height = (this.iframe.baseHeight
        + parseInt(exampleIframe.contentWindow.document.body.scrollHeight)).toString();
    }
  }
}
</script>

This is a separate component file. Then, in the task type views, we import this component and swap the v-card-text component for the entity-optional-iframe.
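To see how the detection regex from `hasHtmlMarkup` behaves outside the component, here is a small standalone sketch. The `looksLikeHtml` wrapper and the sample strings are made up for illustration; only the regex itself comes from the component above.

```javascript
// Same pattern as in hasHtmlMarkup: an opening bracket (literal or as the
// &lt; entity), an optional slash, a letter, anything, then a closing bracket.
const HTML_LIKE = /(<|&lt;)\/?[a-z][\s\S]*(>|&gt;)/i;

function looksLikeHtml(text) {
  return HTML_LIKE.test(text);
}

const samples = [
  '<b>EU</b> rejects German call', // real markup
  '&lt;br/&gt;',                   // markup already converted to entities
  'plain text only',               // no brackets at all
  '3 < 5 and 7 > 2',               // '<' is followed by a space, not a letter
  'if a <b and b> c',              // not markup, but shaped like a tag
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), '->', looksLikeHtml(sample));
}
```

The last sample is reported as HTML even though it is not, so plain text containing comparison-like fragments could end up in the iframe branch. An explicit per-record flag, as discussed earlier in the thread, would avoid relying on this heuristic.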

ljades commented 2 years ago

A word of warning, however: we're considering switching to markdown formatting in the future, as it would provide the flexibility we're looking for while being way easier to maintain and write.

VenkateshDas commented 2 years ago

In v1.3.0, I added a TextFile option as an upload option. It may be useful for this feature.

image

@Hironsan Can you please explain in detail how you achieved this? I am trying to do something like this and the explanation would be really helpful. TIA

Hironsan commented 2 years ago

Hi,

I just used the v-html directive. But it can easily lead to XSS vulnerabilities, so the contents need to be sanitized before rendering.
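One conservative way to approach the sanitizing step, sketched under the assumption that only a handful of attribute-free formatting tags are needed: escape everything first, then restore just those tags. `renderLimitedHtml` and the tag list are hypothetical, and this is not what doccano does. For arbitrary HTML a maintained sanitizer library would be the safer choice.

```javascript
// Tags that may be restored after escaping. No attributes are ever allowed.
const ALLOWED_TAGS = ['b', 'strong', 'i', 'em', 'u', 'br'];

function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Escape the whole string, then turn the escaped forms of allowlisted tags
// (e.g. &lt;b&gt;, &lt;/b&gt;, &lt;br/&gt;) back into real tags.
function renderLimitedHtml(text) {
  const escaped = escapeHtml(text);
  const names = ALLOWED_TAGS.join('|');
  const pattern = new RegExp(`&lt;(/?)(${names})\\s*(/?)&gt;`, 'gi');
  return escaped.replace(pattern, (match, closing, name, selfClosing) =>
    `<${closing}${name.toLowerCase()}${selfClosing}>`);
}

console.log(renderLimitedHtml('<b>EU</b> rejects<br/><script>alert(1)</script>'));
console.log(renderLimitedHtml('<b onclick="x">call</b>'));
```

The output of such a function could then be passed to v-html. Anything outside the allowlist, including tags carrying attributes, stays escaped and is shown as literal text.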

david-engelmann commented 1 year ago

@Hironsan Do you know if any progress has been made on this request?

aCampello commented 1 year ago

That would be such a great feature to have!

abushoeb commented 1 year ago

@ljades hi, can you please tell me how I can incorporate your solutions for my sequence labeling task?

@all - is there any solution I can use for HTML documents as input for labeling?