Selectors with HTML gobble the HTML Tag

uskudarli commented 8 years ago

As Emrah pointed out, and as testing confirmed, when working with HTML documents the text selectors send HTML tags into a black hole: the tags are lost.

beratdogan commented 8 years ago

From the Web Annotation Model description: http://www.w3.org/TR/annotation-model/#text-quote-selector

The text must be normalized before recording. Thus HTML/XML tags should be removed, character entities should be replaced with the character that they encode, unnecessary whitespace should be normalized, character encoding should be turned into UTF-8, and so forth. The normalization routine may be performed automatically by a browser, and other applications should implement the DOM String Comparisons method. This allows the Selector to be used with different encodings and user agents and still have the same semantics and utility.

So we must strip HTML tags before creating an annotation. But then how will renarration find and replace the text in the original HTML?
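A minimal sketch of the normalization the quoted passage describes (drop tags, decode character entities, collapse whitespace), written in Python with only the standard library. The function name `normalize_for_selector` is hypothetical; it is not part of the Web Annotation Model or of annotator-extension, and the sketch makes no attempt to skip `<script>` or `<style>` content.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects character data and ignores the tags around it."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; or &#233; before handing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def normalize_for_selector(markup: str) -> str:
    """Hypothetical helper: text as it would be recorded in a selector."""
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()  # flush any text still buffered by the parser
    text = "".join(extractor.parts)
    # Collapse every run of whitespace to one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

The result is only suitable for computing the selector's `exact` text. On its own it cannot say where that text sits in the original markup, which is the open question above.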

@uskudarli @tbdinesh @EmrahGuder

uskudarli commented 8 years ago

I think we need to study that more. I think it means not to use the tags when computing the text characteristics. I do not think the intent is to remove the tags from the content.

The idea would be to reduce all text that appears equivalent to the same representation, so that the proper location can be found.

This is a tough task, and it applies to all applications. I think it makes more sense for a browser to handle this uniformly.

Just in case, here is the same point as I first wrote it in Turkish. What I understand from this is: many pieces of text that are written differently in the code (HTML/XML) look the same on the surface, so the reader perceives them as identical. If there is a normalized form that you can call equivalent to all of them, then before computing anything you first convert to that normalized form and carry out the operations (addressing, for example) on it. In short, the computation becomes well defined. A lot of related issues still need to be solved.
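One way to read this, as a sketch only: compute the normalized text for matching, but remember which source characters each normalized character came from, so the tags stay in the document and the match can be mapped back to it. The names `normalize_with_offsets` and `locate_quote` are hypothetical, the regex tokenizer is a simplification (it is not a real HTML parser and ignores comments, scripts, and malformed markup), and this is not how annotator-extension currently works.

```python
import html
import re

# Simplified tokenizer: a tag, a character entity, a whitespace run,
# or any other single character.
_TOKEN = re.compile(
    r"(?P<tag><[^>]*>)|(?P<entity>&#?\w+;)|(?P<space>\s+)|(?P<char>.)",
    re.DOTALL,
)


def normalize_with_offsets(markup):
    """Return (normalized_text, spans); spans[i] is the (start, end)
    range in markup that produced normalized_text[i]."""
    chars, spans = [], []
    for match in _TOKEN.finditer(markup):
        kind = match.lastgroup
        if kind == "tag":
            continue  # tags add nothing to the text but stay in markup
        if kind == "space":
            if chars and chars[-1] == " ":
                continue  # collapse whitespace, even across tags
            text = " "
        elif kind == "entity":
            text = html.unescape(match.group())
        else:
            text = match.group()
        for ch in text:
            chars.append(ch)
            spans.append((match.start(), match.end()))
    return "".join(chars), spans


def locate_quote(markup, exact):
    """Find a quote in the normalized text and return the (start, end)
    offsets of the matching region in the original markup, or None."""
    needle = re.sub(r"\s+", " ", exact)
    if not needle:
        return None
    text, spans = normalize_with_offsets(markup)
    index = text.find(needle)
    if index < 0:
        return None
    return spans[index][0], spans[index + len(needle) - 1][1]
```

Slicing the original markup with the returned offsets yields the matched region with its tags intact. A renarration could then replace or wrap that region, although a span that starts or ends inside an element would still need care to keep the HTML well formed.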

The topic is very important in general. Right now there is not enough time to work on it for the demo. Emrah's demo does not handle these cases at the moment either.

An interesting and important topic... I think we will run into it in many places. Our work is always full of sorting out details like this, isn't it?

(Sent by email in reply to Berat Doğan's comment of Sat, May 21, 2016 at 2:23 AM, quoted in full above.)

crazy-annotators / annotator-extension

Selectors with HTML gobble the HTML Tag #8