MohamedRejeb / Ksoup

Ksoup is a lightweight Kotlin Multiplatform library for parsing HTML, extracting HTML tags, attributes, and text, and encoding and decoding HTML entities.
Apache License 2.0
376 stars 10 forks source link

decodeHtml4 & decodeHtml5 in one step #18

Closed vanniktech closed 1 year ago

vanniktech commented 1 year ago

decodeHtml5 isn't a set which also contains the decodeHtml4 entities. I'd like to invoke both. I know that I can call them one after another however, but I'm guessing performance wise a single call would be better. Since it'd only need to loop once instead of twice over the string. Also NumericEntityDecoder() is used twice which is unnecessary.

Esentially, I want an aggregator like this:

AggregateTranslator(
    LookupTranslator(EntityMaps.HTML4Decode),
    LookupTranslator(EntityMaps.HTML5Decode),
    NumericEntityDecoder(),
)

Side note: I also found decodeHtml confusing as at first I thought this is decode4+5 but it's just 5, in which case, I'd go for removing that method to make the API smaller & conciser.

MohamedRejeb commented 1 year ago

decodeHtml5 already contains all HTML4 entities so you can just use decodeHtml5 and you will cover all the entities. True decodeHtml is confusing, maybe I'll deprecate it and suggest replacing it with decodeHtml5. I added it to make it clear that this is going to cover any HTML entity (4+5) so people don't get confused

vanniktech commented 1 year ago

It does not. The following test case fails for HTML5:

  @Test fun decode() {
    assertEquals("Å", KsoupEntities.decodeHtml4("Å"))
    assertEquals("Å", KsoupEntities.decodeHtml5("Å"))
  }
MohamedRejeb commented 1 year ago

It's a bug. I'll check this

MohamedRejeb commented 1 year ago

I'll release a new version soon containing the fix for this issue and the other issues as well.