fleeksoft / ksoup

Ksoup is a Kotlin Multiplatform library for working with HTML and XML. It's a port of the renowned Java library Jsoup.
https://fleeksoft.github.io/ksoup/
Apache License 2.0

Extremely slow parsing of XML #39

Closed vanniktech closed 1 month ago

vanniktech commented 2 months ago

Currently I have two native parser implementations. Android uses DocumentBuilderFactory and the like, and on iOS I use NSXMLParser. I'd like to replace these with Ksoup so I can share the parsing logic and have everything unified. Additionally, Ksoup is a lot more lenient when parsing, which is nice because RSS feeds often contain an unescaped &, which throws off both of my parsers right now; Ksoup would solve this as well. However, Ksoup is in some instances substantially slower, for instance when trying to parse the XML from this site: https://www.1978.tokyo/rss

I've run a few tests on my phone and Ksoup is ~3x slower than DocumentBuilderFactory. My assumption is that it also parses all the 'text' contained in each <description> tag as Nodes, for instance. Is there any way to turn this off?

Him188 commented 2 months ago

I agree with the word "Extremely". The performance is far below usable.

Jsoup spent 143 ms to parse this file, whereas Ksoup spent 90 seconds, which is more than 600x slower.

Jsoup was tested on desktop JVM, and Ksoup was tested on iosSimulatorArm64. The simulator might run a bit slower, but it should not be ~600x slower.

mikan-search-无职转生.txt

Him188 commented 2 months ago

For a smaller file which is 459kb, Ksoup took 20s while Jsoup only needed 0.1s (including VM startup time).

mikan-bangumi-无职转生.txt

itboy87 commented 2 months ago

@vanniktech @Him188 thanks for your feedback. I'm aware of this performance issue and will optimize it in the next few versions.

@vanniktech I think it would be great if we had the option to ignore text, as it may save a lot of memory.

Him188 commented 2 months ago

@itboy87 Thanks. This is an amazing project and I'm looking forward to the updates

itboy87 commented 2 months ago

@Him188 Thanks. I'm working on it.

itboy87 commented 2 months ago

> For a smaller file which is 459kb, Ksoup took 20s while Jsoup only needed 0.1s (including VM startup time).
>
> mikan-bangumi-无职转生.txt

Could you please share the sample code with me? I'm testing it, and it took only 1 second to parse this file with ksoup, while Jsoup took 0.1 seconds.

vanniktech commented 2 months ago

> and it took only 1 second to parse this file with ksoup, while Jsoup took 0.1 seconds.

That's still 10x slower.

itboy87 commented 2 months ago

> > and it took only 1 second to parse this file with ksoup, while Jsoup took 0.1 seconds.
>
> That's still 10x slower.

@vanniktech Yup, I know. I'm looking into this, but @Him188 mentioned 90s.

itboy87 commented 2 months ago

@vanniktech @Him188 Currently, I'm working on two versions: one built with Ktor and kotlinx, and the other built using Korlibs. I see that the Korio branch has better performance; it took only 120ms compared to 1100ms.

Actually, for now, I'm not sure which one I will use for the upcoming versions. Korlibs is not widely used but is good; on the other hand, kotlinx-io and Ktor are more like standard libraries for Kotlin. I might optimize Ksoup with kotlinx-io and Ktor, or just go with Korio, which is already performing well. I haven't decided yet.

The upcoming version 0.1.3, which uses Korlibs, is ready to publish.

Him188 commented 2 months ago

I would recommend kotlinx-io because it's official. People are likely to be (already) using it and don't want to have multiple IO libraries. Or we can introduce separate modules for io support: ksoup-io for kotlinx-io (same naming convention used by kotlinx-serialization) and ksoup-korio.


Since you mentioned Ktor, let me also share some of my thoughts about it :)

Ktor currently maintains two major versions, 2.x and 3.x. 3.x is still in alpha and is binary incompatible with 2.x. I would expect most existing projects to be using 2.x, and new projects are also likely to use the latest stable version, 2.x.

However, Ksoup depends on ktor-client-core 3.x, forcing its consumers to also use Ktor 3.x. If the consumer (like my project) is using Ktor 2.x, the code still compiles, but it throws ClassNotFoundError at runtime. I had to migrate my project to 3.x in order to use Ksoup.

So Ksoup could also publish separate variants based on Ktor 2.x and 3.x. However, I would instead recommend not depending on Ktor at all, as it does not sound necessary for an XML parser to depend on an HTTP client library. I can guess why Ksoup needs Ktor: maybe because of the Charset implementation. From my memory, kotlinx-io seems to also support UTF-8, but only internally, as Source.readString or something. That's beyond my knowledge, so there's nothing I can help with :(

Him188 commented 2 months ago

Testing code:

https://github.com/open-ani/ani/blob/f7366d424151200644569a2467a12a5b61289110/data-sources/bt/mikan/src/commonMain/kotlin/MikanMediaSource.kt#L221

https://github.com/open-ani/ani/blob/master/data-sources/bt/mikan/src/commonTest/kotlin/MikanSubjectIndexTest.kt

Relevant code extracted:

Note that the Xml is expect-actual. On native platforms it's a typealias to Ksoup, and on JVM it's a typealias to Jsoup.

        fun parseMikanSubjectIdsFromSearch(document: Document): List<String> {
            return document.getElementsByClass("an-info").mapNotNull { anInfo ->
                anInfo.parent()?.let { a ->
                    val attr = a.attr("href")
                    if (attr.isEmpty()) return@let null

                    attr.substringAfter("/Home/Bangumi/", "")
                        .takeIf { it.isNotBlank() }
                }
            }
        }

    @Test
    fun `can parse subject index`() {
        val ids = AbstractMikanMediaSource.parseMikanSubjectIdsFromSearch(
            Xml.parse(
                readTestResourceAsString("/mikan-search-无职转生.txt"),
            ),
        )
        assertEquals(listOf(3060, 2353, 2549, 3344).map { it.toString() }, ids)
    }

The resource is already read as a string, so I would not expect such a large performance difference on the IO side. Maybe the getElementsByClass functions are actually to blame?

Him188 commented 2 months ago

Please note that I was comparing Ksoup on iosSimulatorArm64 and Jsoup on desktop JVM. Maybe the simulator is actually far slower than I expected.
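
One way to tell whether parsing or the getElementsByClass query dominates is to time the two phases separately on the same platform. The following is only a minimal sketch using the Kotlin standard library: `timePhases` and `PhaseTimings` are hypothetical names, and the stand-in lambdas in `main` do not call Ksoup or Jsoup. In the test above, `parse` would wrap `Xml.parse` and `query` would wrap `parseMikanSubjectIdsFromSearch`.

```kotlin
import kotlin.time.Duration
import kotlin.time.measureTimedValue

// Hypothetical helper type: how long each phase of one run took.
data class PhaseTimings(val parse: Duration, val query: Duration)

// Times `parse` and `query` separately so that a slow parser can be told
// apart from a slow selector. Both lambdas are supplied by the caller.
fun <D, R> timePhases(
    input: String,
    parse: (String) -> D,
    query: (D) -> R,
): Pair<R, PhaseTimings> {
    val parsed = measureTimedValue { parse(input) }
    val queried = measureTimedValue { query(parsed.value) }
    return queried.value to PhaseTimings(parsed.duration, queried.duration)
}

fun main() {
    // Stand-in workload (an assumption for illustration): splitting into
    // lines plays the role of parsing, counting matches the role of the query.
    val input = List(10_000) { "<a class=\"an-info\" href=\"/Home/Bangumi/$it\"></a>" }
        .joinToString("\n")

    val (matches, timings) = timePhases(
        input,
        parse = { it.lines() },
        query = { lines: List<String> -> lines.count { line -> "an-info" in line } },
    )
    println("matches=$matches parse=${timings.parse} query=${timings.query}")
}
```

A single run like this includes warm-up effects, so repeating the measurement several times on a physical device and on the simulator would give a fairer comparison.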

vanniktech commented 2 months ago

> However, I might recommend instead, not depending on ktor, as it does not sound necessary for an XML parser to depend on an HTTP client library.

This would really be ideal. The fewer dependencies, the better. I am also still using Ktor 2.0 and I can't upgrade to a beta version.

itboy87 commented 2 months ago

> I would recommend kotlinx-io because it's official. People are likely to be using it and don't want to have multiple IO libraries. Or we can introduce separate modules for io support: ksoup-io for kotlinx-io (same naming convention used by kotlinx-serialization) and ksoup-korio.

@Him188 agree with you.

itboy87 commented 2 months ago

> Please note that I was comparing Ksoup on iosSimulatorArm64 and Jsoup on desktop JVM. Maybe the simulator is actually far slower than I expected.

Yes, the simulator may not perform like a physical device. But it still needs a lot of improvement. I'm working on it and I will publish both kotlinx-io and korio variants.

itboy87 commented 2 months ago

> getElementsByClass

Even if you parse from an HTML string, parsing and streaming still involve a lot of IO operations.

itboy87 commented 2 months ago

> > However, I might recommend instead, not depending on ktor, as it does not sound necessary for an XML parser to depend on an HTTP client library.
>
> This would really be ideal. The fewer dependencies, the better. I am also still using Ktor 2.0 and I can't upgrade to a beta version.

Actually, I'm using Ktor for the charset encoder and decoder, which are currently not available in kotlinx-io. The upcoming version will use Ktor 3 because it supports more targets, like wasm. I may also publish one variant with Ktor 2.

itboy87 commented 2 months ago

@vanniktech @Him188 Version 0.1.4 has been released with Korio and the performance issues fixed, and I'm already working on the kotlinx-io + Ktor variant.

saket commented 2 months ago

FWIW I'm trying out ksoup in my library, unfurl and I'm finding v0.1.4 to be ~2x-3x slower than jsoup.

vanniktech commented 2 months ago

@itboy87 thanks for cutting a new release! I will try it out in the next few days.

Regardless, I haven't checked this yet, but I feel like plain jsoup running on Android is slower than on Desktop. Obviously a Desktop is much more powerful, but maybe jsoup does something that Android phones don't like, something specific to the Android runtime.

itboy87 commented 2 months ago

@saket @Him188, thanks for your feedback! I've been working on separating the IO dependency from the core code, which has now been implemented in the develop branch. I've also added performance comparison test code for Ksoup vs. Jsoup, which you can find here PerformanceComparisonTest.kt.

Screenshot 2024-08-13 at 1 59 05 AM

Next, I'm finalizing the kotlinx-io variant, which is almost complete. After that, I'll focus on addressing performance issues.

Him188 commented 2 months ago

@itboy87 That sounds very nice! I will give it a try when you finish kotlinx-io variant.

vanniktech commented 2 months ago

@itboy87 nice improvements. It's much faster, so much that I'm making the switch, I will release a new version and take it from there.

Him188 commented 2 months ago

I tested v0.1.4 and it worked well. It took just 1s to parse the case that previously required 90s (on iosSimulatorArm64). Good job!

itboy87 commented 1 month ago

@saket @vanniktech @Him188 Thanks for your feedback. I have released Ksoup 0.1.5 with three variants: kotlinx-io + ktor3, kotlinx-io + ktor2, and korlibs-io. Many performance issues have been fixed. Please give it a try and let me know your feedback. For now, I’m closing this issue. Feel free to open a new one if you encounter any issues.

Next, I’m also working on a variant with no external dependencies, which will be a lightweight version supporting only string HTML and XML parsing and UTF-8.
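
As a rough illustration of how a parser core could stay free of IO and charset dependencies while still allowing richer variants, the core might depend only on a small decoding interface with a UTF-8 default. This is a sketch under stated assumptions, not Ksoup's actual design: `TextDecoder`, `Utf8OnlyDecoder`, and `ParserCore` are hypothetical names that do not appear in the library.

```kotlin
// Hypothetical names throughout; this is not Ksoup's actual API.

// The only thing the core needs from an "engine": bytes in, text out.
fun interface TextDecoder {
    fun decode(bytes: ByteArray, charsetName: String): String
}

// Dependency-free default that supports UTF-8 only, using the Kotlin stdlib.
object Utf8OnlyDecoder : TextDecoder {
    override fun decode(bytes: ByteArray, charsetName: String): String {
        require(charsetName.equals("UTF-8", ignoreCase = true)) {
            "Only UTF-8 is supported without an external engine, got $charsetName"
        }
        return bytes.decodeToString()
    }
}

// Stand-in for the parser core: it receives a decoder instead of importing
// an IO or HTTP library directly.
class ParserCore(private val decoder: TextDecoder = Utf8OnlyDecoder) {
    fun readInput(bytes: ByteArray, charsetName: String = "UTF-8"): String =
        decoder.decode(bytes, charsetName)
}

fun main() {
    val core = ParserCore()
    println(core.readInput("<rss>a & b</rss>".encodeToByteArray()))
}
```

In such a layout, a variant backed by another library would supply its own `TextDecoder` implementation, so consumers would pull in only the dependencies they choose.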

Screenshot 2024-08-24 at 12 49 46 AM

vanniktech commented 1 month ago

I tried to upgrade and use the com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.5 but I'm getting:

* What went wrong:
Configuration cache state could not be cached: field `runtimeDependencies` of task `:app-rss-reader:dataBindingMergeDependencyArtifactsDebug` of type `com.android.build.gradle.internal.tasks.databinding.DataBindingMergeDependencyArtifactsTask`: error writing value of type 'org.gradle.api.internal.artifacts.configurations.ResolutionBackedFileCollection'
> Could not resolve all files for configuration ':app-rss-reader:debugRuntimeClasspath'.
   > Could not find ksoup-ktor2:ksoup-engine-common:unspecified.
     Searched in the following locations:
       - file:/Users/niklas/.m2/repository/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
       - https://repo.maven.apache.org/maven2/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
       - https://dl.google.com/dl/android/maven2/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
       - https://jitpack.io/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
       - https://maven.pkg.github.com/intergi/playwire-android-binaries/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
       - https://android-sdk.is.com/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
       - https://artifact.bytedance.com/repository/pangle/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
       - https://cboost.jfrog.io/artifactory/chartboost-ads/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
     Required by:
         project :app-rss-reader > project :feature-rss-reader > com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.5 > com.fleeksoft.ksoup:ksoup-network-ktor2-android:0.1.5 > com.fleeksoft.ksoup:ksoup-engine-kotlinx-ktor2:0.1.5 > com.fleeksoft.ksoup:ksoup-engine-kotlinx-ktor2-android:0.1.5

during Gradle sync.

itboy87 commented 1 month ago

@vanniktech It was not published correctly, but I just fixed it in version 0.1.6-alpha1 and published it. It will be available in 15 minutes.

vanniktech commented 1 month ago

Does com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.6-alpha1 not pull in the regular ksoup dependency? I'm getting unresolved references:

Screenshot 2024-08-24 at 12 07 43

I think there are still publishing related failures since:

Configuration cache state could not be cached: field `runtimeDependencies` of task `:app-rss-reader:dataBindingMergeDependencyArtifactsDebug` of type `com.android.build.gradle.internal.tasks.databinding.DataBindingMergeDependencyArtifactsTask`: error writing value of type 'org.gradle.api.internal.artifacts.configurations.ResolutionBackedFileCollection'
> Could not resolve all files for configuration ':app-rss-reader:debugRuntimeClasspath'.
   > Could not find com.fleeksoft.ksoup:ksoup:0.1.6-alpha1.

itboy87 commented 1 month ago

@vanniktech Please use com.fleeksoft.ksoup:ksoup-ktor2 together with ksoup-network-ktor2, not com.fleeksoft.ksoup:ksoup.
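
In a Gradle Kotlin DSL build, the suggestion above might look like the following fragment. It is a sketch that assumes both artifacts are published under the same version (0.1.6-alpha1, the one mentioned earlier in this thread) and that the consuming module uses the `implementation` configuration.

```kotlin
// build.gradle.kts of the consuming module (fragment).
dependencies {
    // Core parser built against Ktor 2, instead of com.fleeksoft.ksoup:ksoup.
    implementation("com.fleeksoft.ksoup:ksoup-ktor2:0.1.6-alpha1")
    // Network extension matching the same Ktor 2 variant.
    implementation("com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.6-alpha1")
}
```

Keeping the core and network artifacts on the same variant avoids mixing Ktor 2 and Ktor 3 on one classpath.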

itboy87 commented 1 month ago

@Him188 @saket @vanniktech Thanks for your feedback! The upcoming version is optimized and has the same performance as Jsoup on JVM. Please check the attached screenshots.

Screenshot 2024-09-19 at 10 19 12 PM Screenshot 2024-09-19 at 10 10 46 PM Screenshot 2024-09-19 at 10 10 36 PM

saket commented 1 month ago

Amazing stuff!