Closed vanniktech closed 1 month ago
I agree with the word "Extremely". The performance is far below usable.
Jsoup took 143 ms to parse this file, whereas Ksoup took 90 seconds, which is more than 600x slower.
Jsoup was tested on the desktop JVM, and Ksoup was tested on iosSimulatorArm64. The simulator might run a bit slower, but it should not be ~600x slower.
For a smaller file which is 459kb, Ksoup took 20s while Jsoup only needed 0.1s (including VM startup time).
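For anyone who wants to reproduce numbers like these, here is a minimal timing sketch using only the Kotlin standard library. The helper names `medianParseTime` and `slowdown` are hypothetical, and the `parse` lambda stands in for whichever parser call is being measured (it is not a Ksoup or Jsoup API). Warm-up runs are discarded so that JIT compilation and class loading do not dominate the JVM result.

```kotlin
import kotlin.time.Duration
import kotlin.time.Duration.Companion.milliseconds
import kotlin.time.Duration.Companion.seconds
import kotlin.time.measureTime

// Hypothetical helper: runs `parse` on `input` several times and returns the
// median wall-clock duration, after discarding `warmups` untimed runs.
fun <T> medianParseTime(
    input: String,
    warmups: Int = 3,
    runs: Int = 5,
    parse: (String) -> T,
): Duration {
    require(runs > 0) { "runs must be positive" }
    repeat(warmups) { parse(input) }
    val samples = List(runs) { measureTime { parse(input) } }
    return samples.sorted()[runs / 2]
}

// Slowdown factor of `slow` relative to `fast`.
fun slowdown(slow: Duration, fast: Duration): Double = slow / fast

// The figures reported above: 90 s versus 143 ms, roughly a 629x slowdown.
val reportedSlowdown: Double = slowdown(90.seconds, 143.milliseconds)
```

A median is used rather than a mean so that a single slow outlier run does not skew the comparison.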
@vanniktech @Him188 thanks for your feedback. I'm aware of this performance issue and will optimize it in the next few versions.
@vanniktech I think it would be great if we had the option to ignore text, as it may save a lot of memory.
@itboy87 Thanks. This is an amazing project and I'm looking forward to the updates
@Him188 Thanks. I'm working on it.
For a smaller file which is 459kb, Ksoup took 20s while Jsoup only needed 0.1s (including VM startup time).
Could you please share the sample code with me? I'm testing it, and it took only 1 second to parse this file with ksoup, while Jsoup took 0.1 seconds.
and it took only 1 second to parse this file with ksoup, while Jsoup took 0.1 seconds.
That's still 10x slower.
@vanniktech Yup, I know. I'm looking into this, but @Him188 mentioned 90s.
@vanniktech @Him188 Currently, I'm working on two versions: one built with Ktor and kotlinx, and the other built using Korlibs. I see that the Korio branch has better performance; it took only 120ms compared to 1100ms.
Actually, for now, I'm not sure which I will use for the upcoming versions. Korlibs are not widely used but are good; on the other hand, kotlinx-io and Ktor are more like standard libraries for Kotlin. I might optimize Ksoup with kotlinx-io and Ktor, or just go with Korio, which is already performing well. I haven't decided yet.
The upcoming version 0.1.3, which uses Korlibs, is ready to publish.
I would recommend kotlinx-io because it's official. People are likely to be (already) using it and don't want to have multiple IO libraries. Or we can introduce separate modules for IO support: ksoup-io for kotlinx-io (same naming convention used by kotlinx-serialization) and ksoup-korio.
Since you mentioned Ktor, let me also share some of my thoughts about it :)
Ktor currently maintains two major versions, 2.x and 3.x. 3.x is still in alpha and is binary incompatible with 2.x. I would expect most existing projects to be using 2.x, and new projects are also likely to use the latest stable version, 2.x.
However, Ksoup depends on ktor-client-core 3.x, forcing its consumers to also use Ktor 3.x. If the consumer (like my project) is using Ktor 2.x, the code still compiles, but it throws ClassNotFoundError at runtime. I had to migrate my project to 3.x in order to use Ksoup.
So Ksoup could also publish separate variants based on Ktor 2.x and 3.x. However, I would instead recommend not depending on Ktor at all, as it does not sound necessary for an XML parser to depend on an HTTP client library.
I can guess why Ksoup needs Ktor: maybe because of the Charset implementation. From memory, kotlinx-io also seems to support UTF-8, but only internally, as Source.readString or something. That's beyond my knowledge, so there's nothing I can help with :(
Testing code:
Relevant code extracted:
Note that Xml is expect-actual. On native platforms it's a typealias to Ksoup, and on JVM it's a typealias to Jsoup.
fun parseMikanSubjectIdsFromSearch(document: Document): List<String> {
    return document.getElementsByClass("an-info").mapNotNull { anInfo ->
        anInfo.parent()?.let { a ->
            val attr = a.attr("href")
            if (attr.isEmpty()) return@let null
            attr.substringAfter("/Home/Bangumi/", "")
                .takeIf { it.isNotBlank() }
        }
    }
}

@Test
fun `can parse subject index`() {
    val ids = AbstractMikanMediaSource.parseMikanSubjectIdsFromSearch(
        Xml.parse(
            readTestResourceAsString("/mikan-search-无职转生.txt"),
        ),
    )
    assertEquals(listOf(3060, 2353, 2549, 3344).map { it.toString() }, ids)
}
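To rule out the string handling as a cost, the href extraction step above can be exercised on its own, without any parser. This is a sketch: `extractSubjectId` is a hypothetical helper that mirrors the logic inside `parseMikanSubjectIdsFromSearch`, not code from the project.

```kotlin
// Hypothetical helper mirroring the href handling in parseMikanSubjectIdsFromSearch.
// Returns the part of `href` after "/Home/Bangumi/", or null when the attribute
// is empty, the prefix is missing, or nothing follows the prefix.
fun extractSubjectId(href: String): String? {
    if (href.isEmpty()) return null
    return href
        // With missingDelimiterValue = "", a href without the prefix yields "".
        .substringAfter("/Home/Bangumi/", missingDelimiterValue = "")
        // Blank results are turned into null so mapNotNull can drop them.
        .takeIf { it.isNotBlank() }
}
```

This part is plain string work per matched element, so any large slowdown would have to come from parsing or from the element lookup rather than from here.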
The resource is already read as a string, so I would not expect such a large performance difference on the IO side. Maybe the getElementsByClass functions are actually to blame?
Please note that I was comparing Ksoup on iosSimulatorArm64 and Jsoup on desktop JVM. Maybe the simulator is actually far slower than I expected.
However, I would instead recommend not depending on Ktor at all, as it does not sound necessary for an XML parser to depend on an HTTP client library.
This would really be ideal. The fewer dependencies the better. I am also still using Ktor 2.0 and I can't upgrade to a beta version.
I would recommend kotlinx-io because it's official. People are likely to be using it and don't want to have multiple IO libraries. Or we can introduce separate modules for IO support: ksoup-io for kotlinx-io (same naming convention used by kotlinx-serialization) and ksoup-korio.
@Him188 agree with you.
Please note that I was comparing Ksoup on iosSimulatorArm64 and Jsoup on desktop JVM. Maybe the simulator is actually far slower than I expected.
Yes, the simulator may not perform like a physical device, but Ksoup still needs a lot of improvement. I'm working on it and I will publish both kotlinx-io and Korio variants.
getElementsByClass: even if you pass an HTML string to parse, it still uses a lot of IO operations for parsing and streaming.
However, I would instead recommend not depending on Ktor at all, as it does not sound necessary for an XML parser to depend on an HTTP client library.
This would really be ideal. The fewer dependencies the better. I am also still using Ktor 2.0 and I can't upgrade to a beta version.
Actually, I'm using Ktor for the charset encoder and decoder, which are currently not available in kotlinx-io. The upcoming version will use Ktor 3 because it supports more targets, such as wasm. I may also publish one variant with Ktor 2.
@vanniktech @Him188 Version 0.1.4 has been released with Korio and with the performance issues fixed, and I'm already working on the kotlinx-io + Ktor variant.
FWIW I'm trying out Ksoup in my library, unfurl, and I'm finding v0.1.4 to be ~2x-3x slower than jsoup.
@itboy87 thanks for cutting a new release! I will try it out in the next few days.
Regardless, I haven't checked this yet, but I feel that plain jsoup running on Android is slower than on desktop. Obviously a desktop is much more powerful, but maybe jsoup does something that Android phones don't like, something specific to the Android runtime.
@saket @Him188, thanks for your feedback! I've been working on separating the IO dependency from the core code, which has now been implemented in the develop branch. I've also added performance comparison test code for Ksoup vs. Jsoup, which you can find here PerformanceComparisonTest.kt.
Next, I'm finalizing the kotlinx-io variant, which is almost complete. After that, I'll focus on addressing performance issues.
@itboy87 That sounds very nice! I will give it a try when you finish the kotlinx-io variant.
@itboy87 Nice improvements. It's much faster, so much so that I'm making the switch. I will release a new version and take it from there.
I tested v0.1.4 and it worked well. It took just 1s to parse the case that previously required 90s (on iosSimulatorArm64).
Good job!
@saket @vanniktech @Him188 Thanks for your feedback. I have released Ksoup 0.1.5 with three variants: kotlinx-io + ktor3, kotlinx-io + ktor2, and korlibs-io. Many performance issues have been fixed. Please give it a try and let me know your feedback. For now, I’m closing this issue. Feel free to open a new one if you encounter any issues.
Next, I’m also working on a variant with no external dependencies, which will be a lightweight version supporting only string HTML and XML parsing and UTF-8.
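As a side note on why a UTF-8-only variant can plausibly avoid external dependencies: the Kotlin standard library already decodes UTF-8 on every target. The sketch below shows only that stdlib capability; the helper names are hypothetical and this is not how Ksoup implements it.

```kotlin
// Decodes raw bytes as UTF-8 using only the Kotlin standard library.
// throwOnInvalidSequence = true rejects malformed input instead of silently
// substituting the replacement character U+FFFD.
fun decodeUtf8Strict(bytes: ByteArray): String =
    bytes.decodeToString(throwOnInvalidSequence = true)

// Same as above, but returns null for malformed input instead of throwing.
fun decodeUtf8OrNull(bytes: ByteArray): String? =
    try {
        decodeUtf8Strict(bytes)
    } catch (e: CharacterCodingException) {
        null
    }
```

Other charsets are where a library such as Ktor or Korio is still needed, which matches the reason given earlier in the thread for the Ktor dependency.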
I tried to upgrade and use com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.5 but I'm getting:
* What went wrong:
Configuration cache state could not be cached: field `runtimeDependencies` of task `:app-rss-reader:dataBindingMergeDependencyArtifactsDebug` of type `com.android.build.gradle.internal.tasks.databinding.DataBindingMergeDependencyArtifactsTask`: error writing value of type 'org.gradle.api.internal.artifacts.configurations.ResolutionBackedFileCollection'
> Could not resolve all files for configuration ':app-rss-reader:debugRuntimeClasspath'.
> Could not find ksoup-ktor2:ksoup-engine-common:unspecified.
Searched in the following locations:
- file:/Users/niklas/.m2/repository/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
- https://repo.maven.apache.org/maven2/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
- https://dl.google.com/dl/android/maven2/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
- https://jitpack.io/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
- https://maven.pkg.github.com/intergi/playwire-android-binaries/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
- https://android-sdk.is.com/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
- https://artifact.bytedance.com/repository/pangle/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
- https://cboost.jfrog.io/artifactory/chartboost-ads/ksoup-ktor2/ksoup-engine-common/unspecified/ksoup-engine-common-unspecified.pom
Required by:
project :app-rss-reader > project :feature-rss-reader > com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.5 > com.fleeksoft.ksoup:ksoup-network-ktor2-android:0.1.5 > com.fleeksoft.ksoup:ksoup-engine-kotlinx-ktor2:0.1.5 > com.fleeksoft.ksoup:ksoup-engine-kotlinx-ktor2-android:0.1.5
during Gradle sync.
@vanniktech It was not published correctly, but I just fixed it in version 0.1.6-alpha1 and published it. It will be available in 15 minutes.
Does com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.6-alpha1 not pull in the regular ksoup dependency? I'm getting unresolved references:
I think there are still publishing related failures since:
Configuration cache state could not be cached: field `runtimeDependencies` of task `:app-rss-reader:dataBindingMergeDependencyArtifactsDebug` of type `com.android.build.gradle.internal.tasks.databinding.DataBindingMergeDependencyArtifactsTask`: error writing value of type 'org.gradle.api.internal.artifacts.configurations.ResolutionBackedFileCollection'
> Could not resolve all files for configuration ':app-rss-reader:debugRuntimeClasspath'.
> Could not find com.fleeksoft.ksoup:ksoup:0.1.6-alpha1.
@vanniktech Please use com.fleeksoft.ksoup:ksoup-ktor2 together with ksoup-network-ktor2, not com.fleeksoft.ksoup:ksoup.
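In Gradle Kotlin DSL, that pairing might look like the sketch below. The two coordinates come from the comments above; using the same 0.1.6-alpha1 version for ksoup-ktor2 is an assumption, since only the ksoup-network-ktor2 version was stated.

```kotlin
// build.gradle.kts (module level). A sketch, not taken from the Ksoup docs.
dependencies {
    // Core parser built against Ktor 2.x, used instead of com.fleeksoft.ksoup:ksoup.
    // Assumption: it is published under the same version as the network module.
    implementation("com.fleeksoft.ksoup:ksoup-ktor2:0.1.6-alpha1")
    // Network extension built against Ktor 2.x.
    implementation("com.fleeksoft.ksoup:ksoup-network-ktor2:0.1.6-alpha1")
}
```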
@Him188 @saket @vanniktech Thanks for your feedback! The upcoming version is optimized and has the same performance as Jsoup on JVM. Please check the attached screenshots.
Amazing stuff!
Currently I have two native parser implementations. Android uses DocumentBuilderFactory and the like, and on iOS I use NSXMLParser. I'd like to replace these with Ksoup so I can share the parsing logic and have everything unified. Additionally, Ksoup is a lot more lenient when parsing, which is nice because RSS feeds often contain an unescaped & that throws off both of my parsers right now, and Ksoup would solve this as well. However, Ksoup is in some instances substantially slower, for instance when parsing the XML from this site: https://www.1978.tokyo/rss. I've run a few tests on my phone and Ksoup is ~3x slower than DocumentBuilderFactory. My assumption is that it also parses all the 'text' contained in each <description> tag as Nodes, for instance. Is there any way to turn this off?