SPARQL-Anything / sparql.anything

SPARQL Anything is a system for Semantic Web re-engineering that allows users to ... query anything with SPARQL.
https://sparql-anything.cc/
Apache License 2.0

Lightweight HTML Triplifier #484

Open jmkeil opened 3 weeks ago

jmkeil commented 3 weeks ago

The package io.github.sparql-anything.sparql-anything-html has a heavy storage footprint (>160MB) due to its dependency on com.microsoft.playwright.driver-bundle, which essentially ships the Node.js binaries five times (Windows, Linux, Linux ARM, Mac and Mac ARM). As I understand it, this is needed to run a headless browser that executes the JS in the HTML being triplified.

I guess this is not needed in many use cases.

Therefore, I would like to ask you to consider providing an additional lightweight HTML Triplifier that just triplifies the static HTML document. This would result in significantly smaller binaries for downstream projects and would probably also improve the execution time.

enridaga commented 3 weeks ago

This is very much on point. There is a general problem in building an executable that ends up being terribly large because of all possible dependencies on features that a specific user may not need ... I am not sure how to fix this in the short term. One way could be to declare the dependency as "provided" in the module pom and add it in the launchers' build. I wonder whether there is any good practice we can refer to.

justin2004 commented 3 weeks ago

it would be nice to make the build more a la carte but let's not clobber the ability to easily run the headless browser. i have had to use this capability several times.

jmkeil commented 3 weeks ago

I do not deny the existence of use cases for the headless variant. Therefore, my request was about a lightweight Triplifier in addition.

enridaga commented 2 weeks ago

@jmkeil I'd like to work out a strategy to cope with this issue and I think your case is perfect for that. Can you please clarify how you are using the package? Are you embedding the maven package in your own build? If this is the case, it should be as easy as marking the dependencies as provided in the POM and including them only in the runnable builds. Would that work?
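
A minimal sketch of what the "provided" approach could look like, assuming the Playwright driver bundle is declared under the coordinates named at the top of this issue and that its version is managed in a property called `playwright.version` (a hypothetical name). Maven does not pass `provided`-scoped dependencies on to consumers, so a project embedding the module would no longer pull in the bundle, while a runnable build would have to declare it again explicitly:

```xml
<!-- Module POM (for example sparql-anything-html): available at compile time,
     but not passed on to projects that depend on the module. -->
<dependency>
  <groupId>com.microsoft.playwright</groupId>
  <artifactId>driver-bundle</artifactId>
  <version>${playwright.version}</version> <!-- hypothetical property -->
  <scope>provided</scope>
</dependency>

<!-- Launcher POM (a runnable build): declared again with the default
     compile scope, so the headless browser still ships with it. -->
<dependency>
  <groupId>com.microsoft.playwright</groupId>
  <artifactId>driver-bundle</artifactId>
  <version>${playwright.version}</version>
</dependency>
```

The module may in practice declare a different Playwright artifact that brings in the bundle transitively; in that case the scope would go on that declaration instead.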

jmkeil commented 2 weeks ago

I do not use it yet, but I am considering using it soon to enable the import of data in several formats into my pipeline-based tool. However, my concern is that the binaries will become quite large. I thought about either using sparql-anything-engine and excluding the maven packages for the few formats that I do not need or that are too large, or only using the sparql-anything-* maven packages I actually want in the first place. Having HTML among the supported formats would be nice, but increasing the size of the binaries by an order of magnitude wouldn't be worth it to me.
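
The first option could be sketched on the consumer side roughly as follows. This assumes that sparql-anything-engine is published under the same group ID as the HTML package and depends on the format modules; the `sparql-anything.version` property is a hypothetical name:

```xml
<dependency>
  <groupId>io.github.sparql-anything</groupId>
  <artifactId>sparql-anything-engine</artifactId>
  <version>${sparql-anything.version}</version> <!-- hypothetical property -->
  <exclusions>
    <!-- Drop the HTML triplifier and, with it, the Playwright driver bundle. -->
    <exclusion>
      <groupId>io.github.sparql-anything</groupId>
      <artifactId>sparql-anything-html</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Running `mvn dependency:tree` afterwards would show whether the driver bundle is really gone or is still reachable through another dependency path.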