Christof93 / SciKGTeX

SciKGTeX is a LuaTeX package that introduces commands to mark research contributions in scientific documents. SciKGTeX enriches the document by adding your contributions to the PDF metadata in a structured XMP format, which can be picked up by search engines and knowledge graphs.
MIT License

Ensure that valid RDF is created #14

Closed okarras closed 1 year ago

okarras commented 1 year ago

Problem: Currently, we sometimes create URIs in the RDF that are not valid/resolvable, because we use the label (string) of a property rather than the correct ORKG property ID. The ORKG property ID can be either a string or a number. For example, http://orkg.org/property/objective leads to a 404 error instead of the right property, whereas the property URI for method, http://orkg.org/property/P15051, resolves correctly.

Goal: We must ensure for others that the generated RDF is valid.

For the 5 pre-defined commands, we should directly use the correct ORKG property IDs.

Furthermore, here are four ideas on how to address this issue for self-defined properties by a SciKGTeX user:

  1. Similar to the identification of resources via \researchproblem{\uri{https://www.orkg.org/orkg/resource/R12259}{antibiotic therapy}}, we provide an option so that users can define their own properties with URIs directly.
  2. We create a lookup table for all or the top-k ORKG properties, e.g., as a `.sty` file, a `.lua` file, or even an entire LaTeX package that we update regularly.

     ```latex
     \documentclass[12pt]{article}

     \usepackage{xstring}

     % Map a property label to its ORKG property ID.
     \newcommand{\orkglookup}[1]{%
       \IfEqCase{#1}{%
         {research field}{P30}%
         {url}{url}%
         % ...
       }%
     }

     \begin{document}

     This is the predicate for "research field": \orkglookup{research field}.

     \end{document}
     ```


![grafik](https://github.com/Christof93/SciKGTeX/assets/5848876/fe52652a-cf3f-431d-acec-f70c2b0d5ebc)

3. We try to integrate a lookup by label using the ORKG API to find the correct ORKG property ID. The user can enter the label, and we search for potential URIs on the fly.
4. We create a blank node with an `rdfs:label` set to the user's label, so that consumers can try to resolve the blank node using the provided label.

manuelprinz commented 1 year ago

Searching for a predicate by label is already supported by the API, see http://tibhannover.gitlab.io/orkg/orkg-backend/api-doc/#predicates-lookup. The query parameter can take any valid Lucene query expression.

In case this functionality needs extension in some way, please let us know!

Christof93 commented 1 year ago

Ok, so I would propose that we create a build process in which the Lua file is assembled with an automatically generated map (property name → property ID). I could set up a daily build that fetches all properties and updates the map.
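For illustration, a minimal sketch of such a build step (the function name and the Lua layout are assumptions, not the actual `assemble_lua_source.py` output; escaping of untrusted labels is deliberately omitted here):

```python
# Hypothetical sketch: turn a {label -> property ID} map into Lua source
# that the package can load offline.
def make_lua_map(properties):
    lines = ["local orkg_properties = {"]
    for label, pid in sorted(properties.items()):
        # NOTE: labels are user-generated content; a real build step must
        # escape them before embedding (injection is discussed below).
        lines.append(f'    ["{label}"] = "{pid}",')
    lines.append("}")
    lines.append("return orkg_properties")
    return "\n".join(lines)

print(make_lua_map({"research field": "P30", "method": "P15051"}))
```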

How many properties are there approx.? 🤔

Christof93 commented 1 year ago

> Searching for a predicate by label is already supported by the API, see http://tibhannover.gitlab.io/orkg/orkg-backend/api-doc/#predicates-lookup. The query parameter can take any valid Lucene query expression.
>
> In case this functionality needs extension in some way, please let us know!

That would be the easiest solution. The issue with the LaTeX package is that it should run offline. Technically it would even be possible to fetch the api lookup with Lua but in many cases the package will not be able to access the internet (e.g. when run on overleaf).

okarras commented 1 year ago

> How many properties are there approx.?

According to the website statistics, ORKG has 9,636 properties. However, @manuelprinz checked their frequency of use, and at some point there are larger jumps from 1000 to 500 to 100 uses, so we could even define a cut-off point requiring that a property has a minimum frequency of use.

Christof93 commented 1 year ago

Considering it takes around 20-80 bytes to define one property name-URI pair in the source code, the whole thing should still stay below 1 MB, which is fairly acceptable imo. We could even think about compressing it and storing it in a Base64-encoded string if size starts to become a problem.

So what I would need is a way to collect all property names and URIs from the API. What would be the easiest way to achieve that?

manuelprinz commented 1 year ago

I also do not think that size is of any concern. As for the fetching, you can use the API directly, but need to write some glue code to deal with pagination. The Python package should have that.
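The pagination glue code could look roughly like this (a sketch assuming a Spring-style paged response with `content` and `last` fields; the exact field names in the ORKG API are an assumption here):

```python
# Sketch of pagination glue code: collect items across all pages.
def fetch_all(fetch_page):
    """fetch_page(page) -> parsed JSON dict with 'content' and 'last'."""
    items = []
    page = 0
    while True:
        data = fetch_page(page)
        items.extend(data["content"])
        if data.get("last", True):  # stop once the API marks the final page
            break
        page += 1
    return items
```

With the real API, `fetch_page` would wrap an HTTP GET against the predicates endpoint with a `page` query parameter.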

That being said, we already export data in various forms for others to consume. Personally, I think it would be even easier to hook the query into our exporting infrastructure, and directly generate what you need. This could be either LaTeX, or Lua, or both. You could fetch that from the web, or we could publish it somewhere. This can be automatically triggered by any schedule that suits, like daily.

I understand that this makes you somewhat dependent on the service, but you already are (to a lesser extent). Both variants (exporting or pulling via the API) are fine with me.

We can clarify the details in a call, if that makes things easier.

Christof93 commented 1 year ago

I already toyed around with the API yesterday when I had some spare time. See this Python script https://github.com/Christof93/SciKGTeX/blob/14-ensure-that-valid-rdf-is-created/build/assemble_lua_source.py.

I find 9,637 predicates with 8,547 distinct labels. Collection takes ~1 min with the standard page size of 20. Btw, it did take me a while to figure out the API URL 😬. Even though it's kind of obvious, the documentation doesn't really state it. Here's what I can generate from this: lua_table_code.txt.

If there is more than one property matching a label, I will probably just emit a warning and instruct the user to double-check online and define the correct URI manually.

I think that might be a good starting point. dm me on Skype if you want to schedule a call. (live:christofbless)

manuelprinz commented 1 year ago

> it did take me a while to figure out the API URL

Means we need to improve the documentation. Note taken.

I think it is fine to increase the page size. We allow for bigger values but do not communicate that, largely because it is more efficient to fetch smaller pages. (They can be fetched in parallel, in principle; also, the server responds quicker.) The same query on the database takes about 2 seconds.

The output seems fine, although there are some values that are completely odd. I need to look into that. I suspect those are from properties that were created accidentally, and not really used. Your solution via the API will give you the metadata for all properties known, which is not the same as all properties used. (We currently do not delete unused ones automatically, for various reasons.) Running the query on the server could account for that. Other than that, I agree that it is a good starting point. :)

Christof93 commented 1 year ago

Before I merge this, I will have to put some thought into preventing code injection, since generating production code from user-generated content is obviously extremely dangerous in that regard 😄.

manuelprinz commented 1 year ago

Just wanted to mention that we have the same problem, in the sense that it would be possible to store potentially dangerous information in the graph, e.g. an exploit using `<script>` in a literal. I think a good way to deal with that is to encode it before generating the final results; in your case, entity-encoding it before it goes into the XML.
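For the XML side, Python's standard library already covers this; a minimal sketch:

```python
# Entity-encode untrusted labels before they go into the XMP/XML output.
from xml.sax.saxutils import escape

label = '<script>alert("x")</script>'
safe = escape(label)  # escapes &, < and >
print(safe)
```

Note that `escape` does not quote `"` by default; for attribute values you would pass an extra entities map, e.g. `escape(label, {'"': "&quot;"})`.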

We could offer the option to do that in the API, if that is more convenient, but I think it would make the lookup harder. At that point, it is just string matching, and the risk is basically zero. (Except for buffer overflows or the like, which we cannot protect against, as that would be a bug in software downstream.)

Christof93 commented 1 year ago

Yes, entity encoding certainly makes sense for the XML. The bigger problem is the generated Lua code, though. Fortunately, I can do a similar thing by escaping reserved characters in Lua so that it is not possible to close the string prematurely. If left untreated, a user could create a property like `']=nil} print('hello world') --` on ORKG, and a day later any code added there would be run by all users of SciKGTeX when they use the package 🫣.
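A sketch of that escaping idea (not the actual SciKGTeX implementation), assuming the generator emits double-quoted Lua string literals: it suffices to escape backslashes, double quotes, and line breaks, since single quotes and brackets are harmless inside a double-quoted Lua string.

```python
# Escape a label before embedding it in a double-quoted Lua string literal,
# so a malicious label cannot terminate the string early and inject code.
def lua_escape(label):
    return (label.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

malicious = "']=nil} print('hello world') --"
entry = f'["{lua_escape(malicious)}"] = "P0",'  # "P0" is a made-up ID
print(entry)  # the payload stays inert inside the quoted key
```

If the generator used Lua long strings (`[[...]]`) instead, a different escaping strategy would be needed.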

Christof93 commented 1 year ago

Let me know if there is a straightforward way to grab all the in-use properties without the deleted ones 👍🏻

manuelprinz commented 1 year ago

The only way I can think of right now is to call the statements endpoint for each predicate to check if it is used in a statement. (It would return no results if a predicate is not used.) But that is basically 2*N queries. I need to check whether we accept multiple IDs at once, but my preference would be to add support for that in the API, either via a parameter like `only_used` or something similar, or by exporting the Lua file directly on our side.

Christof93 commented 1 year ago

If you want to do an export from your side, it would suffice to create a JSON object with labels as keys and the list of property IDs associated with each label as values. If I can fetch that from somewhere, I'm already prepared to make the Lua table out of it.
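Consuming that proposed export could look like this (the sample data, including the ID `P9999`, is invented for illustration; the ambiguity warning follows the behaviour described earlier in the thread):

```python
# Hypothetical consumer of the proposed export format: {label: [property IDs]}.
import json

export = json.loads("""
{
  "research field": ["P30"],
  "method": ["P15051", "P9999"]
}
""")

table = {}
for label, ids in export.items():
    if len(ids) > 1:
        # Ambiguous label: warn and let the user define the URI manually.
        print(f"warning: {len(ids)} properties share the label {label!r}")
    table[label] = ids[0]
```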

MarcelKonrad commented 1 year ago

The merge request for the feature in the orkg backend can be found here.

EDIT: FYI: A full export is currently around 305 KB.

Christof93 commented 1 year ago

Nice Marcel! Thanks!

Where could I download this automatic export from?

MarcelKonrad commented 1 year ago

The final url has not yet been decided on, as the linked MR has not been reviewed yet. I expect it to be merged by the end of the week or early next week. I will let you know.

MarcelKonrad commented 11 months ago

After some deployment issues, the file is now available at https://incubating.orkg.org/files/mappings/predicate-ids_to_label.json, which is our public testing server. The actual file you want to scrape will soon be available at https://orkg.org/files/mappings/predicate-ids_to_label.json

MarcelKonrad commented 11 months ago

Sorry for the long delay. The file is now finally available at https://orkg.org/files/mappings/predicate-ids_to_label.json