datapurifier / landingpage-in-a-box

Hugo-based method for building a landing page system from DataCite metadata
MIT License

Process subjects with entity recognition #3

Open skybristol opened 9 months ago

skybristol commented 9 months ago

Most subjects in DataCite records do not appear to carry scheme information placing them in a particular vocabulary, or URIs pointing to resolvers for the terms themselves. For those that do, I will check whether there are common patterns worth exploiting (a term that points to a source for its definition is not necessarily useful). For most other terms, we can run a simple entity recognition process to at least group them into likely categories (e.g., place names will be particularly useful if split out). We could also pull the descriptions into the NLP process, and I may test how effective that is.
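As a rough sketch of that first pass, the snippet below splits subject entries into those that carry any scheme or URI information and those that are bare strings, then tallies the scheme names so recurring patterns stand out. It assumes subjects arrive as a list of dicts using the DataCite JSON keys `subject`, `subjectScheme`, `schemeUri` and `valueUri`; the function names and sample records are made up for illustration and are not part of this repository.

```python
from collections import Counter

# Keys that, when populated, tie a subject to a vocabulary or a resolvable term.
REFERENCE_KEYS = ("subjectScheme", "schemeUri", "valueUri")


def split_subjects(subjects):
    """Partition subject entries by whether any reference key is populated."""
    with_scheme, bare = [], []
    for entry in subjects:
        has_ref = any(str(entry.get(key) or "").strip() for key in REFERENCE_KEYS)
        (with_scheme if has_ref else bare).append(entry)
    return with_scheme, bare


def scheme_counts(subjects):
    """Count subjects per scheme name; empty or missing schemes become '(none)'."""
    return Counter(
        str(entry.get("subjectScheme") or "").strip() or "(none)"
        for entry in subjects
    )


# Hypothetical sample records (not taken from real DataCite items).
sample = [
    {"subject": "Colorado River"},
    {
        "subject": "Hydrology",
        "subjectScheme": "Example Thesaurus",
        "valueUri": "https://example.org/terms/hydrology",
    },
    {"subject": "water quality", "subjectScheme": ""},
]

with_scheme, bare = split_subjects(sample)
print(len(with_scheme), "with scheme info;", len(bare), "bare")
print(scheme_counts(sample).most_common())
```

The bare group is what would go on to entity recognition; the scheme tally is only a quick way to see which vocabularies recur often enough to be worth a closer look.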

The terms will only really become meaningful if we link them to a source for definition and semantic classification. We might try this eventually, but for now I just want to break out a couple of additional logical taxonomies.

skybristol commented 9 months ago

My initial tests on this were not very positive. The subjects on DataCite items are a hodgepodge. Even those that purport to come from a given scheme source do not always turn out to be linkable, at least not without a lot of work. An approach that uses LLM technology and incorporates titles and descriptions might prove more fruitful for identifying and building in linked subject matter. With such a wide-open context, we even have trouble with things like species name recognition (e.g., I got false hits on "Phosphorous", a name that crosses disciplines). Another approach might be to simply run through and link everything we can, by basic label matching, to a small set of reference material. This is the same issue we have everywhere right now.
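The basic label-linking idea could look something like the sketch below: normalize each subject string, look it up in an index built from a small reference set, and keep ambiguous hits separate so that names crossing disciplines are flagged instead of silently linked. The reference entries, the `example.org` URIs and the function names are all placeholders for illustration; nothing here reflects an actual vocabulary or code in this repository.

```python
import unicodedata


def normalize(label):
    """Case-fold, apply Unicode NFKC, and collapse whitespace for exact matching."""
    text = unicodedata.normalize("NFKC", label).casefold()
    return " ".join(text.split())


def build_index(reference):
    """Map each normalized label to the set of reference URIs that use it."""
    index = {}
    for uri, labels in reference.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(uri)
    return index


def link_subjects(subjects, index):
    """Return (linked, ambiguous, unlinked) for a list of raw subject strings."""
    linked, ambiguous, unlinked = {}, {}, []
    for subject in subjects:
        uris = index.get(normalize(subject), set())
        if len(uris) == 1:
            linked[subject] = next(iter(uris))
        elif uris:
            # Same label in more than one reference entry: do not guess.
            ambiguous[subject] = sorted(uris)
        else:
            unlinked.append(subject)
    return linked, ambiguous, unlinked


# Hypothetical reference material: URI -> preferred and alternate labels.
reference = {
    "https://example.org/topic/water-quality": ["Water quality", "water-quality"],
    "https://example.org/chem/mercury": ["Mercury", "Hg"],
    "https://example.org/astro/mercury": ["Mercury"],
}

index = build_index(reference)
linked, ambiguous, unlinked = link_subjects(
    ["  Water   Quality ", "MERCURY", "stream gage"], index
)
print(linked)
print(ambiguous)
print(unlinked)
```

Exact matching on normalized labels will miss spelling variants and plurals, so this only catches the easy cases. Anything in the ambiguous or unlinked groups would still need title and description context to resolve.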