ResearchObject / ro-crate-py

Python library for RO-Crate
https://pypi.org/project/rocrate/
Apache License 2.0

Iterate through graph #131

Closed SteffenBrinckmann closed 2 years ago

SteffenBrinckmann commented 2 years ago

Hey, is there an easy way to iterate/walk through the graph in Python? As far as I can tell, in version 0.7 the top node is parsed, and then one can move to a different node on request. Is there a built-in function to iterate/walk through each node? Thanks, Steffen

simleo commented 2 years ago

Use get_entities. Example:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "about": {"@id": "./"},
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "author": {"@id": "https://orcid.org/0000-0002-1825-0097"}
    },
    {
      "@id": "https://orcid.org/0000-0002-1825-0097",
      "@type": "Person"
    }
  ]
}

for e in crate.get_entities():
    print((e.id, e.type))

('ro-crate-metadata.json', 'CreativeWork')
('./', 'Dataset')
('https://orcid.org/0000-0002-1825-0097', 'Person')

SteffenBrinckmann commented 2 years ago

Sorry for not being clear. I can iterate through the top-level entities as you mentioned. That is a list, which is a special case of a graph.

But each entity can have a 'hasPart' property which contains a list of ids for 'sub'-entities, which can then contain even another level of 'hasPart' and so on. That would build a complete graph, potentially.

Is there a method to iterate through all of those, other than implementing a recursive function, which might even run into endless loops if sub-sub-nodes become parent nodes in a complex graph?

simleo commented 2 years ago

Such a graph can be built for any kind of relationship, not just hasPart. I think it's best to use a specialized library such as networkx for that. You could try something like this:

from rocrate.rocrate import ROCrate
from rocrate.model.entity import Entity
import networkx as nx

crate = ROCrate("/path/to/crate")
g = nx.DiGraph()  # directed graph: edges go from container to part
for e in crate.get_entities():
    parts = e.get("hasPart")
    if not parts:
        continue  # entity has no hasPart property
    if not isinstance(parts, list):
        parts = [parts]  # normalize a single value to a list
    for p in parts:
        # only link values that are Entity objects
        if isinstance(p, Entity):
            g.add_edge(e.id, p.id)

At this point you can iterate through the nodes via the networkx API. Any time you need to resolve an id back to the entity, just use crate.get.
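For the endless-loop concern: networkx traversal helpers such as nx.dfs_preorder_nodes keep track of visited nodes, so cycles are not a problem there. If you would rather not add a dependency, the same idea can be sketched with a visited set directly on the flattened JSON-LD. This is only a minimal illustration using the standard library, not something provided by ro-crate-py; the walk_parts helper and the sample metadata (which deliberately contains a cycle) are made up for the example.

```python
import json
from collections import deque

# Hypothetical flattened metadata; "b/" points back to "./" to form a cycle.
METADATA = json.loads("""
{
  "@graph": [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "a/"}, {"@id": "b/"}]},
    {"@id": "a/", "@type": "Dataset", "hasPart": {"@id": "a/data.csv"}},
    {"@id": "b/", "@type": "Dataset", "hasPart": [{"@id": "./"}]},
    {"@id": "a/data.csv", "@type": "File"}
  ]
}
""")


def walk_parts(graph, root_id, prop="hasPart"):
    """Yield ids reachable from root_id via prop, breadth-first, each once."""
    by_id = {e["@id"]: e for e in graph}
    seen = {root_id}  # ids already queued, so cycles cannot loop forever
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        yield current
        refs = by_id.get(current, {}).get(prop, [])
        if not isinstance(refs, list):
            refs = [refs]  # a single reference may appear without a list
        for ref in refs:
            target = ref.get("@id") if isinstance(ref, dict) else None
            if target is not None and target not in seen:
                seen.add(target)
                queue.append(target)


order = list(walk_parts(METADATA["@graph"], "./"))
print(order)
```

Passing a different prop value walks any other relationship the same way, since the traversal does not depend on hasPart specifically.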

SteffenBrinckmann commented 2 years ago

Thanks @simleo, for the help.

A related question: am I correct that only the RO-Crate top level is parsed?

Can I force all entities to be parsed? And why do some identifiers have a '#' prefix?

simleo commented 2 years ago

RO-Crate metadata files contain flattened JSON-LD, so everything is top-level (all entities appear directly under @graph). Identifiers with a leading # are local to the RO-Crate; the corresponding entities are parsed just like the others.
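To make the "everything is top-level" point concrete, here is a small sketch that reads a flattened metadata document with the standard json module only. The sample metadata and the #alice identifier are invented for illustration; ro-crate-py does the equivalent indexing for you.

```python
import json

# Hypothetical metadata with one '#'-prefixed (crate-local) entity.
METADATA = json.loads("""
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "author": {"@id": "#alice"}},
    {"@id": "#alice", "@type": "Person", "name": "Alice"}
  ]
}
""")

# Every entity sits directly under @graph, so one pass indexes all of them.
by_id = {entity["@id"]: entity for entity in METADATA["@graph"]}

# References are {"@id": ...} stubs; resolving one is a dictionary lookup.
author_ref = by_id["./"]["author"]
author = by_id[author_ref["@id"]]

# '#'-prefixed ids live in the same list as every other entity.
local_ids = [i for i in by_id if i.startswith("#")]
print(sorted(by_id))
print(author["name"], local_ids)
```

Nesting only ever appears as a reference stub, never as an embedded object, which is why a single loop over @graph reaches every entity.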

As for Python type (type(e)) vs semantic type (e.type): the ro-crate-py model defines only a small number of specialized types, typically in cases where there is significant functionality associated with them. For instance, for File entities there is a File Python class with methods that specify what to do when it's written to disk. Similarly, directories are modeled by Dataset, which has a corresponding specific Python type. The vast majority of data entities fall into these two groups, so they will have a specific Python type. In most cases, however, the Python type for a contextual entity would just be ContextEntity. In the past we've explored the possibility of mapping all of Schema.org into the Python class hierarchy, but we abandoned it since it would gain us little while adding a lot of unnecessary complexity. Also note that RO-Crate entities can have multiple types, not necessarily tied by a parent-child relationship, so in general there cannot be a perfect match between Python and semantic type.
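A rough sketch of that design idea, with stand-in classes rather than the actual ro-crate-py ones: GenericEntity, FileLike, DatasetLike, SPECIALIZED and build are all hypothetical names. Only types with attached behavior get their own class; everything else falls back to a generic one, while the full list of semantic types is kept as data.

```python
# Hypothetical stand-ins, not the ro-crate-py classes.
class GenericEntity:
    def __init__(self, jsonld):
        self.id = jsonld["@id"]
        raw = jsonld.get("@type", [])
        # @type may be a single string or a list of strings
        self.types = raw if isinstance(raw, list) else [raw]


class FileLike(GenericEntity):
    def write(self):
        return f"copy file {self.id}"


class DatasetLike(GenericEntity):
    def write(self):
        return f"create directory {self.id}"


# Only semantic types that carry extra behavior map to a dedicated class.
SPECIALIZED = {"File": FileLike, "Dataset": DatasetLike}


def build(jsonld):
    """Pick the first specialized class that matches, else the generic one."""
    raw = jsonld.get("@type", [])
    types = raw if isinstance(raw, list) else [raw]
    for t in types:
        if t in SPECIALIZED:
            return SPECIALIZED[t](jsonld)
    return GenericEntity(jsonld)


workflow = build({
    "@id": "wf.ga",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
})
person = build({"@id": "#alice", "@type": "Person"})

print(type(workflow).__name__, workflow.types)
print(type(person).__name__, person.types)
```

The workflow entity has three semantic types but a single Python class, which is the mismatch described above.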

SteffenBrinckmann commented 2 years ago

Thank you so much for all the explanations.