BHoM / RDF_Prototypes

Research project of the Cluster of Excellence "Integrative Computational Design and Construction for Architecture" (IntCDC) https://www.intcdc.uni-stuttgart.de/ **Project Name**: Knowledge Representation for Multi-Disciplinary Co-Design of Buildings. https://www.intcdc.uni-stuttgart.de/research/research-projects/rp-20/
GNU Lesser General Public License v3.0

How to handle large datasets? Should we store the data somewhere in the cloud, outside of Grasshopper? #45

Open DiellzaElshani opened 2 years ago

DiellzaElshani commented 2 years ago

Keeping the data in Grasshopper makes it very slow to work with. I am wondering whether we really need to have the TTL graph in Grasshopper. What about serializing it, keeping it somewhere in the cloud, and deserializing it only when it is needed? We don't really need to see the TTL format the whole time, right? We can deserialize it to TTL if needed. We can work with the graph data in the GraphDB environment, and Grasshopper can be used just to support the process of creating this graph database.

alelom commented 2 years ago

Our process is:

runtime objects -> C# Graph -> Ontological format (TTL or WebVOWL)

Given our research, we established that, thanks to the features of BHoM, it is possible to map the runtime objects to an ontological format. The runtime objects themselves are not an ontology. The CSharpGraph is the lightest ontological representation of the runtime objects, because it is still runtime and only uses the bare minimum, which is Reflection object instances (PropertyInfos, TypeInfos, etc.). However, CSharpGraph is not interoperable, e.g. you cannot send a CSharpGraph to Protégé or GraphDB. The only way to be interoperable is to translate this minimal ontology representation into an interoperable ontological format such as TTL.

This ontological format can take many forms (TTL, WebVOWL JSON, etc.), but eventually it is always text. It is like a "serialised representation" of the ontology. There is no way to avoid passing through this serialised format if we want to be able to interoperate with other ontological tools like GraphDB or Protégé.

The conversion to the textual file is the performance bottleneck. String generation and handling are not fast. This is the upper performance limit of our current approach.

Currently, to make computation faster, you can avoid outputting the text into a panel on the UI (Grasshopper or others, like Excel) by using the TTLGraph() method that also takes a file path as input. This way, the text is simply written to disk, which speeds up the process. However, this still suffers from the limits of string handling.
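To illustrate the difference between the two output modes, here is a minimal sketch in Python. It is not the BHoM implementation: `triple_line`, `ttl_as_string` and `ttl_to_file` are hypothetical names, and the triples are plain IRI tuples rather than anything produced from a CSharpGraph. One function builds the whole text in memory (what a UI panel needs), the other streams each line straight to disk.

```python
def triple_line(subject, predicate, obj):
    # One Turtle statement with full IRIs; literals and prefixes are ignored here.
    return f"<{subject}> <{predicate}> <{obj}> ."


def ttl_as_string(triples):
    # Builds the entire document in memory, as needed for display in a panel.
    return "".join(triple_line(*t) + "\n" for t in triples)


def ttl_to_file(triples, path):
    # Streams line by line to disk; the full text is never held as one string.
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for t in triples:
            handle.write(triple_line(*t) + "\n")
            count += 1
    return count
```

Both functions still format every triple as a string, which is the cost that writing to disk cannot remove.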

To see how much of an issue this is, we should compare the computation times for the component called CSharpGraph and the one called TTLGraph. In the backend, TTLGraph simply creates a CSharpGraph and then converts it to TTL. This last step is, I believe, what takes 80% of the computation time, but I am not sure about this proportion. We need to establish this proportion before doing anything else.

So, the TODO is to evaluate how much of a problem performance is, and to verify where the bottleneck is. This means:

  1. Collect at least two scripts that actually suffer from unsatisfactory performance.
  2. For each script, write down the run times and their proportion, as follows.
  3. Write down the execution time of the CSharpGraph method (time_CSharpGraph).
  4. Write down the execution time of the TTLGraph method (time_TTLGraph).
  5. Calculate and write down the proportion time_CSharpGraph / time_TTLGraph.

If the proportion at step 5 is <0.5, it means that TTLGraph takes more than twice the time of CSharpGraph, i.e. the TTL translation accounts for more than half of the total time. In that case we were right: improvements are to be sought in avoiding the TTL translation. The only alternative is not using a textual format. However, that requires figuring out what other interoperable representation of the ontology we could use.
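The measurement in steps 3 to 5 could be sketched as follows. This is Python rather than the actual C# components, and `build_graph` and `build_ttl` are hypothetical stand-ins for CSharpGraph and TTLGraph; only the timing and the ratio arithmetic are the point.

```python
import time


def timed(fn, *args):
    # Returns (result, elapsed seconds) for a single call.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def proportion(time_csharp_graph, time_ttl_graph):
    # Step 5: time_CSharpGraph / time_TTLGraph.
    if time_ttl_graph <= 0:
        raise ValueError("time_ttl_graph must be positive")
    return time_csharp_graph / time_ttl_graph


def ttl_share(time_csharp_graph, time_ttl_graph):
    # Fraction of the TTLGraph run spent on the TTL translation itself,
    # given that TTLGraph first builds a CSharpGraph internally.
    return 1.0 - proportion(time_csharp_graph, time_ttl_graph)


def build_graph(objects):
    # Hypothetical stand-in for CSharpGraph: pairs each object with its type name.
    return [(obj, type(obj).__name__) for obj in objects]


def build_ttl(objects):
    # Hypothetical stand-in for TTLGraph: builds the graph, then serialises it.
    graph = build_graph(objects)
    return "".join(f'"{obj}" a "{kind}" .\n' for obj, kind in graph)


if __name__ == "__main__":
    data = list(range(200_000))
    _, t_graph = timed(build_graph, data)
    _, t_ttl = timed(build_ttl, data)
    print(f"proportion = {proportion(t_graph, t_ttl):.2f}")
    print(f"TTL share  = {ttl_share(t_graph, t_ttl):.0%}")
```

With these definitions, the 80% guess above corresponds to a proportion of 0.2, and the 0.5 threshold corresponds to the translation taking exactly half of the TTLGraph run. Timings from a single run are noisy, so repeating each measurement a few times per script would be sensible.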

The only alternative I can think of is directly exporting the objects in the CSharpGraph to a graph database. The CSharpGraph should be converted to, e.g., Neo4j, or even directly to GraphDB, if an API is available. From this comparison it seems that some C# API is indeed available for GraphDB, so maybe we could simply develop a direct connection to a GraphDB database.
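As a rough idea of what a direct connection might look like, the sketch below builds (but does not send) a SPARQL `INSERT DATA` request against the RDF4J-style REST endpoint that GraphDB exposes. Several things are assumptions: the base URL `http://localhost:7200`, the repository name `bhom`, the helper name `build_insert_request`, and the restriction to IRI-only triples. It is written in Python for brevity, not with the C# API mentioned above.

```python
import urllib.request


def build_insert_request(base_url, repository, triples):
    # Assumption: triples are (subject, predicate, object) tuples of full IRIs;
    # literals, blank nodes and escaping are not handled in this sketch.
    statements = "\n".join(f"  <{s}> <{p}> <{o}> ." for s, p, o in triples)
    body = "INSERT DATA {\n" + statements + "\n}"
    # RDF4J REST convention: SPARQL updates are POSTed to .../statements.
    url = f"{base_url}/repositories/{repository}/statements"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_insert_request(
        "http://localhost:7200",  # assumed local GraphDB instance
        "bhom",                   # hypothetical repository name
        [("http://example.org/beam1",
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
          "http://example.org/Beam")],
    )
    print(request.full_url)
    # urllib.request.urlopen(request) would send it to a running server.
```

Note that this route still serialises the triples into a SPARQL string, so it would need to be timed in the same way as TTLGraph before concluding that it is faster.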