DerwenAI / kglab

Graph Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, NetworkX, RAPIDS, RDFlib, pySHACL, PyVis, morph-kgc, pslpython, pyarrow, etc.
https://derwen.ai/docs/kgl/
MIT License

NetworkX shape passed value error and how can I help? #201

Closed fils closed 2 years ago

fils commented 3 years ago

@ceteri Paco, I have some time to spend working with some schema.org based data from Hydroshare, and I'd like to use kglab to explore it. I'm having issues applying it, and I hope that in resolving them I might be able to help somehow with the docs and such.

Hopefully this isn't just me being stupid in graph space, but is of some help back to the project. Happy to share.

Working with the same data from Issue 24 and now trying the NetworkX area I got a specific error.

So this code:

import kglab
import networkx as nx

sparql3 = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?subject ?object
WHERE { 
  ?subject a <https://schema.org/Dataset> .
  ?subject <https://schema.org/creator> ?creator .
  ?creator rdf:first ?o .
  ?o <https://schema.org/name> ?object
}
  """

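# kg is the kglab.KnowledgeGraph already loaded with the Hydroshare data
subgraph = kglab.SubgraphMatrix(kg, sparql3)
# build a bipartite directed graph from the query's subject/object pairs
nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)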

results in this error

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3293779/2456468873.py in <module>
     13 
     14 subgraph = kglab.SubgraphMatrix(kg, sparql3)
---> 15 nx_graph = subgraph.build_nx_graph(nx.DiGraph(), bipartite=True)

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_nx_graph(self, nx_graph, bipartite)
    250         """
    251         if self.kg.use_gpus:
--> 252             df = self.build_df()
    253             nx_graph.from_cudf_edgelist(df, source="src", destination="dst")
    254         else:

~/.conda/envs/kglab/lib/python3.8/site-packages/kglab/subg.py in build_df(self, show_symbols)
    223 
    224         if self.kg.use_gpus:
--> 225             df = cudf.DataFrame(rows_list, columns=col_names)
    226         else:
    227             df = pd.DataFrame(rows_list, columns=col_names)

~/.conda/envs/kglab/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in __init__(self, data, index, columns, dtype)
    257                     )
    258                 else:
--> 259                     self._init_from_list_like(
    260                         data, index=index, columns=columns
    261                     )

~/.conda/envs/kglab/lib/python3.8/site-packages/cudf/core/dataframe.py in _init_from_list_like(self, data, index, columns)
    397         if columns is not None:
    398             if len(columns) != len(data):
--> 399                 raise ValueError(
    400                     f"Shape of passed values is ({len(index)}, {len(data)}), "
    401                     f"indices imply ({len(index)}, {len(columns)})."

ValueError: Shape of passed values is (5293, 5293), indices imply (5293, 2).

The results of that SPARQL on the graph should be like:

subject,object
https://www.hydroshare.org/resource/aefabd0a6d7d47ebaa32e2fb293c9f8a#schemaorg,Courtney G Flint
https://www.hydroshare.org/resource/f94ac7f8d8a048cdbd2610dfa7cd315b#schemaorg,Zhiyu (Drew) Li
https://www.hydroshare.org/resource/f9a75c0b289649aa844e84c24f9f5780#schemaorg,Young-Don Choi
https://www.hydroshare.org/resource/173875a936f14c22a5ba19c721adfb86#schemaorg,Remi Dupas
https://www.hydroshare.org/resource/f1116211202a4c069919797272023e62#schemaorg,Nathan Swain
https://www.hydroshare.org/resource/6d80e4bd00244b5dabaff34074cd3102#schemaorg,Garrick Stephenson
https://www.hydroshare.org/resource/25133b13a1fc4fca9187c2d4e272d4e8#schemaorg,Jessie Myers
https://www.hydroshare.org/resource/ca0f2f0f28ba40018ae64b973e2bb35a#schemaorg,Ruth B. MacNeille
https://doi.org/10.4211/hs.88454dae8c604009b684bfa136e5f7f4#schemaorg,Celray James CHAWANDA
https://doi.org/10.4211/hs.1c6034be6886412ba59970ab1157fa7e#schemaorg,Bethany Neilson
... and so on, for 5293 lines.

Mec-iS commented 2 years ago

From the error message I can see you are running on GPUs. Do you get the same error running on plain CPUs? It looks like it is complaining about the shape of the dataframe; I will look into it.
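
For readers decoding the message: the guard shown in the traceback compares the number of column names with `len(data)` and prints both numbers. The sketch below re-creates only that guard in plain Python. It is not cudf's code: `check_columns` and `shape_error` are hypothetical names, and the column names "src" and "dst" are taken from the `from_cudf_edgelist` call visible in the traceback. It shows that the failing comparison was 2 column names against a count equal to the number of rows.

```python
def check_columns(n_rows, n_data, columns):
    """Mimic the guard shown in the traceback (an illustration, not cudf itself).

    n_rows  -- stands in for len(index)
    n_data  -- stands in for len(data) at the point of the check
    columns -- the column names passed to the DataFrame constructor
    """
    if len(columns) != n_data:
        raise ValueError(
            f"Shape of passed values is ({n_rows}, {n_data}), "
            f"indices imply ({n_rows}, {len(columns)})."
        )


def shape_error(n_rows, n_data, columns):
    """Return the error message, or None when the guard passes."""
    try:
        check_columns(n_rows, n_data, columns)
    except ValueError as err:
        return str(err)
    return None


# Values from the report: 5293 result rows, two column names.
print(shape_error(5293, 5293, ["src", "dst"]))

# A consistent case for comparison: two data columns, two names.
print(shape_error(5293, 2, ["src", "dst"]))
```

Why `len(data)` ends up equal to the row count on the GPU path is not established in this thread; the sketch only shows how to read the two shapes in the message.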

Mec-iS commented 2 years ago

I have collected all your code from issue #200 and this one into this Colab notebook, also available as this GitHub gist.

If you have a Google account you can open and run it remotely (on the free tier). The code works there, and you can check whether the returned results are the ones you expect by copying and modifying the notebook. Let me know how it goes!

Please provide your local system's specs (OS, whether you are using a virtual environment, how you installed your packages) so I can understand the problem you have running on your machine. The suggested way to control which packages are installed is to use a Python virtual environment. I also find the JupyterLab desktop application really useful for managing package installation.
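
One way to gather those details in a notebook cell is sketched below, using only the standard library. `environment_report` is a hypothetical helper, and the package list is a guess at what matters for this issue. It assumes the packages expose standard distribution metadata, which may not hold for every conda build.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Collect OS, Python, and installed package versions as a dict."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # No distribution metadata found under this name.
            report[name] = "not installed"
    return report


# Package names are an assumption about what is relevant here.
packages = ["kglab", "cudf", "networkx", "pandas", "rdflib"]
for key, value in environment_report(packages).items():
    print(f"{key}: {value}")
```

Pasting the printed lines into the issue gives the OS, the Python version, and the versions of the listed packages in one place.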

fils commented 2 years ago

@Mec-iS Thanks for your help on these two items.

The SPARQL does work in Colab so I suspect this is an issue with my GPU usage as you point out.

Is there a simple way to turn off GPU leveraging in a notebook? I honestly don't know how I would disable that for a specific case.

Mec-iS commented 2 years ago

Sure, no problem.

The instantiation of the graph has an option to disable GPUs:

kg = kglab.KnowledgeGraph(
    name = "hydroshare",
    namespaces = ns,
    use_gpus=False
    )
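
If the switch needs to change per machine without editing the notebook, one option is to read it from an environment variable. This is only a sketch: `KGLAB_USE_GPUS` is a made-up variable name that kglab itself does not read, and `use_gpus_from_env` is a hypothetical helper. It defaults to off because the CPU path is the one that worked in this thread.

```python
import os


def use_gpus_from_env(var_name="KGLAB_USE_GPUS", default=False):
    """Read a boolean switch from the environment.

    KGLAB_USE_GPUS is a made-up name for this sketch; kglab does not read it.
    """
    raw = os.environ.get(var_name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


use_gpus = use_gpus_from_env()
print(use_gpus)

# Then pass the flag through, as in the snippet above:
# kg = kglab.KnowledgeGraph(name="hydroshare", namespaces=ns, use_gpus=use_gpus)
```
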

fils commented 2 years ago

@Mec-iS Thanks!

Turning off the GPU removed the error. Thanks much (now I need to resolve my GPU issue, but that can wait for another time).

I'm not well disciplined in my Python usage. I am using conda to set up my environment, and you can find it here if you want to look further into this issue (https://github.com/gleanerio/notebooks/blob/master/environment.yml), but it's a bit large (see the earlier remark about lack of discipline).

You've resolved this problem though so feel free to close. I'll post new problems in new issues.. ;)

ceteri commented 2 years ago

Hi @fils ,

Which dataset are you using? It's not the one on https://github.com/DerwenAI/kglab/issues/24, but another?

I can try to recreate the issue on my Linux laptop which has an NVIDIA GPU.

It may be that some underlying dependencies for RAPIDS have changed. They have a somewhat non-standard "release selector", which we haven't updated in several months: https://rapids.ai/start.html

Mec-iS commented 2 years ago

The dataset is the one in #200; it can be downloaded from S3.

fils commented 2 years ago

@ceteri @Mec-iS

I pushed up some of what I am working on (including the graphs) to https://github.com/gleanerio/notebooks/tree/master/Hydroshare

As noted, this should be the same graph as at the S3 (updated with prefix for schema.org).

I think the issue may just be the graph, and my way of approaching it not being the best. So I think things are working fine (apart from the GPU issue I have, which could be my install; the driver is 495.29.05 on a GTX 1050 Ti, by the way, nothing too special).

I'm trying to work up some ways people can inspect the schema.org based graphs around their datasets that come from implementing the https://github.com/ESIPFed/science-on-schema.org/ guidance. So any course corrections or guidance would be more than welcome!

Thanks for your engagement with this..

charlesvardeman commented 2 years ago

So @ceteri, I think that you are correct on the RAPIDS release selector. We have RAPIDS installed on a development node of our GPU cluster using the following selector:

conda create -n rapids-21.12 -c rapidsai -c nvidia -c conda-forge \
    cudf=21.12 cuml=21.12 cugraph=21.12 python=3.8 cudatoolkit=11.2

Running the example from the tutorial:

import kglab

namespaces = {
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    namespaces = namespaces,
    )

# the call that appears in the traceback below
kg.describe_ns()

produces a similar error message to what @fils was seeing.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_2396895/1517367763.py in <module>
----> 1 kg.describe_ns()

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/site-packages/kglab/kglab.py in describe_ns(self)
    254 
    255         if self.use_gpus:
--> 256             df = cudf.DataFrame(rows_list, columns=col_names)
    257         else:
    258             df = pd.DataFrame(rows_list, columns=col_names)

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/contextlib.py in inner(*args, **kwds)
     73         def inner(*args, **kwds):
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner
     77 

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/site-packages/cudf/core/dataframe.py in __init__(self, data, index, columns, dtype)
    610                     )
    611                 else:
--> 612                     self._init_from_list_like(
    613                         data, index=index, columns=columns
    614                     )

/opt/anaconda3/envs/rapids-21.12/lib/python3.8/site-packages/cudf/core/dataframe.py in _init_from_list_like(self, data, index, columns)
    750         if columns is not None:
    751             if len(columns) != len(data):
--> 752                 raise ValueError(
    753                     f"Shape of passed values is ({len(index)}, {len(data)}), "
    754                     f"indices imply ({len(index)}, {len(columns)})."

ValueError: Shape of passed values is (31, 31), indices imply (31, 2).

The machine details are:

Wed Feb 16 14:53:35 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94       Driver Version: 470.94       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     Off  | 00000000:00:09.0 Off |                    0 |
| N/A   18C    P8    13W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     Off  | 00000000:00:0A.0 Off |                    0 |
| N/A   21C    P8    13W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     Off  | 00000000:00:0B.0 Off |                    0 |
| N/A   21C    P8    13W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 6000     Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   20C    P8    12W / 250W |      3MiB / 22698MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The node is running Red Hat Enterprise Linux release 8.5 (Ootpa), Python 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]

Mec-iS commented 2 years ago

@charlesvardeman please move your comment to the RAPIDS related discussion or open a new issue.

I'm closing this as resolved.

ceteri commented 2 years ago

@charlesvardeman @Mec-iS @fils:

I've opened another issue, #229, specifically to track the updates we need to make to support RAPIDS.