cadmiumkitty / rdfpandas

RdfPandas is a module providing RDF support for Pandas
MIT License

NamespaceManager #21

Closed: a1012 closed this issue 3 months ago

a1012 commented 1 year ago

@cadmiumkitty Hi, I am trying to convert a pandas DataFrame (loaded from a .csv file) to an RDF graph in .ttl format. While converting, I am facing an issue: NameError: name 'NamespaceManager' is not defined

The code I am using is:

```python
from rdfpandas.graph import to_graph
import pandas as pd
import rdflib

df = pd.read_csv('/content/sample_data/NER_test.csv', keep_default_na = False)
namespace_manager = NamespaceManager(Graph())
namespace_manager.bind('skos', SKOS)
namespace_manager.bind('rdfpandas', Namespace('http://github.com/cadmiumkitty/rdfpandas/'))
g = to_graph(df, namespace_manager)
s = g.serialize(format = 'turtle')
```

The CSV file is attached: NER_test.csv

Please help me with this. I will also be using the generated file to interact with BioCypher.

cadmiumkitty commented 1 year ago

Hi @a1012,

There seem to be two issues.

First, you need to import NamespaceManager before you can use it: `from rdflib.namespace import NamespaceManager`. I may need to fix the example in the README (I don't recall why it worked; maybe NamespaceManager got moved to another package in RDFLib); I will do it in the next couple of days.

Second, the CSV that you attached won't convert to RDF with the code you shared. You need to use an @id column header to map to the subject resource identifier, and appropriate headers for the other columns to map to predicate resource identifiers. This is probably a good example: https://github.com/cadmiumkitty/anzsic-taxonomy/blob/main/anzsic.csv

Hope it helps.

a1012 commented 1 year ago

Hi @cadmiumkitty, thank you so much for helping me! I have triplets (entity, category, relationship) as DataFrame columns and am struggling to convert them into a .ttl file so that I can use it in BioCypher to create a knowledge graph. I am really new to the RDF format, so could you please explain your second point: "the CSV that you attached won't convert to RDF with the code you shared. You need to use @id column header to map to subject resource identifier and appropriate headers for other columns to map to predicate resource identifiers"?

I didn't understand the format of the shared example (https://github.com/cadmiumkitty/anzsic-taxonomy/blob/main/anzsic.csv). Could you please point me to any material for understanding the RDF format (what each identifier or term means) so that I can apply it to my use case?

I mean, I am not able to understand the structure and how to apply it to my use case, because when I pass index_col = '@id' while reading the CSV, as shown in the code below:

df = pd.read_csv('/content/sample_data/NER_test.csv',index_col = '@id', keep_default_na = False)

I usually get this error: ValueError: Index @id invalid

cadmiumkitty commented 1 year ago

Hi @a1012,

I'd start with the RDF 1.1 Concepts and Abstract Syntax document here: https://www.w3.org/TR/rdf11-concepts/

My second point is that the CSV you read into a Pandas DataFrame, in order to convert it to an RDFLib Graph and serialize it as Turtle (.ttl), should follow a particular convention. The convention is described in the documentation for the to_graph function: https://rdfpandas.readthedocs.io/en/latest/rdfpandas.html#rdfpandas.graph.to_graph

Row indices are used as subjects, and column indices as predicates (I use an @id column for the row indices and specify it when reading the CSV into a Pandas DataFrame with read_csv). Object types are inferred from the column index pattern `predicate{rdfLib Identifier instance class name}(type)[index]@language`. Index numbers simply create additional statements, as opposed to attempting to construct a new rdf:List or rdfs:Container.

The example I shared, https://github.com/cadmiumkitty/anzsic-taxonomy/blob/main/anzsic.csv, follows that convention: it has an @id column whose values become the row indices used as subjects, the other columns are used as predicates, and the values in the cells are used as objects (literals or URIs). Rdfpandas simply builds a lot of subject-predicate-object triples from the DataFrame.
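As a rough illustration of that convention on the Pandas side only, here is a small sketch. The CSV content is made up for this example (it is not taken from anzsic.csv), the example.org URIs and the skos predicates are assumptions, and the `missing_id_error` variable is just a helper name for the demonstration. It also shows where the "Index @id invalid" error comes from: read_csv raises it when the file has no column named @id.

```python
import io
import pandas as pd

# Hypothetical CSV: an '@id' column holding subject URIs, and predicate headers
# that encode the object type ({Literal} or {URIRef}) and an optional language tag.
csv_text = """@id,skos:prefLabel{Literal}@en,skos:broader{URIRef}
http://example.org/aspirin,Aspirin,http://example.org/drug
http://example.org/ibuprofen,Ibuprofen,http://example.org/drug
"""

# Reading with index_col='@id' turns the '@id' values into the row index (subjects).
df = pd.read_csv(io.StringIO(csv_text), index_col='@id', keep_default_na=False)
print(df.index.tolist())
print(df.columns.tolist())

# A file without an '@id' column cannot be indexed that way and raises ValueError.
missing_id_error = None
try:
    pd.read_csv(io.StringIO("entity,category\nAspirin,Drug\n"), index_col='@id')
except ValueError as err:
    missing_id_error = err
    print('ValueError:', err)
```

A DataFrame shaped like `df` is what gets passed to to_graph together with the namespace manager, as in the snippet from the original question; any prefix used in the headers (skos here) needs to be bound in that namespace manager.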