DeanLight / spannerlib

https://deanlight.github.io/spannerlib/
Apache License 2.0

Welcome to Spannerlib

Welcome to the spannerlib project.

Spannerlib is a framework for building programming languages that combine imperative and declarative programming. The combination is based on derivations of the document spanner model.

Currently, we implement a language called spannerlog on top of Python. spannerlog is an extension of statically typed Datalog that lets users define their own IE (information extraction) functions, which can be used to derive new structured information from relations.

The spannerlog REPL, shown below, is served through Jupyter magic commands.

Below, we show how to install and use spannerlog through Spannerlib.

For more comprehensive walkthroughs, see our tutorials section.

Installation

Unix

To download and install spannerlib, run the following commands in your terminal:

git clone https://github.com/DeanLight/spannerlib
cd spannerlib

pip install -e .

download CoreNLP to spannerlib/rgxlog/ from this link

# verify everything worked
# first time might take a couple of minutes since run time assets are being configured
python nbdev_test.py

Docker

git clone https://github.com/DeanLight/spannerlib
cd spannerlib

download CoreNLP to spannerlib/rgxlog/ from this link

docker build . -t spannerlib_image

# on Windows, replace `pwd` with the current working directory
# to get a bash terminal to the container
docker run --name swc --rm -it \
  -v `pwd`:/spannerlib:Z \
  spannerlib_image bash

# to run an interactive notebook on host port 8891
docker run --name swc --rm -it \
  -v `pwd`:/spannerlib:Z \
  -p8891:8888 \
  spannerlib_image jupyter notebook --no-browser --allow-root

# verify tests inside the container
python /spannerlib/nbdev_test.py

Getting started - TLDR

Here is a TLDR intro. For a more comprehensive tutorial, please see the introduction section of the tutorials.

import spannerlib
import pandas as pd
# get dynamic access to the session running through the jupyter magic system
from spannerlib import get_magic_session
session = get_magic_session()

Get a dataframe

lecturer_df = pd.DataFrame(
    [["walter","chemistry"],
     ["linus", "operating_systems"],
     ['rick', 'physics']
    ],columns=["name","course"])
lecturer_df
     name             course
0  walter          chemistry
1   linus  operating_systems
2    rick            physics

Or a CSV

pd.read_csv('sample_data/example_students.csv',names=["name","course"])
      name             course
0  abigail          chemistry
1  abigail  operation systems
2   jordan          chemistry
3     gale  operation systems
4   howard          chemistry
5   howard            physics

Import them to the session

session.import_rel("lecturer",lecturer_df)
session.import_rel("enrolled","sample_data/enrolled.csv",delim=",")

They can even be documents

documents = pd.DataFrame([
    ["abigail is happy, but walter did not approve"],
    ["howard is happy, gale is happy, but jordan is sad"]
])
session.import_rel("documents",documents)
%%spannerlog
?documents(X)
'?documents(X)'
X
abigail is happy, but walter did not approve
howard is happy, gale is happy, but jordan is sad

Define your own IE functions to extract information from relations

# the function itself; writing it as a Python generator makes your data processing lazy
def get_happy(text):
    """
    get the names of people who are happy in `text`
    """
    import re

    # raw string so that \w reaches the regex engine unchanged
    compiled_rgx = re.compile(r"(\w+) is happy")
    num_groups = compiled_rgx.groups
    for match in compiled_rgx.finditer(text):
        if num_groups == 0:
            # no capture groups: return the whole match
            matched_strings = [match.group()]
        else:
            # one output column per capture group
            matched_strings = list(match.groups())
        yield matched_strings

# register the ie function with the session
session.register(
    "get_happy", # name of the function
    get_happy, # the function itself
    [str], # input types
    [str] # output types
)
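Because an IE function is an ordinary Python generator, it can be checked on its own before it is registered. The sketch below repeats `get_happy` in a compact form so that it runs standalone, outside any session, and applies it to the two example documents. Each yielded list corresponds to one output tuple.

```python
import re

def get_happy(text):
    """Yield one single-element list per person described as happy in `text`."""
    for match in re.finditer(r"(\w+) is happy", text):
        yield list(match.groups())

docs = [
    "abigail is happy, but walter did not approve",
    "howard is happy, gale is happy, but jordan is sad",
]

# Calling the function only creates a generator; nothing is matched yet.
lazy_result = get_happy(docs[0])

# Consuming the generators performs the actual extraction.
extracted = [list(get_happy(doc)) for doc in docs]
print(extracted)
```

Testing the function this way keeps regex mistakes separate from problems in the spannerlog program that later calls it.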

spannerlog supports relations over the following primitive types:

* strings
* spans
* integers
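A span can be thought of as a half-open character interval into a document. The sketch below uses a hypothetical `Span` named tuple (an assumption for illustration, not spannerlib's own span type, whose interface is not shown here) together with Python's `re` module to show how a span differs from the string it covers.

```python
import re
from typing import NamedTuple

class Span(NamedTuple):
    """Hypothetical stand-in for a span: the interval [start, end) of a text."""
    start: int
    end: int

    def extract(self, text: str) -> str:
        # Recover the substring that the span points to.
        return text[self.start:self.end]

text = "howard is happy, gale is happy, but jordan is sad"

# One span per captured name; the span records where the name occurs.
spans = [Span(*m.span(1)) for m in re.finditer(r"(\w+) is happy", text)]
names = [s.extract(text) for s in spans]
print(spans)
print(names)
```

Keeping the position rather than only the text is what lets two occurrences of the same word in one document remain distinct values.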

Write a spannerlog program (like Datalog, but you can use your own IE functions)

session.remove_all_rules()
%%spannerlog
# you can also define data inline via a statically typed variant of datalog syntax
new sad_lecturers(str)
sad_lecturers("walter")
sad_lecturers("linus")

# and define primitive variables
gpa_doc = "abigail 100 jordan 80 gale 79 howard 60"

# define datalog rules
enrolled_in_chemistry(X) <- enrolled(X, "chemistry").
enrolled_in_physics_and_chemistry(X) <- enrolled_in_chemistry(X), enrolled(X, "physics").

# and query them inline (to print to screen)
# ?enrolled_in_chemistry("jordan") # returns empty tuple ()
# ?enrolled_in_chemistry("gale") # returns nothing
# ?enrolled_in_chemistry(X) # returns "abigail", "jordan" and "howard"
# ?enrolled_in_physics_and_chemistry(X) # returns "howard"

lecturer_of(X,Z) <- lecturer(X,Y), enrolled(Z,Y).

# use ie functions in body clauses to extract structured data from unstructured data

# standard ie functions like regex are already registered
student_gpas(Student, Grade) <- 
    rgx("(\w+).*?(\d+)",$gpa_doc)->(StudentSpan, GradeSpan),
    as_str(StudentSpan)->(Student), as_str(GradeSpan)->(Grade).

# and you can use your defined functions as well
happy_students_with_sad_lecturers_and_their_gpas(Student, Grade, Lecturer) <-
    documents(Doc),
    get_happy(Doc)->(Student),
    sad_lecturers(Lecturer),
    lecturer_of(Lecturer,Student),
    student_gpas(Student, Grade).
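For readers new to Datalog, a rule body with shared variables behaves like a relational join. As a rough plain-Python sketch (not how spannerlib evaluates rules internally), the rule `lecturer_of(X,Z) <- lecturer(X,Y), enrolled(Z,Y).` can be read as follows. The `enrolled` tuples here are illustrative stand-ins, not the contents of the sample CSV.

```python
# Relations are modelled as sets of tuples.
lecturer = {
    ("walter", "chemistry"),
    ("linus", "operating_systems"),
    ("rick", "physics"),
}
# Illustrative data only (assumption): not read from sample_data.
enrolled = {
    ("abigail", "chemistry"),
    ("gale", "operating_systems"),
    ("howard", "physics"),
}

# lecturer_of(X,Z) <- lecturer(X,Y), enrolled(Z,Y).
# The shared variable Y forces both body atoms to agree on the course.
lecturer_of = {
    (x, z)
    for (x, y) in lecturer
    for (z, y2) in enrolled
    if y == y2
}
print(sorted(lecturer_of))
```

IE functions in a rule body fit the same reading: each call contributes extra columns computed from the variables bound so far.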

And query it

%%spannerlog
?happy_students_with_sad_lecturers_and_their_gpas(Stu,Gpa,Lec)
'?happy_students_with_sad_lecturers_and_their_gpas(Stu,Gpa,Lec)'
Stu      Gpa  Lec
abigail  100  linus
gale     79   linus
howard   60   walter

You can also get query results as DataFrames for downstream processing

df = session.export(
    "?happy_students_with_sad_lecturers_and_their_gpas(Stu,Gpa,Lec)")
df
       Stu  Gpa     Lec
0  abigail  100   linus
1     gale   79   linus
2   howard   60  walter

Additional Resources

Relevant papers