CopperEagle / SmartFileLibrary

SmartFileLibrary is an AI-supported digital library, backed by a local database on PostgreSQL and an experimental web interface.
MIT License

SmartFileLibrary


This project, which is still in development, addresses a common problem: local folders full of PDFs, code snippets and datasets whose filenames (DOI numbers, for example) say little about their contents.

SmartFileLibrary is a digital library backed by a local PostgreSQL database. The project's goal is to insert documents from local directories into the library semiautomatically. This includes LLM-based analysis of documents to infer metadata and supply keywords. For scientific papers, semiautomatic insertion can be augmented with the API of the Crossref metadata database.

To strengthen privacy, the project strives to always offer local compute options.

Screenshot of the experimental webinterface

Features

The project is ongoing. Some features are still in development.

Currently available

Ongoing work

Installing

The project is written for Linux and Python 3.10+, and has a number of additional requirements.

Optionally, you may set up a virtual environment.

python3 -m venv path/to/new/venv
cd path/to/new/venv
source bin/activate

Then download the code, navigate into that directory and run

pip3 install -r requirements.txt
pip3 install .

Basic Use

Setup

First, you need to set up a PostgreSQL user account and a database. You likely do not want to use the default database, which is named after the user account.
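As a rough sketch of that setup step, the snippet below only prints the two SQL statements that create a login role and a separate database owned by it. The names "libuser" and "smartlib" are placeholders chosen for illustration, not project defaults, and the helper functions are hypothetical and not part of smartfilelibrary.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_setup_sql(user: str, dbname: str) -> list[str]:
    """Return statements that create a login role and a database it owns."""
    return [
        f"CREATE ROLE {quote_ident(user)} LOGIN;",
        f"CREATE DATABASE {quote_ident(dbname)} OWNER {quote_ident(user)};",
    ]


if __name__ == "__main__":
    # "libuser" and "smartlib" are placeholder names (assumptions).
    for statement in build_setup_sql("libuser", "smartlib"):
        print(statement)
```

The printed statements can be run by a PostgreSQL superuser, for example inside psql. The password is deliberately left out of the SQL; it can be set interactively afterwards with psql's \password command.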

Manual insertion

The following demonstrates the fully manual insertion of a publisher and a book into the DB. Note that all actions are logged to a file called locallog.txt. This allows you to clear the DB later on and replay your previous actions.

from smartfilelibrary import DatabaseInterface

# Enter credentials to local DB.
db = DatabaseInterface("dbname", "user", "password")
# remove any previous tables and insertions
# Good practice to reset any counters.
db.cleardb()

## If you clear, you may want to replay previous actions:
# db.executefile("locallog.txt")

# Inserts a number of standard values.
# Will likely throw an error if you executed from the log file before,
# given that the file has also recorded the standardsetup.
# TLDR: Either standardsetup or executefile, not both.
db.standardsetup()

# Add publisher, returns ID, required for adding books
apress = db.addpublisher("Apress")

# Add topics and subtopics
db.subtopic('Data Science', 'Database')
db.subtopic('Database', 'SQL')

# Add book and a file corresponding to this book (there may be many files per book)
sqlbook = db.addbook('Expert Performance Indexing in Azure SQL and SQL Server 2022', 
    2023, apress, 'book', ('SQL', ))
db.addfile(sqlbook, "path/to/book1.pdf", 300, "First Half")
db.addfile(sqlbook, "path/to/book2.pdf", 349, "Second Half")

# Commit all changes
# db.commit_transaction()

# Commit all changes and close connection
db.finish()

The above registers a publisher, then a book by giving its title, publication year, publisher, form and keywords. A book consists of one or several files; each one is added with the book ID, a path, the number of pages and, as in the example, a short description.
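The log-and-replay idea behind locallog.txt can be illustrated with a small self-contained sketch. It is not the project's implementation: it uses sqlite3 from the standard library instead of PostgreSQL, the LoggedDB class is hypothetical, and it assumes the log simply stores one SQL statement per line (the README does not specify the actual log format).

```python
import sqlite3
import tempfile
from pathlib import Path

SCHEMA = "CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"


class LoggedDB:
    """Toy wrapper that appends every insert statement to a log file."""

    def __init__(self, logfile: Path):
        self.logfile = logfile
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(SCHEMA)

    def add_publisher(self, name: str) -> int:
        # Escape single quotes so the logged statement stays valid on replay.
        safe = name.replace("'", "''")
        statement = f"INSERT INTO publisher (name) VALUES ('{safe}');"
        cursor = self.conn.execute(statement)
        with self.logfile.open("a") as log:
            log.write(statement + "\n")
        return cursor.lastrowid

    def clear(self) -> None:
        # Dropping and recreating the table also resets the ID counter.
        self.conn.execute("DROP TABLE publisher")
        self.conn.execute(SCHEMA)

    def replay(self) -> None:
        # Re-run the logged statements without logging them a second time.
        self.conn.executescript(self.logfile.read_text())

    def publishers(self) -> list[tuple[int, str]]:
        query = "SELECT id, name FROM publisher ORDER BY id"
        return self.conn.execute(query).fetchall()


def demo() -> tuple[list, list]:
    """Insert two rows, clear the table, replay the log, compare contents."""
    with tempfile.TemporaryDirectory() as tmp:
        db = LoggedDB(Path(tmp) / "locallog.txt")
        db.add_publisher("Apress")
        db.add_publisher("O'Reilly")
        before = db.publishers()
        db.clear()
        db.replay()
        return before, db.publishers()


if __name__ == "__main__":
    before, after = demo()
    print(before == after)
```

Because the replay writes nothing back to the log, running it repeatedly after a clear always rebuilds the same state. It also hints at why combining a replay with a fresh standard setup can collide: the same rows would be inserted twice.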

Now this all seems pretty boring to do, right? We may want to speed this process up a notch. The project is still in the early stages of doing so.

Semiautomated process

The tutorial for the semiautomated process was moved into its own file here.
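Independently of that tutorial, the kind of Crossref lookup mentioned in the introduction can be sketched with the standard library alone. This is an illustration, not the project's code: summarize_work and fetch_work are hypothetical helpers, and the selection of fields is an assumption about what a library entry might need. The sketch relies on the public Crossref REST endpoint https://api.crossref.org/works/{DOI}, which returns the record under the "message" key.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def summarize_work(message: dict) -> dict:
    """Pick a few fields out of the "message" object of a Crossref record."""
    titles = message.get("title") or []
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    return {
        "title": titles[0] if titles else None,
        "year": date_parts[0][0] if date_parts[0] else None,
        "publisher": message.get("publisher"),
        "authors": authors,
    }


def fetch_work(doi: str, timeout: float = 10.0) -> dict:
    """Query Crossref for one DOI and return the summarized metadata."""
    url = CROSSREF_WORKS_URL + urllib.parse.quote(doi, safe="/")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return summarize_work(payload["message"])
```

Only fetch_work touches the network; summarize_work works on a plain dictionary, so the field extraction can be tried offline on a saved response. A file whose name is a DOI could then be passed to fetch_work to obtain a title, year and publisher for manual review before insertion.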

Webinterface

Webinterface support is still experimental and in active development. Running the command below prompts for the database user's password.

python3 -m smartfilelibrary db_name user_name 

Then, you can open the webinterface. It has mainly been tested in Chrome and Firefox.

The DB

The DB layout can be checked in setup.sql. It is in third normal form. It contains the following "objects":

It also contains the corresponding relations between them:

The other relations, such as form_book, are either one-to-one or one-to-many and have therefore been folded into the object tables.

Limitations

Attribution

The icon is by Afif Fudin