OpenPaperView

An OpenPaper.work mobile companion.

Download the latest APK from the Releases section.

Disclaimer 1 : This Android application only works with OpenPaper.work. It also requires a lot of setup. If you don't want to spend hours preparing a server (only to be disappointed because the application is not what you expected), you can use the demo mode of the application (and be disappointed right away).

Disclaimer 2 : This is a very niche project. You may well be the first to try to understand the following instructions. Please report errors, omissions, inaccuracies.

The whole system consists of 4 parts :

The OpenPaper.work installation : provides the documents and the OCR.
A Python script : builds an SQLite database used by the viewer.
An HTTPS server : queried by the viewer to get the DB and the documents.
The viewer Android app.

The basic idea is to build an SQLite database from the data collected by Paperwork and serve that database (and the actual scans) to the viewer over HTTPS.

Features

Filter on Paperwork labels (inclusive or exclusive).
Offline full text search.
Documents can be downloaded, so they are available offline. Downloaded documents are stored in the internal storage of the application.
Automatic download of documents selected by labels.
Any static HTTP server can be used, as long as it supports client authentication with certificates and cache control.
Documents can be given a title (through Paperwork's extra keywords feature).
Material Design 3. This ensures that every screen looks absolutely stunning despite my limited UI design skills.

Limitations

Only tested on Linux (Fedora for Paperwork, Debian for NGINX).
Everything is read-only: the viewer's function is to search and retrieve papers. No edits are possible, and there are no plans to add any.
Image scans can be viewed online, but PDFs must be downloaded first.
Modifications on PDFs (done with Paperwork) are ignored. Only the original PDF is used.
The internal PDF viewer is quite crude, e.g. there is no re-rendering when the zoom level changes. You can alternatively open the PDF with an external app.

Installation

An OpenPaper.work installation

This is probably the easiest part. You need to locate :

The papers content directory (the directory containing a lot of YYYYMMDD_HHMM_NN directories)
The database containing the result of the OCR. It is named doc_tracking.db and is located inside the Paperwork work directory.
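
To check that you have found the right content directory, a small script can list the sub-directories that follow the naming scheme. This is only a sketch: the function name `list_paper_dirs` is made up for the example, and treating `NN` as exactly two digits is an assumption.

```python
import re
from pathlib import Path

# Naming scheme described above: YYYYMMDD_HHMM_NN
# (two digits for NN is an assumption)
PAPER_DIR_PATTERN = re.compile(r"^\d{8}_\d{4}_\d{2}$")


def list_paper_dirs(content_dir):
    """Return the sorted names of sub-directories that look like documents."""
    root = Path(content_dir)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and PAPER_DIR_PATTERN.match(entry.name)
    )
```

If the returned list is empty, the path probably points somewhere other than the papers content directory.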

The Python script

The tools/create_viewer_cb.py script must be executed periodically. It scans the papers directory, adds the OCRed text from the Paperwork database and creates an SQLite database. It would be nice to integrate it into OpenPaper.work. If you have the required skills, please help with this feature request.

The only dependency I remember is PyPDF2 1.x (Fedora package python3-PyPDF2).

Parameters, such as the input and output paths, are defined in create_viewer_cb.config.

By default the full text of the documents is indexed and stored. The index is used for full text search, the text itself is used to show search result snippets.
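
To illustrate how an index and stored text work together, here is a minimal sketch using SQLite's FTS5 extension from Python. The table and column names (`doc_text`, `doc_id`, `body`) are hypothetical, the real schema is whatever the script creates, and FTS5 must be compiled into your SQLite build.

```python
import sqlite3


def build_demo_index(documents):
    """Create an in-memory FTS5 table from (doc_id, text) pairs."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, for illustration only
    conn.execute(
        "CREATE VIRTUAL TABLE doc_text USING fts5(doc_id UNINDEXED, body)"
    )
    conn.executemany(
        "INSERT INTO doc_text (doc_id, body) VALUES (?, ?)", documents
    )
    return conn


def search(conn, query):
    """Return (doc_id, snippet) pairs for the documents matching the query."""
    rows = conn.execute(
        # snippet() needs the stored text; the MATCH itself uses the index
        "SELECT doc_id, snippet(doc_text, 1, '[', ']', '...', 8) "
        "FROM doc_text WHERE doc_text MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()
```

In this sketch the snippet is built from the stored `body` column, which is why dropping the text to save space would also remove the search result snippets.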

In my case, each document increases the size of the database by about 10 kB :

A few hundred bytes for the basic data
3 kB for the text index
about 7 kB for the full text
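
These figures allow a rough estimate of the database size. The sketch below only does the arithmetic: the constants are the approximate numbers above (with "a few hundred bytes" taken as 0.3 kB, an assumption), and the `with_index` / `with_text` switches are hypothetical names that may not map one-to-one to the options of the config file.

```python
# Approximate per-document costs, in kB (figures from my own installation)
BASIC_KB = 0.3  # "a few hundred bytes", rounded: an assumption
INDEX_KB = 3.0
TEXT_KB = 7.0


def estimate_db_size_kb(n_docs, with_index=True, with_text=True):
    """Rough database size in kB for n_docs documents."""
    per_doc = BASIC_KB
    if with_index:
        per_doc += INDEX_KB
    if with_text:
        per_doc += TEXT_KB
    return n_docs * per_doc
```

With these numbers, 1000 fully indexed documents come to roughly 10 MB, and about a third of that if the full text is left out.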

To keep the DB small, it is possible to omit documents, partially or completely. See the labels section of the config file for more details.

An HTTPS server

The server sends the document data and the SQLite file to the viewer.

It should support :

Client authentication with certificate. You don't want your documents to be publicly accessible.
Cache control. The viewer periodically checks if a new DB is available.
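
How the viewer performs this check is not detailed here. One common mechanism, shown as a sketch and not as the viewer's actual code, is a conditional GET: the client sends back the `ETag` or `Last-Modified` value it received earlier, and the server answers `304 Not Modified` if nothing changed. The function names are made up for the example.

```python
import urllib.request


def build_db_check_request(db_url, etag=None, last_modified=None):
    """Build a conditional GET for the database file."""
    request = urllib.request.Request(db_url, method="GET")
    if etag:
        # Validator returned by the server with the previous download
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def needs_download(status_code):
    """200 carries a new file; 304 means the cached copy is still current."""
    return status_code == 200
```

A static server that sends `ETag` or `Last-Modified` headers for the files it serves is enough for this kind of check.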

It needs access to the papers content and to the database built by the script. You may need to adjust the access rights of the files generated by Paperwork.

Certificate creation

Server authentication is quite standard, and is not covered here.

Client authentication is less common; I used OpenSSL to create the necessary files.

I am not, by far, an OpenSSL expert. Please report mistakes, inaccuracies or bad practices.

The basic idea is to create an authority and use it to sign certificates. The authority's certificate will then be installed on the server, while a signed certificate (with corresponding private key) will be imported into the viewer.

Create the CA's private key. This should be kept in a secure place.
```
openssl genrsa -out ca_private.key 4096
```
Create the CA's (self signed) certificate. This is the file to be installed on the server.
```
openssl req -new -x509 -days 3660 -key ca_private.key -out ca.crt
```

Then for each client :

Create the private key :
```
openssl genrsa -out client_private.key 4096
```

Create a certificate request. You will be asked for a Common Name, it can be anything as long as it is not empty :
```
openssl req -new -key client_private.key -out client_request.csr
```
Sign the client's request with the CA's key, creating a certificate valid for 10 years. The serial number should be different for each certificate :
```
openssl x509 -req -days 3650 -in client_request.csr -CA ca.crt -CAkey ca_private.key -set_serial 1 -out client.crt
```
Create the PEM file to be imported in the viewer app :
```
cat client.crt client_private.key >client_full.pem
```

Configuration

A sample configuration for NGINX :

```
server {
    # compress the sqlite DB file
    gzip on;
    gzip_types application/octet-stream;

    # SSL configuration
    listen 443 ssl http2 default_server;
    listen [::]:443 ssl http2 default_server;

    # Server authentication :
    # The server's certificate and private key
    ssl_certificate certs/server.crt;
    ssl_certificate_key private/server.key;

    # Client authentication :
    # The CA signing the client's certificate
    ssl_client_certificate certs/ca.crt;

    # make verification optional, so we can display a 403 message to those
    # who fail authentication
    ssl_verify_client optional;

    root /var/www/;

    index index.html index.htm;

    server_name _;

    location / {
        deny all;
    }

    # this is the viewer's base URL
    # where it expects to find :
    # papers.sqlite
    # papers/
    location /papers_base_dir/ {
        # if the client-side certificate failed to authenticate, show a 403
        # message to the client
        if ($ssl_client_verify != SUCCESS) {
            return 403;
        }

        try_files $uri =404;
    }
}
```

OpenPaperView settings

Base URL

The viewer downloads the SQLite DB, the document images and the PDFs. For example, if the base URL is https://example.com/paperwork/base, the viewer will query :

The database :

https://example.com/paperwork/base/papers.sqlite

The document thumbnails, images and PDFs :

https://example.com/paperwork/base/papers/some_paper_id/doc.pdf
https://example.com/paperwork/base/papers/some_paper_id/paper.1.jpg
https://example.com/paperwork/base/papers/some_paper_id/paper.1.thumb.jpg
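
The layout above can be summarized with a small helper, handy to test the server by hand. It is a sketch, not the viewer's code: `viewer_urls` is a hypothetical name, and I assume that the number in `paper.1.jpg` is the page number.

```python
def viewer_urls(base_url, paper_id, page=1):
    """Build the URLs queried for one document, following the layout above."""
    base = base_url.rstrip("/")
    doc_root = f"{base}/papers/{paper_id}"
    return {
        "database": f"{base}/papers.sqlite",
        "pdf": f"{doc_root}/doc.pdf",
        # Assumption: the number in the file name is the page number
        "image": f"{doc_root}/paper.{page}.jpg",
        "thumbnail": f"{doc_root}/paper.{page}.thumb.jpg",
    }
```

For example, `viewer_urls("https://example.com/paperwork/base/", "some_paper_id")["pdf"]` gives `https://example.com/paperwork/base/papers/some_paper_id/doc.pdf`.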

Auto download labels

Every time the database is updated, the documents having one of the labels will be downloaded. If manually deleted, they will be re-downloaded with the next update.
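
The selection rule can be sketched as follows. This is an illustration of the behaviour described above, not the app's code; the function name and the data structures (a mapping from document id to labels, a set of already downloaded ids) are assumptions.

```python
def select_auto_downloads(documents, auto_labels, downloaded_ids):
    """Return the ids of the documents to fetch after a database update.

    documents: mapping of document id -> iterable of labels
    auto_labels: labels configured for automatic download
    downloaded_ids: ids of the documents already stored on the device
    """
    wanted = set(auto_labels)
    return sorted(
        doc_id
        for doc_id, labels in documents.items()
        # at least one matching label, and not on the device yet
        if wanted & set(labels) and doc_id not in downloaded_ids
    )
```

A document deleted by hand is no longer in `downloaded_ids`, so it is selected again at the next update, which matches the re-download behaviour.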

Authentication

Authentication relies on HTTPS with mutual authentication: the server and the viewer each present a certificate.

To authenticate itself on the server, the viewer needs a certificate and the corresponding private key. It expects a PEM file containing exactly one certificate and one private key. This typically looks like :

```
some optional description
-----BEGIN CERTIFICATE-----
Base64 encoded content
-----END CERTIFICATE-----

-----BEGIN PRIVATE KEY-----
more Base64 content
-----END PRIVATE KEY-----
```
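
Before importing the file, you can check that it has the expected shape. The sketch below only counts PEM blocks and does not validate their content. Accepting any label ending in `PRIVATE KEY` (for instance `RSA PRIVATE KEY`, which some OpenSSL versions produce) is an assumption about what the viewer tolerates.

```python
import re

# A PEM block: BEGIN <label> ... END <same label>
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----.*?-----END \1-----", re.DOTALL
)


def check_client_pem(pem_text):
    """Return True if the text holds exactly one certificate and one key."""
    labels = PEM_BLOCK.findall(pem_text)
    certificates = [label for label in labels if label == "CERTIFICATE"]
    # Assumption: any "... PRIVATE KEY" label counts as a private key
    private_keys = [label for label in labels if label.endswith("PRIVATE KEY")]
    return len(certificates) == 1 and len(private_keys) == 1
```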

You can optionally add a certificate for a custom certification authority. This is used to authenticate the server and is only necessary if the server's certificate is not signed by a well-known CA.
If provided, it will be the only CA trusted by the viewer. If not, Android's system (built-in) CAs will be trusted.
In any case, Android's user CAs (i.e. manually imported on the device) are not trusted.
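
To test the server from a desktop machine with a similar trust policy, something like the following Python sketch can be used. It is not the Android implementation: `build_client_context` is a hypothetical helper, and on a desktop "system CAs" simply means whatever Python's default context loads.

```python
import ssl


def build_client_context(client_pem=None, custom_ca=None):
    """Create a TLS client context mimicking the trust rules above."""
    if custom_ca:
        # Start from an empty trust store: the custom CA is the only one trusted
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        context.load_verify_locations(cafile=custom_ca)
    else:
        # No custom CA: rely on the platform's default CAs
        context = ssl.create_default_context()
    if client_pem:
        # A single file holding both the certificate and the private key
        context.load_cert_chain(certfile=client_pem)
    return context
```

The resulting context can be passed to `urllib.request.urlopen(url, context=...)` to fetch `papers.sqlite` with the client certificate.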

Extension

I found that with small screens it is not very practical to identify documents based on the thumbnail. Having a title is much more comfortable.

If the first line of Paperwork's extra keywords starts with a # the line is used as a title for the document.
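
As a sketch of this rule (the function name is made up, and stripping the leading # and the surrounding spaces from the title is an assumption):

```python
def extract_title(extra_keywords):
    """Return the title found in the extra keywords, or None."""
    lines = extra_keywords.splitlines()
    if lines and lines[0].startswith("#"):
        # Assumption: the marker itself is not part of the title
        return lines[0][1:].strip()
    return None
```

Only the first line is considered, so a # further down in the keywords does not create a title.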

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.