ahrm / sioyek

Sioyek is a PDF viewer with a focus on textbooks and research papers
https://sioyek.info/
GNU General Public License v3.0

Copying one's sioyek PDF annotations to another similar PDF file #1129

Open michaelzmm opened 1 month ago

michaelzmm commented 1 month ago

I have some books with annotations. Somewhere along the way I found that the TOC (table of contents) needed to be edited. I use a program called pdftocio for that, which generates a new PDF output file. I understand that this new file has a different hash, so sioyek will not find the annotations. Is there a way to copy the annotations from the old PDF file to the new one so I don't have to redo them manually? Maybe a command-line tool that takes the old and the new PDF and updates sioyek's database to the new file's hash? Thank you.
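To illustrate why the rewritten file loses its annotations: if the database key is a digest of the file's bytes, any change to the file (such as an edited table of contents) yields a different key. The following is a minimal sketch, assuming an MD5 digest of the file contents. The exact algorithm sioyek uses is an assumption here, and `file_digest` is a hypothetical helper, not part of sioyek.

```python
import hashlib
import pathlib
import tempfile


def file_digest(path):
    """Return the hex MD5 digest of a file's bytes (assumed hashing scheme)."""
    return hashlib.md5(pathlib.Path(path).read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    old = pathlib.Path(tmp) / 'book.pdf'
    new = pathlib.Path(tmp) / 'book_with_toc.pdf'
    old.write_bytes(b'%PDF-1.7 original contents')
    # stand-in for the output of a TOC editor: same book, different bytes
    new.write_bytes(b'%PDF-1.7 original contents plus outline')
    old_digest = file_digest(old)
    new_digest = file_digest(new)

# the two digests differ, so annotations keyed by the old digest are not found
print(old_digest)
print(new_digest)
```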

ahrm commented 1 month ago

There is currently no way to do that, but it is very easy to implement. I am on a trip right now, but I will write a script to do this when I come back (maybe ping me in a week or so if I have forgotten to do it by then).

ahrm commented 1 month ago

I wrote a simple python script to do this:

import sqlite3
import sys
import argparse
import pathlib
import uuid

def create_arg_parser():
    parser = argparse.ArgumentParser(
        prog='sioyek_database_utils',
        )

    parser.add_argument('--local-database-file-path', type=str, help='Path to the local database file (local.db)')
    parser.add_argument('--shared-database-file-path', type=str, help='Path to the shared database file (shared.db)')
    parser.add_argument('--get-file-hash', type=str, help='Print the hash of the following file if it is found in the database.')
    parser.add_argument('--get-hash-path', type=str, help='Print the path(s) of the following hash if it is found in the database.')
    parser.add_argument('--list-files', action=argparse.BooleanOptionalAction)
    parser.add_argument('--from', type=str, help='Hash of the source file')
    parser.add_argument('--to', type=str, help='Hash of the destination file')
    return parser

if __name__ == '__main__':
    parser = create_arg_parser()
    args = parser.parse_args(sys.argv[1:])

    shared_database_file_path = args.shared_database_file_path
    local_database_file_path = args.local_database_file_path

    with sqlite3.connect(local_database_file_path) as conn:
        cursor = conn.cursor()
        cursor.execute('SELECT * FROM document_hash')
        documents = cursor.fetchall()
        column_names = [description[0] for description in cursor.description]

    docs = [dict(zip(column_names, book)) for book in documents]

    if args.list_files:
        for doc in docs:
            print(doc['hash'], ':', doc['path'])
    elif args.get_file_hash:
        found = False
        for doc in docs:
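            try:
                same = pathlib.Path(doc['path']).samefile(args.get_file_hash)
            except OSError:
                # samefile() raises if either path no longer exists; skip such entries
                same = False
            if same: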
                print(doc['hash'])
                found = True
                break
        if not found:
            print('File not found in the database.')
    elif args.get_hash_path:
        # initialize the flag so the "not found" check below cannot raise NameError
        found = False
        for doc in docs:
            if doc['hash'] == args.get_hash_path:
                print(doc['path'])
                found = True
        if not found:
            print('Hash not found in the database.')
    elif getattr(args, 'from'):
        from_hash = getattr(args, 'from')
        to_hash = getattr(args, 'to')
        if to_hash is None:
            print('Destination hash (--to) is not provided.')
            sys.exit(1)

        table_names = [
            ('bookmarks', 'document_path'),
            ('highlights', 'document_path'),
            ('links', 'src_document'),
            ('marks', 'document_path')
             ]
        from_annotations = dict()

        for table, column_name in table_names:
            with sqlite3.connect(shared_database_file_path) as conn:
                cursor = conn.cursor()
                cursor.execute(f'SELECT * FROM {table} WHERE {column_name} = ?', (from_hash,))
                annotations = cursor.fetchall()
                column_names = [description[0] for description in cursor.description]
                from_annotations[table] = [dict(zip(column_names, bm)) for bm in annotations]

        for table, column_name in table_names:
            for annotation in from_annotations[table]:
                # drop id column
                annotation.pop('id')

                # we will specify the src hash column to be the hash of the new document
                annotation.pop(column_name)

                if 'uuid' in annotation:
                    # create a new uuid
                    annotation['uuid'] = '{' + str(uuid.uuid4()) + '}'

        # insert the new annotations to "to" document
        for table, column_name in table_names:
            with sqlite3.connect(shared_database_file_path) as conn:
                cursor = conn.cursor()
                for annotation in from_annotations[table]:
                    column_names = list(annotation.keys()) + [column_name]
                    column_values = list(annotation.values()) + [to_hash]
                    cursor.execute(f'INSERT INTO {table} ({", ".join(column_names)}) VALUES ({", ".join(["?" for _ in column_names])})', column_values)

        print('Copied ')
        print(len(from_annotations['highlights']), 'highlights')
        print(len(from_annotations['bookmarks']), 'bookmarks')
        print(len(from_annotations['links']), 'portals')
        print(len(from_annotations['marks']), 'marks')
        print(f'from {getattr(args, "from")} to {getattr(args, "to")}')

You can use it like this:

python copy_annotations.py --local-database-file-path <path to local.db> --shared-database-file-path <path to shared.db> --list-files

Lists all of your files along with their hashes. Now you can copy the annotations like so:

python copy_annotations.py --local-database-file-path <path to local.db> --shared-database-file-path <path to shared.db> --from <hash of the source file> --to <hash of the dest file>

michaelzmm commented 1 month ago

Hello ahrm, I tried your script, look what it came out:

michaelzmm~/syn$ sioyek_copy --shared-database-file-path /home/michaelzmm/syn/lib/shared.db --list-files
import-im6.q16: attempt to perform an operation not allowed by the security policy `PS' @ error/constitute.c/IsCoderAuthorized/426.
import-im6.q16: attempt to perform an operation not allowed by the security policy `PS' @ error/constitute.c/IsCoderAuthorized/426.
import-im6.q16: attempt to perform an operation not allowed by the security policy `PS' @ error/constitute.c/IsCoderAuthorized/426.
import-im6.q16: attempt to perform an operation not allowed by the security policy `PS' @ error/constitute.c/IsCoderAuthorized/426.
import-im6.q16: attempt to perform an operation not allowed by the security policy `PS' @ error/constitute.c/IsCoderAuthorized/426.
/home/michaelzmm/syn/scripts/sioyek_copy: line 7: syntax error near unexpected token `('
/home/michaelzmm/syn/scripts/sioyek_copy: line 7: `def create_arg_parser():'

Maybe it is something on my side, but I could not figure it out.

ahrm commented 1 month ago

You need to run it using Python. For example:

python sioyek_copy.py --shared-database-file-path /home/michaelzmm/syn/lib/shared.db --list-files

michaelzmm commented 1 month ago

Yes, "sioyek_copy" is an alias I made for "python sioyek_copy.py".

michaelzmm commented 1 month ago

Oh, I forgot to put "#!/usr/bin/python3" on the first line. That was the cause of the problem shown above.

michaelzmm commented 1 month ago

I could get the list of files from the database using only --local-database-file-path; when using only --shared-database-file-path, I got an error.

I tried the --get-file-hash option like so:

python3 sioyek_copy --local-database-file-path /home/michaelzmm/.local/share/sioyek/local.db --get-file-hash 'path-to-book.pdf'

I got an error while it iterated over docs, when it calls pathlib.Path(doc['path']).samefile(...) on file paths that don't exist anymore.

Let me make an observation: from the script listing, I could see that the database stores the hash and the file path. Why does it store the file path if it is something that's breakable, meaning the file can be moved somewhere else?

Also, I could see that all the books I don't have anymore are still stored in the database. But this can only be solved by scanning the disk for books and deleting each entry for which no PDF file with that hash is found. Maybe that is another cleanup script to be made.
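A cleanup pass along these lines could start by only reporting the stale entries. The sketch below assumes the `document_hash` table has `path` and `hash` columns, as the script above reads them. `find_stale_entries` is a hypothetical helper, and the demo builds a throwaway database rather than touching a real local.db. It checks stored paths only and does not rescan the disk for moved files.

```python
import os
import pathlib
import sqlite3
import tempfile


def find_stale_entries(local_db_path):
    """Return (hash, path) rows from document_hash whose file no longer exists."""
    conn = sqlite3.connect(local_db_path)
    try:
        rows = conn.execute('SELECT hash, path FROM document_hash').fetchall()
    finally:
        conn.close()
    return [(doc_hash, path) for doc_hash, path in rows
            if not pathlib.Path(path).exists()]


# demo on a temporary database with one existing and one missing file
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, 'local.db')
    kept = os.path.join(tmp, 'kept.pdf')
    gone = os.path.join(tmp, 'gone.pdf')
    pathlib.Path(kept).write_bytes(b'%PDF')

    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE document_hash (id INTEGER PRIMARY KEY, path TEXT, hash TEXT)')
    conn.executemany('INSERT INTO document_hash (path, hash) VALUES (?, ?)',
                     [(kept, 'aaa'), (gone, 'bbb')])
    conn.commit()
    conn.close()

    stale = find_stale_entries(db_path)

for doc_hash, path in stale:
    print(doc_hash, ':', path)
```

Reporting first and deleting in a separate, explicit step keeps the annotations of files that were merely moved from being lost by accident.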

Thank you. Sorry if I am bothering you. I still intend to test the script fully in a real use case.

ahrm commented 1 month ago

> Why does it store the file path if it is something that's breakable, meaning the file can be moved somewhere else?

How else could we open the file if we don't have its path?

> Also, I could see that all the books I don't have anymore are still stored in the database.

Yes, though this is not really a big issue. In fact, we detect these files when showing the recent documents window and could delete them there, but it is just not worth it.

> when using only --shared-database-file-path, I got an error

You should provide both shared database and local database paths.
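One way to make this requirement explicit is to let argparse reject a call that omits either path. This is a sketch of a possible change, not what the script above does: it uses `required=True` on both options and, to stay short, keeps only the `--list-files` flag from the original parser.

```python
import argparse


def create_arg_parser():
    parser = argparse.ArgumentParser(prog='sioyek_database_utils')
    # required=True makes argparse exit with a usage error if a path is missing
    parser.add_argument('--local-database-file-path', type=str, required=True,
                        help='Path to the local database file (local.db)')
    parser.add_argument('--shared-database-file-path', type=str, required=True,
                        help='Path to the shared database file (shared.db)')
    parser.add_argument('--list-files', action='store_true')
    return parser


parser = create_arg_parser()

# both paths given: parsing succeeds
ok_args = parser.parse_args(['--local-database-file-path', 'local.db',
                             '--shared-database-file-path', 'shared.db',
                             '--list-files'])

# only the shared path given: argparse prints a usage error and raises SystemExit
try:
    parser.parse_args(['--shared-database-file-path', 'shared.db', '--list-files'])
    exit_code = 0
except SystemExit as error:
    exit_code = error.code
```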

Also if your local.db and shared.db files don't contain private data you could email them to me (along with the files you want to migrate) and I could fix the files.

michaelzmm commented 4 weeks ago

> > Why does it store the file path if it is something that's breakable, meaning the file can be moved somewhere else?

> How else could we open the file if we don't have its path?

Actually, the user gives the file path when they want sioyek to open a file. So the program receives the file path, reads the file, and computes its hash. With the hash, the program can find the document in the db. So I don't see why the file path is stored in the db.
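The lookup flow described here (read the file, hash it, find its annotations by hash) can be sketched as follows. It assumes an MD5 digest of the file bytes, which may not match sioyek's actual scheme, and uses a simplified stand-in for the `bookmarks` table in shared.db. The `document_path` column name comes from the script above, while the `note` column and the `bookmark_count_for_bytes` helper are hypothetical.

```python
import hashlib
import sqlite3


def bookmark_count_for_bytes(conn, file_bytes):
    """Hash the file contents and count the bookmarks stored under that hash."""
    digest = hashlib.md5(file_bytes).hexdigest()  # assumed hashing scheme
    (count,) = conn.execute(
        'SELECT COUNT(*) FROM bookmarks WHERE document_path = ?', (digest,)
    ).fetchone()
    return digest, count


# simplified stand-in for shared.db; the `note` column is hypothetical
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE bookmarks (document_path TEXT, note TEXT)')
book = b'%PDF-1.7 some book'
conn.execute('INSERT INTO bookmarks VALUES (?, ?)',
             (hashlib.md5(book).hexdigest(), 'chapter 2'))

# the same bytes are found no matter where the file lives; edited bytes are not
_, count_same = bookmark_count_for_bytes(conn, book)
_, count_edited = bookmark_count_for_bytes(conn, book + b' edited')
conn.close()

print(count_same, count_edited)
```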

ahrm commented 4 weeks ago

> So I don't see why the file path is stored in the db.

So we can show a list of recently used documents (and open them).

michaelzmm commented 4 weeks ago

Should sioyek really have this functionality at all? It feels like it's overstepping its boundaries. The thing is, if you rename your PDF, it breaks. It's a fragile thing.

I would like to suggest a feature. I find myself needing to keep track of the "delta" between the page numbers of the physical (printed, published, scanned) book and the page numbers of the digital file. Maybe there could be a variable ppd (physical page delta) that the user can set and that is stored in the database. Then a book could be opened from the command line with an argument like -ppn (physical page number) 44: sioyek takes this number, subtracts the delta, and opens page x (the digital page). This would save me from having to keep this information somewhere else. My use case is jumping to that page of the book from a link in a file, like this: [./path/to/book.pdf?p=44]. Another use case is a bib reference file with page numbers from physical books, which I want to open at those pages when the books are also available digitally as scanned PDFs.
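The proposed mapping can be sketched outside sioyek as a small helper. Everything here is hypothetical: `digital_page` and `resolve_link` are illustrative names, the delta is taken to be "physical page minus digital page" so that subtracting it gives the digital page, and the `p` query parameter follows the link format in the comment above.

```python
from urllib.parse import parse_qs, urlsplit


def digital_page(physical_page, delta):
    """Convert a printed page number to a PDF page number by subtracting the delta."""
    return physical_page - delta


def resolve_link(link, delta):
    """Split a link like './book.pdf?p=44' into (file path, digital page)."""
    parts = urlsplit(link)
    physical_page = int(parse_qs(parts.query)['p'][0])
    return parts.path, digital_page(physical_page, delta)


# example: printed page 1 is PDF page 13, so delta = 1 - 13 = -12
path, page = resolve_link('./path/to/book.pdf?p=44', delta=-12)
print(path, page)
```

A per-document delta stored next to the document's hash would let such links keep working even if the file is moved, since only the printed page number appears in the link.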

ahrm commented 4 weeks ago

> Should sioyek really have this functionality at all? It feels like it's overstepping its boundaries.

Most PDF viewers have this feature.

michaelzmm commented 3 weeks ago

> > Should sioyek really have this functionality at all? It feels like it's overstepping its boundaries.

> Most PDF viewers have this feature.

Yes, I know. It may be viewed as a warning for your program not to become like most PDF viewers in the future.