ArneBachmann / tagsplorer

A quick and resource-efficient OS-independent tagging filetree extension tool and library
Mozilla Public License 2.0


tagsPlorer

A quick and resource-efficient OS-independent tagging folder tree extension tool and library written in Python. tagsPlorer is licensed under the Mozilla Public License version 2.0, which can also be found in the LICENSE file.

Each folder name of the indexed file tree is implicitly treated as a tag name, and additional tags can be set or excluded on singular files or file glob patterns. Contents of other folders in the file tree can virtually be mapped into the tree. The entire system is fully backwards-compatible with the common file tree metaphor present in most file systems.

Initial example

If you manage your data in a tree-like manner, as is the case in almost any contemporary file system, your data may look similar to this:

/personal/money/tax/2016
/personal/money/tax/2017
/personal/money/invoice/2016
/personal/money/invoice/2017
/personal/travel/2011/hawaii
/personal/travel/2011/new york
/work/projects/archive
/work/projects/current/communication

This is just an example, but it gives you the general idea. In each folder, you have files of varying types like office documents, media, or others. tagsPlorer allows you to create virtual "views" over your files, so that you can ask for "all Word documents from 2016" via tp .docx 2016, or "all money-related spreadsheet files from 2017" via tp money 2017 .xlsx.

Problem statement

Nowadays most operating systems and file system browsers still adhere to the tree of files metaphor, or try to employ some kind of integrated search engine to find and access existing files. Both approaches have noticeable drawbacks that could be solved by adding a thin optional tagging layer.

Problems with file trees

Each file belongs to exactly one parent folder, which practically prevents a file from being found under more than one category. One exception is Windows 7's libraries, which are virtual views over several folders.

This can be solved locally by using soft or hard links, if the operating system and file system support them, or by storing the same file several times in different folders. With links, however, you lose independence from the operating system, and compatibility with version control systems becomes clunky and error-prone. With copies, you have lots of duplication that might lead to human errors and increased storage demands.

Problems with search engines

Some desktop systems come with (sometimes even semantic) search engines that continuously crawl and index your files, and that guess and suggest what you might want to access next. Here the problem lies in the system overhead and the loss of control: you don't know whether you are presented with the relevant file (or version of it) or whether the search terms are correct, and you lose oversight of your actual underlying file system structure.

One solution: tagsPlorer to the rescue!

tagsPlorer uses a simple concept which enables you to

and still benefit from only little manual maintenance or additional setup. The benefit will increase even more with graphical tools using this library and with tighter integration into the OS or desktop system of your choice.

History

The author has attempted to write similar utilities several times in the past, namely taggify, tagtree, and tagtree2. This is his latest take on the persistent problem of convenient semi-semantic data archiving and retrieval by using space-efficient ahead-of-time indexing. There are similarities to the Linux find and grep utilities, which are performant but crawl the entire folder tree on every search, and don't support virtual folder mapping. There exist other projects with similar goals, e.g. TMSU.

Usage

Hint: Currently it's not possible to glob over folder names efficiently; trying to do so requires tagsPlorer to walk all folders that are left after an initial preselection process, or will have no effect when preselection picks up a set of potential folder matches before that step.

Command-line interface

Try running tagsPlorer with the PyPy Python 3 distribution.

Currently the only user interface is the console script tp (or python3 -m tagsplorer.tp), a thin yet streamlined layer over the library's basic functions. Glob patterns support * and ? to match any character sequence or a single character, but not character lists and ranges like [abc] or [a-z]. Also make sure to quote the globs correctly for your shell.
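The restricted glob syntax described above can be illustrated with a small sketch. It is not taken from the tagsPlorer sources; the function names glob_to_regex and simple_glob_match are hypothetical, and case-sensitive matching is an assumption. Only * and ? are treated as wildcards, while brackets are matched literally.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob that supports only * and ? into a compiled regex."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")             # any character sequence, including an empty one
        elif char == "?":
            parts.append(".")              # exactly one character
        else:
            parts.append(re.escape(char))  # everything else is literal, including [ and ]
    return re.compile("".join(parts), re.DOTALL)

def simple_glob_match(name: str, pattern: str) -> bool:
    """Return True if the whole name matches the pattern (case-sensitive by assumption)."""
    return glob_to_regex(pattern).fullmatch(name) is not None

if __name__ == "__main__":
    print(simple_glob_match("report.docx", "*.docx"))  # wildcard match
    print(simple_glob_match("a.txt", "[abc].txt"))     # brackets are not a character list here
```

Python's built-in fnmatch module does interpret [abc] and [a-z], which is why this sketch builds the expression by hand instead of relying on it.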

Here is a short description of valid program options and arguments:

--update or -U

Update the file index by walking the entire folder tree from the root down to leaf folders. This creates or updates (in fact replaces) the index in .tagsplorer.idx with the file system state respecting the current configuration. As this file will be written over on every index run, there is no need to track outdated items or perform memory management in the index. This simplifies the entire software model.
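A minimal sketch of such a full re-index follows. It is an illustration only, not the actual indexer: build_index is a hypothetical name, the returned dictionary layout is an assumption, and only folder names and file extensions are recorded as tags (manual tags, mappings and the on-disk format of .tagsplorer.idx are left out).

```python
import os
from collections import defaultdict

def build_index(root: str) -> dict:
    """Map each tag (folder name or file extension) to the set of folders it applies to."""
    index = defaultdict(set)
    for folder, _subfolders, files in os.walk(root):
        relative = os.path.relpath(folder, root)
        parts = [] if relative == "." else relative.split(os.sep)
        key = "/" + "/".join(parts)        # folder path relative to the indexed root
        for name in parts:                 # every folder name on the path is an implicit tag
            index[name].add(key)
        for filename in files:
            extension = os.path.splitext(filename)[1]
            if extension:                  # record e.g. ".docx" as a tag of this folder
                index[extension].add(key)
    # The result is built from scratch on every run, so there are no stale entries to track.
    return dict(index)
```

Because the mapping is rebuilt completely on each run, replacing the old index file with the new result is sufficient and no incremental bookkeeping is needed.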

Architecture and program semantics

Search algorithm

In general, what the program does is simple boolean set operations. The indexer maps tags (which include file and folder name (constituents), user-specified tags, and file extensions) to folders, with the risk of false positives (it's an over-generic, optimistic index that links folders with both inclusive or exclusive manual tags plus tags mapped from other folders, plus file extension information). After determination of potential folders in a first search step, their contained file names are filtered by potential further tags and inclusive or exclusive file name patterns. This step always operates on the actual currently encountered files, not on any indexed and potentially outdated state, to ensure correctness of output filtered data. If a mapped folder is excluded by a negative tag, its contents can still be found by the name of the positive tags of the mapping. TODO check if true.
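The two search steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: find_files and its parameters are hypothetical, the index is assumed to be a plain dictionary from tag to a set of folder paths, and mappings as well as file name patterns other than a single extension are omitted.

```python
import os

def find_files(index, root, include, exclude=(), extension=None):
    """Step 1: set operations on indexed folders. Step 2: filter the files actually on disk."""
    tags = list(include) + ([extension] if extension else [])
    if not tags:
        return []
    # Step 1: folders that carry all inclusive tags (optimistic, may contain false positives).
    candidates = set.intersection(*(set(index.get(tag, ())) for tag in tags))
    for tag in exclude:
        candidates -= set(index.get(tag, ()))   # drop folders carrying an exclusive tag
    # Step 2: look at the real folder contents, never at a possibly outdated index state.
    results = []
    for folder in sorted(candidates):
        path = os.path.join(root, folder.lstrip("/"))
        if not os.path.isdir(path):
            continue                             # folder vanished since the last index run
        for name in sorted(os.listdir(path)):
            is_file = os.path.isfile(os.path.join(path, name))
            if is_file and (extension is None or name.endswith(extension)):
                results.append(folder.rstrip("/") + "/" + name)
    return results
```

With an index built over the initial example tree, a call such as find_files(index, root, ["2016"], extension=".docx") corresponds roughly to the query tp .docx 2016.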

Configuration file

Using tagsPlorer's -i option, we can create an empty configuration file which also serves as the marker for the file tree's root (just like the existence of version control systems' .svn or .git folders). Usually all access to the configuration file should be performed through the tp command or directly via tagsplorer/lib.py library functions. The configuration file mainly follows the format and structure of Windows INI files, but without any interpolation or substitution (avoiding Python's built-in ConfigParser to enable multiple keys). The first line contains a timestamp to ensure that the matching index file .tagsplorer.idx is not outdated.
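To illustrate why repeated keys matter, here is a minimal parser sketch for an INI-like file with a leading timestamp line. It is not the code from tagsplorer/lib.py: parse_config and the SAMPLE contents are hypothetical, and the timestamp representation is an assumption. Only option names mentioned in this document are used in the sample.

```python
from collections import defaultdict

# Hypothetical file contents; the real timestamp format is not specified here.
SAMPLE = """1500000000
[]
ignored=*.tmp
ignored=*.bak
[/personal/travel]
skip=
"""

def parse_config(text: str):
    """Return (timestamp_line, sections); each section maps a key to a list of values."""
    lines = text.splitlines()
    timestamp = lines[0].strip() if lines else ""
    sections = {}
    current = None
    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # "[]" is the root section, "[/<path>]" points to a folder
            current = sections.setdefault(line[1:-1], defaultdict(list))
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)           # no interpolation or substitution
            current[key.strip()].append(value.strip())  # repeated keys accumulate
    return timestamp, {name: dict(options) for name, options in sections.items()}

if __name__ == "__main__":
    stamp, config = parse_config(SAMPLE)
    print(stamp, config)
```

Storing a list per key is the design point: a parser that keeps only one value per key could not represent several ignored= lines in the same section.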

The root section contains global configuration options that can be set and queried by the --set, --unset, --get and --clear commands.

Each section points to a path, including the root section [], and specifies any number of occurrences of the following options:

Global settings are stored under the root section []:

Marking folders

Individual folders can be marked as being ignored or to skip indexing all their children. This can be done in three different ways:

  1. Place a marker file .tagsplorer.ign or .tagsplorer.skp into the folder. Please note that the filename is case-sensitive on all platforms.
  2. Manually create a configuration setting inside the index configuration file .tagsplorer.cfg under the key [/<path>]: ignore= or skip=
  3. Manually create a global configuration setting inside the index configuration file .tagsplorer.cfg under the root key []: ignored=<glob> or skipd=<glob>
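The marker file variant (the first of the three ways above) can be sketched with a folder walk. The function name folders_to_index is hypothetical, and the exact semantics are an assumption based on a literal reading of the text: an ignore marker excludes the folder itself while its children are still visited, and a skip marker prevents descending into the children. The configuration-based variants are not covered.

```python
import os

IGNORE_MARKER = ".tagsplorer.ign"   # marker file names as given in this document
SKIP_MARKER = ".tagsplorer.skp"

def folders_to_index(root: str):
    """Yield the folders that should be indexed, honouring marker files (assumed semantics)."""
    for folder, subfolders, files in os.walk(root):   # top-down walk, so pruning works
        if SKIP_MARKER in files:
            subfolders[:] = []      # prune in place: os.walk will not visit the children
        if IGNORE_MARKER in files:
            continue                # assumption: the folder itself is left out of the index
        yield folder
```

The comparison against the file list is case-sensitive, matching the note that the marker filename is case-sensitive.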

Internal storage format

The indexer class contains the following data structures:

There are two further intermediate data structures used during indexing:

Tagging semantics

TODO What happens if a file with a tag gets mapped into the current folder, where the same tag excludes that file? Or the other way around? This currently cannot happen, as all folders are processed individually and then get merged into a single view, with duplicates being removed. There is no real link to the originating folder for the folder list, as we have the concept of virtual (tag) folders in a unified view.

Design decisions regarding linking on the file system level

If files are hard-linked between different locations in the file tree and are submitted to the version control system, they won't be linked when checked out at a different location, and on the original file system modifying one instance will result in all linked copies being modified when updated. This leads to all kinds of irritating errors or VCS conflicts.

  1. Option: tagsPlorer has to intercept update/checkout and re-establish file links according to its metadata (configuration). This is hard to guarantee and communicate.
  2. Option: Add ignore details to the used VCS (.gitignore or SVN ignore list) for all linked (or rather mapped) files. The danger here is of course to ignore files that later could be added manually, and not being able to distinguish between automatically ignored files, and those that the user wants to ignore on purpose.
  3. Option: By the current design the snapshot *.idx file is not persisted to the VCS (TODO add ignore markers automatically), so all links can be recreated on the first file tree walk (as in option 1). Even if linked files were earlier submitted as separate files, the folder walk would re-establish the link, potentially asking the user to confirm the linking and to choose one master version, or issuing a warning for diverging file contents.

Other design decisions

Development

Git and Github workflows

The main branch should always run fine and contain the latest stable version (release). Currently we are still before V1.0, therefore everything is still happening either on main or on other branches without announcement. Development activities are merged on the develop branch, and only merged to main for a release, which is then tagged. If any releases are built in the future (e.g. for pip or conda installation), they would only be built from commits that pass all tests on e.g. Travis CI or AppVeyor.

Known issues

TODOs