andy-z / ged2doc

Tools for converting GEDCOM data into document formats.
MIT License

Image search improvement #10

Closed andy-z closed 6 years ago

andy-z commented 6 years ago

The image search algorithm can be improved for some standard situations, either to avoid requiring the -i option or to handle duplicate file names better.

andy-z commented 6 years ago

A few examples of how different applications store file names in GEDCOM.

Drevo v4 (Windows)

When saving GEDCOM it has no option to save pictures next to it; instead it just stores the full path name of the original image in the GEDCOM file:

1 OBJE
2 FILE D:\home\documents\Drevo 4.files\Persons\5.jpg
3 FORM jpg

Drevo v5 (Windows)

This version has an option to save copies of images together with GEDCOM. When the option is unchecked it behaves just like v4:

1 OBJE
2 FORM jpg
2 FILE D:\home\documents\Drevo5.files\Persons\5.jpg

When the option is enabled it saves images in a separate directory whose name is derived from the output GEDCOM file name by adding ".files" to it (if GEDCOM is saved as drevo.ged then pictures are stored in the drevo.ged.files folder). The GEDCOM file then contains relative paths of images (relative to the folder of the output GEDCOM file):

1 OBJE
2 FORM jpg
2 FILE drevo.ged.files\Persons\2.jpg

MyHeritage Family Tree Builder (Windows)

This app also has an option to save images next to the GEDCOM file. Without that option GEDCOM stores full paths of the original images:

1 OBJE
2 FORM jpg
2 FILE d:\documents\myheritage\drevo\Photos\P33_617_800.jpg
2 _PHOTO_RIN MH:P33
2 _FILESIZE 51279

When the option is enabled it saves images in a folder next to the generated GEDCOM file; the folder name is derived from the base name of the GEDCOM file (without extension) by adding _Photos to it (e.g. images for drevo.ged are stored in the drevo_Photos folder). The GEDCOM file still has full path names of the image copies though:

1 OBJE
2 FORM jpg
2 FILE M:\drevo_Photos\P31_300_400.jpg
2 _PHOTO_RIN MH:P31
2 _FILESIZE 34142
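
The two naming conventions seen so far (Drevo v5 adding ".files" to the full GEDCOM file name, Family Tree Builder adding "_Photos" to the base name) suggest a short list of folders that could be probed before asking for the -i option. A minimal sketch only; the function name candidate_image_folders is hypothetical and is not part of ged2doc:

```python
import os


def candidate_image_folders(gedcom_path):
    """Return folders near the GEDCOM file where known apps copy images.

    Hypothetical helper; it only covers the two conventions observed above.
    """
    folder = os.path.dirname(os.path.abspath(gedcom_path))
    file_name = os.path.basename(gedcom_path)
    base_name, _ext = os.path.splitext(file_name)
    return [
        folder,                                       # folder of the GEDCOM file
        os.path.join(folder, file_name + ".files"),   # Drevo v5: drevo.ged.files
        os.path.join(folder, base_name + "_Photos"),  # Family Tree Builder: drevo_Photos
    ]
```
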
andy-z commented 6 years ago

GRAMPS (Linux)

No special options when saving GEDCOM; the output file contains absolute path names of the original images:

1 OBJE
2 FORM jpeg
2 FILE /home/ivanov/Pictures/иванов.jpeg
andy-z commented 6 years ago

Ancestris (on Linux)

GEDCOM is the primary data format for Ancestris. From experimenting with images, it looks like Ancestris can use either relative or absolute path names depending on where the image is located w.r.t. the GEDCOM file.

Ancestris can also save a copy of the GEDCOM file, and when saving it has an option to also copy image files. The location of the images after the copy is not quite predictable (it depends on the original path of the image), but the saved GEDCOM file will contain relative path names of the images.

andy-z commented 6 years ago

To summarize:

andy-z commented 6 years ago

I am trying to think of an implementation algorithm which covers all possible cases, but let's start with the simplest cases first.

1. Reading file on the same machine as it was produced

This assumes that files were not moved or renamed, which probably covers 99% of the use cases. Image names in the file can be either absolute or relative, and relative names are w.r.t. the folder of the GEDCOM file.

The algorithm in this case is trivial:

import os

img_path = ...  # FILE value read from the GEDCOM record
if not os.path.isabs(img_path):
    # relative names are resolved w.r.t. the folder of the GEDCOM file
    folder = os.path.dirname(os.path.abspath(gedcom_path))
    img_path = os.path.join(folder, img_path)
img = open(img_path, "rb")

2. Same machine but GEDCOM file moved

Absolute paths will still be valid, but relative paths will have to be resolved w.r.t. a folder that now needs to be specified:

import os

img_path = ...  # FILE value read from the GEDCOM record
if not os.path.isabs(img_path):
    # relative names are resolved w.r.t. the folder given by the user
    folder = options.image_folder
    img_path = os.path.join(folder, img_path)
img = open(img_path, "rb")

3. Images were moved

Absolute image paths will all be broken. If the GEDCOM file was moved together with the image folder then relative names may still work. The question now is how to find a file given its old path by searching in a specified folder and its sub-folders.

The simplest approach, which is already implemented in ged2doc, is to search for just the image base name, but that breaks if the same base name is found in more than one folder. The solution is probably to use not just the base name but longer path components.
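
To illustrate why base-name matching is fragile, here is a minimal sketch of such a search. It is not the actual ged2doc code; find_by_basename is a hypothetical helper that returns every match so that the caller can see when a name is ambiguous:

```python
import os


def find_by_basename(image_folder, img_path):
    """Return all files under image_folder named like the base name of img_path."""
    # Split on both separators so that Windows paths also work on Linux.
    base_name = img_path.replace("\\", "/").rsplit("/", 1)[-1]
    matches = []
    for dirpath, _dirnames, filenames in os.walk(image_folder):
        if base_name in filenames:
            matches.append(os.path.join(dirpath, base_name))
    return sorted(matches)
```

If the image folder contains both Persons\5.jpg and Events\5.jpg, a lookup for 5.jpg returns two candidates and there is no way to choose between them without looking at more of the original path.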

4. Everything was moved to a different host

A different host may even have different file naming rules, e.g. Linux vs Windows. If that is the case then I cannot even use os.path on the target machine to analyze paths in the file. Very likely I need my own parser for the paths which can guess the original naming conventions.
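
One possible way to analyze paths without depending on the host OS is the pure path classes in Python's pathlib, which implement Windows and POSIX rules on any platform. The sketch below guesses the convention from the path itself. The heuristic (a drive prefix or any backslash means Windows) and the name split_foreign_path are assumptions for illustration, and a POSIX file name that contains a backslash would be misclassified:

```python
import re
from pathlib import PurePosixPath, PureWindowsPath


def split_foreign_path(img_path):
    """Split a path from a GEDCOM file into components, without drive or root."""
    # Heuristic (assumption): a drive prefix or any backslash means Windows.
    looks_windows = bool(re.match(r"^[A-Za-z]:", img_path)) or "\\" in img_path
    path = PureWindowsPath(img_path) if looks_windows else PurePosixPath(img_path)
    # path.parts starts with the anchor ("D:\\" or "/") when there is one; drop it.
    parts = path.parts
    if path.anchor:
        parts = parts[1:]
    return list(parts)
```
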

5. Searching in ZIP files

Assuming that the ZIP file was created by the user (by compressing GEDCOM and one or more image folders), GEDCOM can still have relative and absolute paths in it. "Paths" in ZIP do not directly map to paths on the host OS, e.g. Windows uses backslash as the path separator while ZIP uses slashes. Searching in ZIP is going to be similar to case 4 above.
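
A small sketch of how archive members could be listed for matching, using the standard zipfile module. Member names come back with forward slashes regardless of the host OS, so they can be split the same way everywhere. The helper name zip_member_parts is hypothetical:

```python
import zipfile


def zip_member_parts(zip_path):
    """Map each file in the archive to its list of path components."""
    members = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            # ZIP member names use "/" as the separator on every platform.
            members[info.filename] = [p for p in info.filename.split("/") if p]
    return members
```
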

andy-z commented 6 years ago

I suspect this simple algorithm might work in all above cases:

Searching is a bit more involved. Let's say that the image folder is either a folder given on the command line, the GEDCOM file folder, or the ZIP file root "folder". Let's say some file has a path a\b\c\img.jpg relative to that folder (or a/b/c/img.jpg on Linux or /a/b/c/img.jpg in ZIP). Depending on how files were moved and what folder was given on the command line, the path in the GEDCOM file for the same image may look like:

The algorithm that could work for all these cases may look like:

  1. Strip Windows drive from the image path if present ([A-Za-z]:)
  2. Split image path on slash or backslash
  3. Loop by gradually stripping leftmost component of the path (e.g. [home:joe:Pictures:a:b:c:img.jpg, joe:Pictures:a:b:c:img.jpg, Pictures:a:b:c:img.jpg, a:b:c:img.jpg, b:c:img.jpg, c:img.jpg, img.jpg])
    • search for this path inside image folder recursively (replace : with target path separator)
    • if exactly one file is found - success!
    • if more than one file is found - give a warning and stop searching (no image is found)
    • otherwise strip next leftmost piece and loop
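
The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code from ged2doc: find_image is a hypothetical name, the ambiguous case raises LookupError instead of printing a warning, and matching is case-sensitive:

```python
import os
import re


def find_image(image_folder, img_path):
    """Return the path of the unique file matching img_path, or None.

    Raises LookupError when a candidate suffix matches more than one file.
    """
    # Step 1: strip a Windows drive prefix such as "D:".
    img_path = re.sub(r"^[A-Za-z]:", "", img_path)
    # Step 2: split on slash or backslash, dropping empty components.
    wanted = [part for part in re.split(r"[\\/]", img_path) if part]

    # Collect every file under image_folder as a list of path components.
    files = []
    for dirpath, _dirnames, filenames in os.walk(image_folder):
        rel_dir = os.path.relpath(dirpath, image_folder)
        dir_parts = [] if rel_dir == os.curdir else rel_dir.split(os.sep)
        for name in filenames:
            files.append(dir_parts + [name])

    # Step 3: try progressively shorter suffixes of the wanted path.
    for start in range(len(wanted)):
        suffix = wanted[start:]
        matches = [f for f in files if f[-len(suffix):] == suffix]
        if len(matches) == 1:
            return os.path.join(image_folder, *matches[0])
        if len(matches) > 1:
            raise LookupError("ambiguous image path: " + "/".join(suffix))
    return None
```

Comparing lists of components rather than joined strings avoids having to replace separators for the target OS, and it works the same way whether the original path came from Windows or Linux.
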
andy-z commented 6 years ago

All of this is taken care of by PR #11.