Closed andy-z closed 6 years ago
A few examples of how different applications store file names in GEDCOM.
When saving GEDCOM it has no option to save pictures next to it; instead it just saves the full path name of the original image in the GEDCOM file:
1 OBJE
2 FILE D:\home\documents\Drevo 4.files\Persons\5.jpg
3 FORM jpg
This version has an option to save copies of images together with GEDCOM. When the option is unchecked it behaves just like v4:
1 OBJE
2 FORM jpg
2 FILE D:\home\documents\Drevo5.files\Persons\5.jpg
When the option is enabled it saves images in a separate directory whose name is derived from the output GEDCOM file name by adding ".files" to it (if GEDCOM is saved as drevo.ged then pictures will be stored in the drevo.ged.files folder). The GEDCOM file then contains relative paths of images (relative to the folder of the output GEDCOM file):
1 OBJE
2 FORM jpg
2 FILE drevo.ged.files\Persons\2.jpg
This app also has an option to save images next to GEDCOM file. Without that option GEDCOM stores full paths of the original images:
1 OBJE
2 FORM jpg
2 FILE d:\documents\myheritage\drevo\Photos\P33_617_800.jpg
2 _PHOTO_RIN MH:P33
2 _FILESIZE 51279
When the option is enabled it saves images in a folder next to the generated GEDCOM file; the folder name is derived from the base name of the GEDCOM file (without extension) by adding _Photos to it (e.g. for drevo.ged images are stored in the drevo_Photos folder). The GEDCOM file still has the full path names of the image copies though:
1 OBJE
2 FORM jpg
2 FILE M:\drevo_Photos\P31_300_400.jpg
2 _PHOTO_RIN MH:P31
2 _FILESIZE 34142
No special options while saving GEDCOM; the output file contains absolute path names of the original images:
1 OBJE
2 FORM jpeg
2 FILE /home/ivanov/Pictures/иванов.jpeg
GEDCOM is the primary data format for Ancestris. From experimenting with images it looks like Ancestris can use either relative or absolute path names depending on where the image is located w.r.t. GEDCOM file.
Ancestris can also save a copy of the GEDCOM file, and when saving it has an option to also copy image files. The location of the images after the copy is not quite predictable (it depends on the original path of the image), but the saved GEDCOM file will contain relative path names of the images.
To summarize:
I am trying to think of an implementation algorithm which covers all possible cases, but let's start with the simplest cases first.
Assume first that files were not moved or renamed. This probably covers 99% of the use cases. Image names in the file could be either absolute or relative, and relative names are w.r.t. the folder of the GEDCOM file.
The algorithm in this case is trivial:
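import os

img_path = ...  # image path as stored in the GEDCOM FILE record
if os.path.isabs(img_path):
    # Absolute path: use it as is.
    img = open(img_path, "rb")
else:
    # Relative path: resolve it against the folder of the GEDCOM file.
    folder = os.path.dirname(os.path.abspath(gedcom_path))
    img_path = os.path.join(folder, img_path)
    img = open(img_path, "rb")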
Absolute paths will still be valid, but relative paths will have to be resolved w.r.t. a folder that now needs to be specified:
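import os

img_path = ...  # image path as stored in the GEDCOM FILE record
if os.path.isabs(img_path):
    # Absolute path: use it as is.
    img = open(img_path, "rb")
else:
    # Relative path: resolve it against the folder specified by the user.
    folder = options.image_folder
    img_path = os.path.join(folder, img_path)
    img = open(img_path, "rb")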
Absolute image paths will all be broken. If the GEDCOM file was moved together with the image folder then relative names may still work OK. The question now is how to find a file given its old path by searching in a specified folder and its sub-folders.
The simplest approach, which is already implemented in ged2doc, is to just search for an image base name, but that breaks if the same base name is found in more than one folder. The solution is probably to use not just a base name but longer path components.
Different hosts may even have different file naming rules, e.g. Linux vs Windows. If that is the case then I cannot even use os.path on the target machine to analyze paths in the file. Very likely I need to have my own parser for the paths which can guess the original naming conventions.
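A minimal sketch of such a parser, assuming it is enough to guess the convention from the path text itself (a drive letter or a backslash means Windows, anything else is treated as POSIX). The helper name split_gedcom_path is hypothetical and not part of ged2doc; the pure path classes from pathlib only manipulate strings, so they behave the same on any host:

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical helper, not part of ged2doc.
_DRIVE_RE = re.compile(r"^[A-Za-z]:")


def split_gedcom_path(path):
    """Return (has_anchor, components) for a path string found in GEDCOM.

    has_anchor is True when the path starts with a drive and/or a root.
    """
    # Assumption: a drive letter or any backslash means a Windows-style path;
    # POSIX file names that contain backslashes are ignored here.
    if _DRIVE_RE.match(path) or "\\" in path:
        pure = PureWindowsPath(path)
    else:
        pure = PurePosixPath(path)
    # Drop the anchor (drive and/or root), keep only the named components.
    parts = [p for p in pure.parts if p != pure.anchor]
    return bool(pure.anchor), parts
```

The component list no longer depends on the original separator, so it can be joined again with whatever separator the target host (or a ZIP archive) uses.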
Assuming that the ZIP file was created by the user (by compressing GEDCOM and one or more image folders), GEDCOM can still have relative and absolute paths in it. "Paths" in ZIP do not directly map to paths on the host OS, e.g. Windows uses backslash as path separator and ZIP uses slashes as separators. Searching in ZIP is going to be similar to case 4 above.
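To illustrate the separator mismatch, here is a small sketch of looking up a relative GEDCOM path in an archive. It assumes the GEDCOM file sits at the root of the ZIP, so relative paths are relative to the archive root; read_image_from_zip is a hypothetical name. Absolute paths would still need the search discussed for moved files:

```python
import zipfile


def read_image_from_zip(zf, relative_path):
    """Return the bytes of an image stored in an open ZipFile, or None.

    Hypothetical helper; assumes relative_path is relative to the ZIP root.
    """
    # ZIP member names use "/" regardless of the OS that created the archive,
    # so a Windows-style relative path has to be converted first.
    name = relative_path.replace("\\", "/").lstrip("/")
    if name not in zf.namelist():
        return None
    with zf.open(name) as member:
        return member.read()
```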
I suspect this simple algorithm might work in all above cases:
Searching is a bit more involved. Let's say that the image folder is either a folder given on the command line, the GEDCOM file folder, or the ZIP file root "folder". Let's say some file has a path a\b\c\img.jpg relative to that folder (or a/b/c/img.jpg on Linux or /a/b/c/img.jpg in ZIP). Depending on how files were moved and what folder was given on the command line, the path in the GEDCOM file for the same image may look like:
C:\user\joe\Documents\Pictures\a\b\c\img.jpg
C:\user\joe\Documents\Pictures\c\img.jpg
c\img.jpg
a\b\c\img.jpg
Pictures\a\b\c\img.jpg
Pictures\c\img.jpg
/home/joe/Pictures/a/b/c/img.jpg
/home/joe/Pictures/c/img.jpg
c/img.jpg
a/b/c/img.jpg
Pictures/a/b/c/img.jpg
Pictures/c/img.jpg
Pictures/img.jpg
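The algorithm that could work for all these cases may look like:
1. Strip the drive letter, if any ([A-Za-z]:), from the path found in GEDCOM.
2. Split the path into components and build the list of its trailing sub-paths, longest first, e.g. [home:joe:Pictures:a:b:c:img.jpg, joe:Pictures:a:b:c:img.jpg, Pictures:a:b:c:img.jpg, a:b:c:img.jpg, b:c:img.jpg, c:img.jpg, img.jpg].
3. Look for each of them in turn in the image folder (replacing : with the target path separator).

All is taken care of by PR #11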
Image search algorithm can be improved for some standard situations to not require specification of -i option or for better handling of duplicate file names.
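For illustration, the lookup by progressively shorter path suffixes discussed above might be sketched as follows. This is not the code from PR #11; path_suffixes and find_image are hypothetical names, file names are assumed not to contain ":", matching is case-sensitive, and an ambiguous match is reported as None rather than guessed:

```python
import os
import re


def path_suffixes(gedcom_path):
    """List trailing sub-paths of a GEDCOM path, longest first, joined with ':'."""
    # Drop a Windows drive letter, then split on either kind of separator.
    no_drive = re.sub(r"^[A-Za-z]:", "", gedcom_path)
    parts = [p for p in re.split(r"[\\/]+", no_drive) if p]
    return [":".join(parts[i:]) for i in range(len(parts))]


def find_image(image_folder, gedcom_path):
    """Return the unique file under image_folder matching the longest suffix."""
    # Index every file below image_folder by its ':'-joined relative path.
    known = {}
    for dirpath, _dirs, files in os.walk(image_folder):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, image_folder)
            known[":" + rel.replace(os.sep, ":")] = full
    for suffix in path_suffixes(gedcom_path):
        matches = [f for key, f in known.items() if key.endswith(":" + suffix)]
        if len(matches) == 1:
            return matches[0]
        if matches:
            # Shorter suffixes can only be more ambiguous, so give up here.
            return None
    return None
```

Trying the longest suffix first means a duplicate base name is only a problem when no longer path component tells the candidates apart.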