FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Add `PageRegion` and `PageRectangleRegion` source reference qualifiers #333

Closed by berry120 3 years ago

berry120 commented 3 years ago

I'm looking at a project where I may make reasonably heavy use of interlinking different sources, and what first drew me to this format was the ability to specify "regions" in a source, which is fantastic - I haven't found anywhere else that allows that out of the box.

I wonder however if you'd consider making page-aware qualifiers part of the standard (since some types, such as PDF, naturally span multiple pages):

Just for completeness, the specific use case here is digitally referencing a whole bunch of family documents / photos and relating them to each other. I have detailed scans of things such as family photo albums - but I have separate scans of the individual photos in them as well as separate scans of each page (this is important as the pages themselves sometimes contain annotations outside of photos, and I'd like to also preserve the "look" of the original album.) Currently the plan is to relate them in gedcomx by having a PDF of the album at a page level, and then relating the individual images to this PDF by pointing the componentOf field in the SourceDescription of each individual image to the main PDF - but qualifying the position of those images would then require the page level qualifiers.
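A minimal sketch of that arrangement, assuming the GEDCOM X JSON shape for source descriptions. The ids, the file names, and the `components_of` helper are hypothetical and exist only for illustration:

```python
import json

# Hypothetical source description for the page-level PDF of the album.
album = {
    "id": "album",
    "about": "album.pdf",
}

# Hypothetical source description for one separately scanned photo.
photo = {
    "id": "photo-17",
    "about": "photo-17.jpg",
    # componentOf is a source reference pointing at the containing source.
    "componentOf": {"description": "#album"},
}

document = {"sourceDescriptions": [album, photo]}


def components_of(document, parent_id):
    """Return the ids of source descriptions that are components of parent_id."""
    target = "#" + parent_id
    return [
        description["id"]
        for description in document["sourceDescriptions"]
        if description.get("componentOf", {}).get("description") == target
    ]


print(json.dumps(document, indent=2))
print(components_of(document, "album"))
```

Without a page-level qualifier on the `componentOf` reference, this only says that the photo belongs to the album, not where in the album it sits.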

(I'm aware that I could just break the standard's recommendation and specify my own qualifiers anyway, but I thought this might be useful in the general case, hence the GitHub issue. Happy to raise a PR if others think this is a good idea.)

stoicflame commented 3 years ago

Thanks for the PR! Well done.

Two comments:

  1. Do we need the PageRectangleRegion qualifier if we've defined both the PageRegion and the RectangleRegion? My inclination would be to just have two qualifiers on the source reference (a page and a rectangle region).
  2. Is the word Region redundant in PageRegion? I wonder if we could just say Page? I feel like the word "Region" isn't redundant in the other qualifiers because they're each specifying some kind of a "start" and "end" boundary. But "page" has boundaries by definition, no?

berry120 commented 3 years ago

Agree with both of those points - I've updated the PR to match.

(Initially I missed that the qualifiers were always specified as a list, rather than individually, hence the addition of PageRectangleRegion.)
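A rough sketch of what the two-qualifier approach could look like to a consumer. The `http://gedcomx.org/Page` URI is an assumption based on the name discussed in this thread, the rectangle value is assumed to use the comma-separated `x1,y1,x2,y2` form, and `locate` is a hypothetical helper:

```python
PAGE = "http://gedcomx.org/Page"  # assumed URI, based on the name proposed here
RECTANGLE_REGION = "http://gedcomx.org/RectangleRegion"


def locate(source_reference):
    """Return (page, rectangle) read from a source reference's qualifier list.

    Either element is None when the matching qualifier is absent.
    """
    page, rectangle = None, None
    for qualifier in source_reference.get("qualifiers", []):
        if qualifier["name"] == PAGE:
            page = int(qualifier["value"])
        elif qualifier["name"] == RECTANGLE_REGION:
            # Assumed value format: "x1,y1,x2,y2".
            rectangle = tuple(float(v) for v in qualifier["value"].split(","))
    return page, rectangle


# Hypothetical reference: a region on page 3 of the album PDF.
reference = {
    "description": "#album",
    "qualifiers": [
        {"name": PAGE, "value": "3"},
        {"name": RECTANGLE_REGION, "value": "0.1,0.2,0.5,0.6"},
    ],
}

print(locate(reference))
```

Because the qualifiers are a list, the page and the rectangle stay independent: a reference can carry either one alone, and no combined `PageRectangleRegion` is needed.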

thomast73 commented 3 years ago

When a multipage document has pages that are not numbered, an absolute page number is the way to go. And certainly, an absolute page number is easy for a computer to semantically interpret. But I worry about human interpretation of the absolute page number when the multipage document has been numbered for human consumption. When a reference for page 1 of a book that has 20+ pages of preface material numbered in Roman numerals is consumed by a human, the human will look for page 1 after the Roman numbered pages. So my question is: could we allow for both an absolute page number and a document-specific page number?

berry120 commented 3 years ago

@thomast73 It's a fair point. I wondered about this too, and about the emphasis we place on the raw file being human readable (as opposed to an application using it being human readable).

I decided against it here because I think it introduces a lot of complexity for little gain. We'd either have to have the qualifier differentiate between "raw" and "labelled" page numbers dynamically (which doesn't seem too reliable so I don't think that would be the best idea), introduce a more complicated syntax to the qualifier to differentiate between the two somehow, or introduce a separate qualifier for "raw" and "labelled" page numbers completely (and make them mutually exclusive.)

If we go with "absolute" page numbers in the spec, then the only real disadvantage is that it's not immediately clear to someone reading the raw file. I'd say this is quite rare though - by far the most common use case is going to involve a user viewing the data through some kind of backend processing & presentation layer, which would easily be capable of taking the raw page number, looking up the "labelled" page number if different, and then showing that to the user instead.
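As an illustration of that lookup, here is a small sketch of how a presentation layer might turn an absolute page number into a labelled one. It assumes a simplified model that is not part of the proposal: absolute pages count from 1, and the front matter is one contiguous run numbered in lower-case Roman numerals. `to_roman` and `page_label` are hypothetical helpers:

```python
ROMAN = [
    (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
    (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
    (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
]


def to_roman(number):
    """Convert a positive integer to a lower-case Roman numeral."""
    parts = []
    for value, symbol in ROMAN:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_label(absolute_page, front_matter_pages):
    """Map a 1-based absolute page number to the label a reader would see."""
    if absolute_page < 1:
        raise ValueError("absolute page numbers are assumed to start at 1")
    if absolute_page <= front_matter_pages:
        return to_roman(absolute_page)
    return str(absolute_page - front_matter_pages)


# A book with 20 pages of preface: absolute page 21 is the page labelled "1".
print(page_label(21, 20))
print(page_label(4, 20))
```

A real application would more likely read the labels from the document itself than compute them, but the stored qualifier stays the same absolute number either way.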

So in short, I think always defaulting to the "absolute" page number is the better thing to do, since it is simpler in the spec and still lets an application show the "absolute" or "labelled" page number to the user as it sees fit. I'm very open to being challenged on any of the above if I'm wrong, however - my only firm point would be that we definitely need to define it unambiguously!

thomast73 commented 3 years ago

@berry120 Perhaps we should consider a qualifier name that is less ambiguous, and have it carry the meaning you have defined. This would leave the door open for another qualifier in the future and would also help make your proposal more specific. AbsolutePage? I'm not in love with that name, but would prefer it over "raw...".

stoicflame commented 3 years ago

That said, @berry120 makes a good case for the "absolute" page being the assumed default. It seems reasonable that we could just keep the simple name Page for the default case, and if we ever add support for something other than the "absolute" page, we can make the qualifier name for that new thing more specific and descriptive.

berry120 commented 3 years ago

Personally I'm not too hung up on the name, whether that's Page, AbsolutePage, RawPage or whatever else - happy to just go with the consensus on that one 👍

berry120 commented 3 years ago

@stoicflame Just wanted to check if there are any more thoughts on the page name, or anything else that needs to happen before this is ready to merge?

stoicflame commented 3 years ago

I think we got about as much feedback as we're going to get. Let's merge!