mdlincoln opened this issue 5 years ago
@mdlincoln We're already doing this for the CETEIcean interface--check with @raffazizzi about it. A couple of notes, though:

1) For collating purposes, our process in this project resequences the UMD data to move margin zones into reading sequence (otherwise they're at the ends of the files), so I have a version of the UMD files with that resequencing applied. The latest online versions should be pulled into the collation process when we do a new collation, but this is probably not going to change very much.

2) We don't use my resequenced files in the interface. Instead, @raffazizzi is pulling in the UMD TEI directly and using the FV spine data to highlight moments of variation in it. There's some fine-tuning to be done on that pointing, but it's part of our project to be working directly with UMD data.
In effect, what we have are two versions of the UMD data--one that's remixed, which at the moment we're only using for collation, and the other that comes direct from the source. The standoff_Spine directory in fv-data has <ptr> elements targeting the UMD data directly for each moment of variation.
Those standoff_Spine files also contain, at each locus of variation, for each <rdgGrp> (or cluster of editions that agree on a passage), a list of the normalized word tokens for that passage. That will give you a normalized view of the passage--basically what we gave the collation software for it to understand what was altered. That's an easy way to access how our project processed the UMD data, and it comes packaged with collation info, like the edit distances from each point to the next. Hope that helps... Basically the standoff_Spine files point to the original UMD source and show you how our project processed a given passage.
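For illustration, here is a minimal sketch (not project code) of how a client might walk one of those standoff_Spine files in browser JavaScript. It assumes only the <rdgGrp> and <ptr> elements described above, guesses that the normalized tokens are carried on an attribute, and elides namespace handling:

```js
// Illustrative sketch only: list, for each reading group in a standoff_Spine
// file, the normalized tokens (assumed here to live in an "n" attribute) and
// the targets of its <ptr> elements. Uses browser APIs (fetch, DOMParser).
async function listLoci(spineUrl) {
  const xml = await (await fetch(spineUrl)).text();
  const doc = new DOMParser().parseFromString(xml, "application/xml");

  for (const rdgGrp of doc.getElementsByTagName("rdgGrp")) {
    const normalizedTokens = rdgGrp.getAttribute("n"); // assumed attribute name
    const targets = Array.from(rdgGrp.getElementsByTagName("ptr"))
      .map((ptr) => ptr.getAttribute("target"));
    console.log({ normalizedTokens, targets });
  }
}
```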
Thanks for those notes @ebeshero. For this week, I'm just trying to prototype some quick static visualizations, possibly for use in the grant app, so I may hold off on trying to pull in the UMD files for now, just to make my life a bit simpler. But I would ultimately like to be able to point directly to the UMD files when generating this visualization, just as you hope to do with the CETEIcean interface.
I had a related question regarding the ptr target values in the current spine files: while all of the pointers for the edition chunks reference a simple xml id, the ones for the UMD are arbitrary XPath queries. I understand that the UMD files require those complex selectors--that's not an issue. But it seems inconsistent to have all the pointers use the structure target="path#suffix" when a subset of those paths require treating suffix as an id (i.e. in pseudo-code, xml_find($path, xpath = ".//*[@id='$suffix']")) but the UMD ones assume that suffix is an entire XPath query (i.e. xml_find($path, xpath = $suffix)).
It's likely I'm missing something, of course...
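A literal rendering of that pseudo-code as browser JavaScript, just to make the inconsistency concrete (a simplified sketch, not the project's actual resolution code; the real files use xml:id, but plain @id is kept here to mirror the pseudo-code):

```js
// Both kinds of target look like "path#suffix", but the suffix has to be
// interpreted in two different ways (sketch mirroring the pseudo-code above).
function xmlFind(doc, xpath) {
  return doc.evaluate(xpath, doc, null,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
}

function resolveTarget(doc, target, suffixIsXPath) {
  const suffix = target.split("#")[1];
  if (!suffixIsXPath) {
    // Edition chunks: the suffix is a bare id, wrapped in a fixed query.
    return xmlFind(doc, `.//*[@id='${suffix}']`);
  }
  // UMD files: the suffix is itself a complete query, evaluated as-is.
  return xmlFind(doc, suffix);
}
```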
Right, the two methods of pointing are different: that's because we're able to plant @xml:ids in the other (non-UMD) editions as part of the up-conversion process after collation: the other edition files in the Variorum get reconstructed to hold data from the collation. The UMD files, though, remain untouched, so the reach with XPath is definitely more complicated. If I'm following you, you're noting that the syntax is pointing at a variable that's constructed differently, and yes, that's the case. @raffazizzi will have more to say about how that works. It's definitely working to retrieve the files, but sometimes the pointer resolution is reaching a little far to the left or right of the passage in cases where we've had to calculate how to count characters around deletions and insertions--there's some fine-tuning to be done (part of what we're writing for in the grant).
FYI, @Rikkm , @raffazizzi , and I are pounding out the first rough draft of the grant application right now for a first round draft review due tomorrow 12/4...and I think we're interpreting tomorrow liberally as the end of the day. (We don't necessarily need visuals like instantly, but we're working with deadlines of the end of this week for the university grant offices). Anyway, whatever you're able to develop is wonderful and will certainly help us! (Thank you!!!) :-) The application is ultimately due in early January, so I bet we can slide some late-breaking visuals in over the course of the month.
Note-- edited the above to be a little more informative. :-)
@mdlincoln Yay! thank you very much! :-) And also thanks for nudging us with good questions--it really helps the writing process for everything (including these grant apps). There's a LOT of documentation we need to write.
Hi, sorry to jump in a bit late. @ebeshero explained everything :+1: I'll just add this about the collation pointers: retrieving an ID is a very simple operation in DOM, so we use that mechanism when we point to ids directly. For S-GA only we need XPointer in order to grab arbitrary chunks of text. The JavaScript code for the demo implements XPointer just enough to resolve our predictable pointers.
The S-GA data comes directly from the project's GitHub.
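For anyone following along, "just enough XPointer" might look something like the sketch below: handle only fragment identifiers of the predictable form string-range(XPATH, start, length) and nothing else. This is an illustration under those assumptions, not the demo's actual code; multi-node ranges are omitted and a namespace resolver would be needed for prefixed XPaths.

```js
// Minimal, illustrative XPointer handling: only string-range(XPATH, start, length)
// fragments are supported, which is enough for predictable, machine-generated
// pointers. Not the demo's actual implementation.
const STRING_RANGE = /^string-range\((.+),\s*(\d+),\s*(\d+)\)$/;

function resolveStringRange(doc, fragment, nsResolver = null) {
  const match = STRING_RANGE.exec(fragment);
  if (!match) return null;
  const [, xpath, start, length] = match;
  const node = doc.evaluate(xpath, doc, nsResolver,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
  if (!node) return null;
  // Slice the requested span out of the element's text content.
  return node.textContent.substring(Number(start), Number(start) + Number(length));
}
```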
My issue is that your design (https://github.com/PghFrankenstein/pghf-reader/blob/master/src/actions/index.js#L113) requires someone to know ahead of time that witness #fMS has target values that need to be parsed with a totally different mechanism.
Ideally, either the source files are set up so that you can have one unified target interface, OR the spine files need to have a new attribute that specifies how their node's target needs to be parsed. For simplicity of use, I think the former is the saner solution.
That's a good point and I'll keep it in mind for the next phase of development. I figure another way to proceed could be to determine the mechanism from the shape/content of @target instead of relying on #fMS. It just seems overkill to use and implement XPointer for a simple XML ID reference, and TEI ascribes both datatypes to the same attribute.
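Something like this sniffing, perhaps (an illustrative sketch; the example targets below are made up):

```js
// Sketch: decide the pointing mechanism from the shape of @target itself,
// rather than from which witness (#fMS) the reading belongs to.
function pointerKind(target) {
  const fragment = target.split("#")[1] || "";
  // XPointer scheme pointers have the shape scheme(...), e.g. string-range(...)
  if (/^[A-Za-z][\w.-]*\(.*\)$/.test(fragment)) return "xpointer";
  // Anything else is treated as a bare ID reference.
  return "idref";
}

pointerKind("1818_C10.xml#C10_anchor");                         // "idref" (made-up value)
pointerKind("ox-ms_abinger_c56.xml#string-range(//line,0,40)"); // "xpointer" (made-up value)
```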
> It just seems overkill to use and implement XPointer for a simple XML ID reference and TEI ascribes both datatypes to the same attribute.
Agreed, but this has implications for future adoption of this innovative spine architecture. Should the onus fall on:

1. the source TEI publisher to retrofit / create derivatives of their documents with new elements wrapping the desired spans of text, so that the spine only has to do the work of pointing to these elements with xml:id refs?
2. the spine creator/user to manage a variety of XML search & retrieval methods, so that the spine can be written for any already-existing TEI file?

My initial impression, fwiw, is that it should be option 1: the spine should offer one consistent interface for pointing to other XML files, and at worst it should use XPath queries, avoiding the overhead of requiring an end user to use a library that can handle XQuery/XPointer. If the target files are so complicated as to necessitate the power of XPointer, then generate a derivative TEI file that materializes those complex queries as simply-referenced XML elements, so lookups from the spine are consistently simple.

This issue is at the heart of the spine data model: 1. is to be avoided in order to increase interoperability between TEI projects.
Ideally we want the spine to do just that. We didn't need to point to S-GA with complex pointers, but we wanted to show it is possible to use TEI data without "owning" it. Think of them as IIIF targets to specific regions; you can pre-generate tiles if you know the use case, but otherwise you're better off with a system that can process more complex targeting systems (e.g. via coordinates and rotation angles for images, string ranges for text, measures and beats for music notation, etc.)
All this is just about what the model does, but it is not unusual in TEI-land to adopt a model and write specific implementations that only fulfill specific project requirements.
If we wanted to create a fuller implementation of the spine data model, it should support all pointing mechanisms that TEI's @target supports, including XPointer and fragment references via ID (that is, xsd:anyURI). Or we can choose a subset of those (e.g. direct ID reference and XPointer's string-range()).
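A sketch of what that subset could look like on the reading side (illustrative only; resolveStringRange stands for the kind of minimal string-range() helper sketched earlier, and the ID lookup assumes the parser exposes xml:id to getElementById):

```js
// Illustrative subset: accept either a direct ID reference or an XPointer
// string-range() fragment in @target, and nothing else.
function resolveSubset(doc, target) {
  const fragment = target.split("#")[1] || "";
  if (fragment.startsWith("string-range(")) {
    return resolveStringRange(doc, fragment); // minimal XPointer helper (see earlier sketch)
  }
  // Direct ID reference; assumes xml:id is recognized as an ID by the parser,
  // otherwise an XPath lookup on @xml:id would be needed instead.
  return doc.getElementById(fragment);
}
```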
Re: https://github.com/PghFrankenstein/fv-data/issues/8 -- I actually shouldn't even be worrying about the spine for visualization purposes.
That said, yes, if option 2 is the use priority, then the model needs to not only support multiple types of pointing mechanisms but also have some kind of attribute indicating which mechanism it's going to use.
I'm re-opening this, since I think we now want to add an attribute to indicate the kind of pointing mechanism we're using.
> the model needs to not only support multiple types of pointing mechanisms but also have some kind of attribute indicating which mechanism it's going to use
That's a very good suggestion. We could do that with a @type attribute on <ptr>, though it needs to "characterize the element in some sense" rather than the target itself, so we could have something like:
<ptr type="toElement" target="IDREF" />
<!-- or -->
<ptr type="toStringRange" target="XPOINTER" />
The ODD (and the schema) can define the type values, and we can use Schematron to enforce correspondence between @type and @target. Likewise, implementations can use these resources programmatically to determine behavior, but a hardcoded switch on @type would be faster.
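That switch might be as simple as the following sketch (the helper functions are illustrative placeholders, and the type values are just the ones proposed above):

```js
// Sketch of dispatching on the spine's declared @type instead of sniffing
// the target. Helper functions and error handling are illustrative only.
function resolvePtr(doc, ptrElement) {
  const target = ptrElement.getAttribute("target");
  const fragment = target.split("#")[1] || "";
  switch (ptrElement.getAttribute("type")) {
    case "toElement":
      return resolveIdRef(doc, fragment);       // plain DOM ID lookup (placeholder)
    case "toStringRange":
      return resolveStringRange(doc, fragment); // minimal string-range() handling (placeholder)
    default:
      throw new Error(`Unsupported ptr/@type: ${ptrElement.getAttribute("type")}`);
  }
}
```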
@raffazizzi I'm thinking we need to open a new issue on one of the other FV repos about this--probably in my postCollation processing where we're first generating the spine, right? If I remember this right, I'm first generating the spine structure, with pointers to everything except S-GA. In the same XSLT I could be planting the @type attribute to indicate which kind of pointer we're using... (then you pick up those files and process to add the XPointers to S-GA).
@ebeshero those two steps are handled by the same script after our last round of work, so we can assign @type values there; it should be fairly simple.
@ebeshero since this discussion is now really about fv-postCollation, can you use your administrator privileges to transfer the issue over to that repo?
https://help.github.com/articles/transferring-an-issue-to-another-repository/
Probably best to set up the code to pull directly from the latest online versions anyway, whether it's the local frankenstein data or the UMD data, so that we have one pipeline for dealing with everything.