FrankensteinVariorum / fv-visualization

Visualizing changes across editions of Frankenstein
MIT License

function to pull xml from UMD archive #3

mdlincoln opened 5 years ago

mdlincoln commented 5 years ago

Probably best to set up the code to pull directly from the latest online versions anyway, whether it's the local Frankenstein data or the UMD data, so that we have one pipeline for dealing with everything.

ebeshero commented 5 years ago

@mdlincoln We're already doing this for the CETEIcean interface--check with @raffazizzi about it. A couple of notes, though:

  1. For collating purposes, our process in this project resequences the UMD data to move margin zones into reading sequence (otherwise they sit at the ends of the files), so I maintain a resequenced version of the UMD files. The latest online versions should be pulled into the collation process whenever we run a new collation, but they're probably not going to change very much.
  2. We don't use my resequenced files in the interface. Instead, @raffazizzi is pulling in the UMD TEI directly and using the FV spine data to highlight moments of variation in it. There's some fine-tuning to be done on that pointing, but working directly with the UMD data is part of our project.

ebeshero commented 5 years ago

In effect, what we have are two versions of the UMD data--one remixed version that we're using only for collation at the moment, and another that comes direct from the source. The standoff_Spine directory in fv-data has <ptr> elements targeting the UMD data directly for each moment of variation.

Those standoff_Spine files also contain, at each locus of variation, for each <rdgGrp> (or cluster of editions that agree on a passage), a list of the normalized word tokens for that passage. That gives you a normalized view of the passage--basically what we gave the collation software so it could understand what was altered. It's an easy way to see how our project processed the UMD data, and it comes packaged with collation info, like the edit distances from each point to the next. Hope that helps... Basically, the standoff_Spine files point to the original UMD source and show you how our project processed a given passage.
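
To make that concrete, here is a minimal, hypothetical sketch of the shape being described--the element names follow TEI's critical-apparatus vocabulary, but the file names, IDs, witness sigla, and token list are invented for illustration, not copied from the actual fv-data files:

<app>
  <rdgGrp n="['did', 'not', 'despair']"> <!-- normalized tokens for this reading -->
    <rdg wit="#f1818">
      <!-- non-UMD editions: a plain fragment reference to a planted xml:id -->
      <ptr target="1818_vol1_c01.xml#C01_app12"/>
    </rdg>
    <rdg wit="#fMS">
      <!-- untouched UMD source: an XPointer reaching into the file -->
      <ptr target="ms_abinger_c56-0045.xml#string-range(//zone[2],120,34)"/>
    </rdg>
  </rdgGrp>
</app>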

mdlincoln commented 5 years ago

Thanks for those notes @ebeshero. For this week, I'm just trying to prototype some quick static visualizations, possibly for use in the grant app, so I may hold off on pulling in the UMD files for now just to keep my life a bit simpler. But I would ultimately like to be able to point directly to the UMD files when generating this visualization, just as you plan to do with the CETEIcean interface.

I had a related question regarding the ptr target values in the current spine files: while all of the pointers for the edition chunks reference a simple XML id, the ones for the UMD files are arbitrary XPath queries. I understand that the UMD files require those complex selectors--that's not an issue. But it seems inconsistent that all the pointers share the structure target="path#suffix", yet a subset of those paths require treating suffix as an id (i.e., in pseudo-code, xml_find($path, xpath = ".//*[@id='$suffix']")), while the UMD ones assume that suffix is an entire XPath query (i.e., xml_find($path, xpath = $suffix)).

It's likely I'm missing something, of course...

ebeshero commented 5 years ago

Right, the two methods of pointing are different. That's because we're able to plant @xml:ids in the other (non-UMD) editions as part of the up-conversion process after collation: those edition files get reconstructed in the Variorum to hold data from the collation. The UMD files, though, remain untouched, so the reach with XPath is definitely more complicated. If I'm following you, you're noting that the syntax points at a variable that's constructed differently in each case, and yes, that's the case. @raffazizzi will have more to say about how that works. It's definitely working to retrieve the files, but sometimes the pointer resolution reaches a little too far to the left or right of the passage in cases where we've had to calculate character counts around deletions and insertions--there's some fine-tuning to be done (part of what we're writing for in the grant).

FYI, @Rikkm, @raffazizzi, and I are pounding out the first rough draft of the grant application right now for a first-round draft review due tomorrow, 12/4... and I think we're interpreting "tomorrow" liberally as the end of the day. (We don't necessarily need visuals instantly, but we're working with end-of-this-week deadlines for the university grant offices.) Anyway, whatever you're able to develop is wonderful and will certainly help us! (Thank you!!!) :-) The application is ultimately due in early January, so I bet we can slide some late-breaking visuals in over the course of the month.

ebeshero commented 5 years ago

Note-- edited the above to be a little more informative. :-)

mdlincoln commented 5 years ago

  1. re: xpath/xquery calls - ok, it sounds like we're on the same page! I wanted to make sure I hadn't missed something. Definitely worth more discussion, but probably in the new year.
  2. re: grant - I aim to have some sample visuals (just static, and VERY prototype-y) by the early January deadline just so that we can show that we know at least a bit about what we want to do and that we have the tech chops to do it. I'll share a few early results by the end of this week.

ebeshero commented 5 years ago

@mdlincoln Yay! thank you very much! :-) And also thanks for nudging us with good questions--it really helps the writing process for everything (including these grant apps). There's a LOT of documentation we need to write.

raffazizzi commented 5 years ago

Hi, sorry to jump in a bit late. @ebeshero explained everything :+1: I'll just add this about the collation pointers: retrieving an element by ID is a very simple operation in the DOM, so we use that mechanism whenever we point to ids directly. Only for S-GA do we need XPointer, in order to grab arbitrary chunks of text. The JavaScript code for the demo implements just enough of XPointer to resolve our predictable pointers.

The S-GA data comes directly from the project's GitHub.
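
For anyone reading along, here is a rough sketch of the kind of two-branch resolution being described. This is not the demo's actual code: the regex and the string-range handling are simplified and invented for illustration. It assumes browser DOM APIs and a suffix taken from the part of @target after the #:

// Hypothetical sketch: resolve a spine target suffix against a parsed XML document.
function resolvePointer(doc, suffix) {
  const m = suffix.match(/^string-range\((.+),(\d+),(\d+)\)$/);
  if (!m) {
    // Plain fragment identifier: a cheap ID lookup. Fall back to an explicit
    // XPath in case the parser doesn't apply xml:id semantics.
    return doc.getElementById(suffix) ||
      doc.evaluate(`//*[@xml:id='${suffix}']`, doc, null,
        XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
  }
  // Minimal XPointer support: string-range(xpath, offset, length) only.
  const [, xpath, offset, length] = m;
  const nsResolver = (prefix) => prefix === 'tei' ? 'http://www.tei-c.org/ns/1.0' : null;
  const node = doc.evaluate(xpath, doc, nsResolver,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
  const start = Number(offset);
  return node ? node.textContent.slice(start, start + Number(length)) : null;
}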

mdlincoln commented 5 years ago

My issue is that your design (https://github.com/PghFrankenstein/pghf-reader/blob/master/src/actions/index.js#L113) requires someone to know ahead of time that witness #fMS has target values that need to be parsed with a totally different mechanism.

Ideally, either the source files are set up so that you can have one unified target interface, OR the spine files need a new attribute that specifies how each node's target should be parsed. For simplicity of use, I think the former is the saner solution.

raffazizzi commented 5 years ago

That's a good point, and I'll keep it in mind for the next phase of development. I figure the parsing mechanism could instead be determined by the shape/content of @target, rather than by relying on #fMS. It just seems overkill to implement XPointer for a simple XML ID reference, especially since TEI ascribes both datatypes to the same attribute.

mdlincoln commented 5 years ago

It just seems overkill to implement XPointer for a simple XML ID reference, especially since TEI ascribes both datatypes to the same attribute.

Agreed, but this has implications for future adoption of this innovative spine architecture. Should the onus fall on:

  1. the source TEI publisher to retrofit / create derivatives of their documents with new elements wrapping the desired spans of text, so that the spine only has to do the work of pointing to these elements with xml:id refs?
  2. the spine creator/user to manage a variety of xml search&retrieval methods, so that the spine can be written for any already-existing TEI file?

mdlincoln commented 5 years ago

My initial impression, fwiw, is that it should be option 1: the spine should offer one consistent interface for pointing to other XML files; at worst it should use XPath queries, and it should avoid the overhead of requiring an end user to pull in a library that can handle XQuery/XPointer.

If the target files are so complicated as to necessitate the power of XPointer, then generate a derivative TEI file that materializes those complex queries as simply-referenced XML elements, so that lookups from the spine are consistently simple.
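
As an illustration of what such a derivative might look like (the file names, IDs, and text here are invented, and this is not an existing FV file):

<!-- Spine pointer into the untouched source: needs XPointer to resolve -->
<ptr target="ms_abinger_c56-0045.xml#string-range(//zone[2],120,34)"/>

<!-- Derivative file: the same span materialized as an addressable element -->
<seg xml:id="c56-0045_var12">did not despair</seg>

<!-- Spine pointer into the derivative: a plain fragment reference -->
<ptr target="ms_abinger_c56-0045_derived.xml#c56-0045_var12"/>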

raffazizzi commented 5 years ago

  1. the source TEI publisher to retrofit / create derivatives of their documents with new elements wrapping the desired spans of text, so that the spine only has to do the work of pointing to these elements with xml:id refs?

This issue is at the heart of the spine data model: option 1 is to be avoided in order to increase interoperability between TEI projects.

  2. the spine creator/user to manage a variety of xml search&retrieval methods, so that the spine can be written for any already-existing TEI file?

Ideally, we want the spine to do just that. We didn't need to point to S-GA with complex pointers, but we wanted to show that it is possible to use TEI data without "owning" it. Think of these pointers as IIIF targets to specific regions: you can pre-generate tiles if you know the use case, but otherwise you're better off with a system that can process more complex targeting schemes (e.g. via coordinates and rotation angles for images, string ranges for text, measures and beats for music notation, etc.).

All this is just about what the model does, but it is not unusual in TEI-land to adopt a model and write specific implementations that only fulfill specific project requirements.

If we wanted to create a fuller implementation of the spine data model, it would need to support all of the pointing mechanisms that TEI's @target supports, including XPointer and fragment references via ID (that is, xsd:anyURI). Or we could choose a subset of those (e.g. direct ID reference and XPointer's string-range()).

mdlincoln commented 5 years ago

re: https://github.com/PghFrankenstein/fv-data/issues/8 -- I actually shouldn't even be worrying about the spine for visualization purposes.

That said, yes, if option 2 is the use priority, then the model needs to not only support multiple types of pointing mechanisms but also have some kind of attribute indicating which mechanism it's going to use.

ebeshero commented 5 years ago

I'm re-opening this, since I think we now want to add an attribute to indicate the kind of pointing mechanism we're using.

raffazizzi commented 5 years ago

the model needs to not only support multiple types of pointing mechanisms but also have some kind of attribute indicating which mechanism it's going to use

That's a very good suggestion. We could do that with a @type attribute on <ptr>, though it needs to "characterize the element in some sense" rather than the target itself, so we could have something like:

<ptr type="toElement" target="IDREF" />
<!-- or -->
<ptr type="toStringRange" target="XPOINTER" />

The ODD (and the schema) can define the type values, and we can use Schematron to enforce correspondence between @type and @target. Likewise, implementations can use these resources programmatically to determine behavior, but a hardcoded switch on @type would be faster.
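
A hardcoded dispatch along those lines might look like the following sketch--the type values mirror the <ptr> examples above, and loadDocument, resolveId, and resolveStringRange are hypothetical helpers (the last two could reuse the resolution logic sketched earlier in this thread):

// Hypothetical sketch: dispatch on the spine's @type instead of sniffing
// the witness or the shape of @target.
function resolveTarget(ptrEl, loadDocument) {
  const [path, suffix] = ptrEl.getAttribute('target').split('#');
  const doc = loadDocument(path); // hypothetical loader for the witness file
  switch (ptrEl.getAttribute('type')) {
    case 'toElement':
      return resolveId(doc, suffix);          // plain xml:id lookup (hypothetical helper)
    case 'toStringRange':
      return resolveStringRange(doc, suffix); // minimal string-range() (hypothetical helper)
    default:
      throw new Error(`Unrecognized ptr type: ${ptrEl.getAttribute('type')}`);
  }
}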

ebeshero commented 5 years ago

@raffazizzi I'm thinking we need to open a new issue on one of the other FV repos about this--probably in my postCollation processing, where we're first generating the spine, right? If I remember right, I generate the spine structure first, with pointers to everything except S-GA. In the same XSLT I could plant the @type attribute to indicate which kind of pointer we're using... (then you pick up those files and process them to add the XPointers to S-GA).
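
Purely for illustration (this is not the actual postCollation XSLT--the variable names and values are invented), planting the attribute at spine-generation time could be as simple as emitting it alongside the target:

<!-- Hypothetical: stamp @type while generating each spine pointer -->
<ptr type="toElement" target="{$witness-file}#{$app-id}"/>

<!-- ...and the later S-GA pass would emit -->
<ptr type="toStringRange" target="{$sga-file}#string-range({$xpath},{$start},{$length})"/>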

raffazizzi commented 5 years ago

@ebeshero those two steps are handled by the same script after our last round of work, so we can assign @types there--it should be fairly simple.

mdlincoln commented 5 years ago

@ebeshero since this discussion is now really about fv-postCollation, can you use your administrator privileges to transfer the issue over to that repo?

https://help.github.com/articles/transferring-an-issue-to-another-repository/