Closed sonofmun closed 4 years ago
Are the Travis tests not required for merging?
This is a deeper dive than I was expecting. I have created specific collection prototypes for the new guidelines for readable and collection. These appear to be working (at least with my data). I have also started working on a specific local parser that follows the new guidelines. I expect that next week I will add both of those to this PR.
That's wonderful ;)
So I have the basic functionality of the new metadata prototypes and a new resolver based on the new CapiTainS guidelines working. I think my code is still pretty ugly but they both represent another step away from the dependence on CTS. I have written no test up to now because I assume that at least some of the classes will change significantly before it is ready for release.
A list of the major changes that spring to mind that are contained here:
str
representation instead. It may be that we will want a new class to reproduce some of the functionality of the URN class. But that seemed way too dependent on the URN formatting for the previous guidelines to be very useful here.If I think of any more major changes that slipped my mind, I will add them as more comments here.
I have done some more work here, cleaning up unnecessary classes and functions, refactoring of function and parameter names that were dependent on the previous guidelines (e.g., anything with 'work' or 'textgroup' in the name), and a bit more generalization for the local resolver functions. There are still some extra, unused classes in resources.prototypes.capitains.collection. I am leaving those there for now. I can remove them once it is clear that they will not be needed.
Do you need some input from me here ?
I am mostly interested here in whether you think that I am on the right track with what I am doing here. If you generally think that things are OK, then I will go ahead and write unit tests for this stuff. And then we discuss the details once the PR if fully ripe.
I honestly think you are, I am wondering about some little things (given how DTS work, should we keep the dispatcher which was a hack for CTS resources ? Or should we keep it ?). Probably we should have a test repo where we look at what it creates with the current status.
I would lean toward keeping the things we have and adding more stuff, as long as it doesn't make the code base completely unwieldy. I would like to preserve as much backward compatibility as possible. If we can do that, then we only have one version to maintain instead of needing to maintain the old one in order to accommodate users who don't want to update their corpora to the new guidelines. In any case, I will now move over to writing unit tests and finishing the integration of these changes into Nautilus and HookTest.
Perfect !
There is a problem with how the resolver deals with children of a collection that is referred to twice. Let's start with examples of the __capitains__.xml
files.
The original collection would look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../../capitains.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<collection xmlns:ti="http://chs.harvard.edu/xmlns/cts"
xmlns:dct="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns="http://purl.org/ns/capitains"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:bib="http://bibliotek-o.org/1.0/ontology/"
xmlns:cts="http://chs.harvard.edu/xmlns/cts"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<identifier>urn:cts:formulae:passau</identifier>
<dc:title xml:lang="deu">Die Traditionen des Hochstifts <span class="collection-origin">Passau</span></dc:title>
<dc:type>cts:textgroup</dc:type>
<structured-metadata>
<bib:AbbreviatedTitle>Traditionen Passau</bib:AbbreviatedTitle>
</structured-metadata>
<members>
<collection path="./heuwieser0073/__capitains__.xml"
identifier="urn:cts:formulae:passau.heuwieser0073"/>
<collection path="./heuwieser0083/__capitains__.xml"
identifier="urn:cts:formulae:passau.heuwieser0083"/>
</members>
</collection>
Then, the file where this file is part of another collection would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../../capitains.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<collection xmlns:ti="http://chs.harvard.edu/xmlns/cts"
xmlns:dct="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns="http://purl.org/ns/capitains"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:bib="http://bibliotek-o.org/1.0/ontology/"
xmlns:cts="http://chs.harvard.edu/xmlns/cts"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<identifier>a:different.identifier</identifier>
<dc:title xml:lang="deu">This is a collected collection</dc:title>
<dc:type>cts:textgroup</dc:type>
<structured-metadata>
<bib:AbbreviatedTitle>Collected</bib:AbbreviatedTitle>
</structured-metadata>
<members>
<collection path="../passau/heuwieser0083/__capitains__.xml" identifier="urn:cts:formulae:passau.heuwieser0083"/>
<collection readable="true" path="../passau/heuwieser0073/passau.heuwieser0073.lat001.xml">
<identifier>urn:cts:formulae:passau.heuwieser0073.lat001</identifier>
<parent>urn:cts:formulae:passau.heuwieser0073</parent>
<dc:title xml:lang="deu">Die Traditionen des Hochstifts Passau (Ed. Heuwieser) Nr. 73</dc:title>
<dc:description xml:lang="deu">ZEUGENAUSSAGEN (AUFZEICHNUNGEN) ÜBER DEN DER PASSAUER KIRCHE GEHÖRIGEN BESITZ DER KLÖSTERCHEN ROTT (POSTMÜNSTER) UND RIMBACH, ZU UTTENDORF UND OBERBIERBACH (BAYERBACH).</dc:description>
<dc:language>lat</dc:language>
<dc:type>cts:edition</dc:type>
<dc:contributor>Prof. Dr. Philippe Depreux (Universität Hamburg)</dc:contributor>
<dc:contributor>Franziska Quaas (Universität Hamburg)</dc:contributor>
<dc:contributor>Margherita Mariani (Studentische Hilfskraft, Universität Hamburg)</dc:contributor>
<dc:contributor>Matthew Munson (Universität Hamburg)</dc:contributor>
<dc:contributor>Morgane Pica (Praktikantin, Ecole nationale des Chartes, Paris, Frankreich)</dc:contributor>
<dc:publisher xml:lang="mul">Formulae-Litterae-Chartae Projekt</dc:publisher>
<dc:format>application/tei+xml</dc:format>
<dc:source>Die Traditionen des Hochstifts Passau Nr. 73, in: Max Heuwieser, Die Traditionen des Hochstifts Passau, München 1930 1969, [URI: http://d-nb.info/458436127], S. 61-62.</dc:source>
<structured-metadata>
<dct:abstract xml:lang="deu"/>
<bib:editor>Heuwieser, Max</bib:editor>
<dct:dateCopyrighted>1930 1969</dct:dateCopyrighted>
<dct:created/>
<dct:bibliographicCitation>Die Traditionen des Hochstifts Passau Nr. 73, in: Max Heuwieser, Die Traditionen des Hochstifts Passau, München 1930 1969, [URI: http://d-nb.info/458436127], S. 61-62.</dct:bibliographicCitation>
</structured-metadata>
</collection>
<collection readable="true" path="../passau/heuwieser0073/passau.heuwieser0073.lat005.xml">
<identifier>urn:cts:formulae:passau.heuwieser0073.lat005</identifier>
<parent>urn:cts:formulae:passau.heuwieser0073</parent>
<dc:title xml:lang="deu">Die Traditionen des Hochstifts Passau (Ed. Heuwieser) Nr. 73</dc:title>
<dc:description xml:lang="deu">ZEUGENAUSSAGEN (AUFZEICHNUNGEN) ÜBER DEN DER PASSAUER KIRCHE GEHÖRIGEN BESITZ DER KLÖSTERCHEN ROTT (POSTMÜNSTER) UND RIMBACH, ZU UTTENDORF UND OBERBIERBACH (BAYERBACH).</dc:description>
<dc:language>lat</dc:language>
<dc:type>cts:edition</dc:type>
<dc:contributor>Prof. Dr. Philippe Depreux (Universität Hamburg)</dc:contributor>
<dc:contributor>Franziska Quaas (Universität Hamburg)</dc:contributor>
<dc:contributor>Margherita Mariani (Studentische Hilfskraft, Universität Hamburg)</dc:contributor>
<dc:contributor>Matthew Munson (Universität Hamburg)</dc:contributor>
<dc:contributor>Morgane Pica (Praktikantin, Ecole nationale des Chartes, Paris, Frankreich)</dc:contributor>
<dc:publisher xml:lang="mul">Formulae-Litterae-Chartae Projekt</dc:publisher>
<dc:format>application/tei+xml</dc:format>
<dc:source>Die Traditionen des Hochstifts Passau Nr. 73, in: Max Heuwieser, Die Traditionen des Hochstifts Passau, München 1930 1969, [URI: http://d-nb.info/458436127], S. 61-62.</dc:source>
<structured-metadata>
<dct:abstract xml:lang="deu"/>
<bib:editor>Heuwieser, Max</bib:editor>
<dct:dateCopyrighted>1930 1969</dct:dateCopyrighted>
<dct:created/>
<dct:bibliographicCitation>Die Traditionen des Hochstifts Passau Nr. 73, in: Max Heuwieser, Die Traditionen des Hochstifts Passau, München 1930 1969, [URI: http://d-nb.info/458436127], S. 61-62.</dct:bibliographicCitation>
</structured-metadata>
</collection>
</members>
</collection>
Both of these files refer to the __capitains__.xml
file in the passau/heuwieser0083/
folder. These two collections are then parsed as XmlCapitainsCollectionMetadata
, which produces the following representations of their children:
The first file:
[XmlCapitainsCollectionMetadata(urn:cts:formulae:passau.heuwieser0073)#140265110261888, XmlCapitainsCollectionMetadata(urn:cts:formulae:passau.heuwieser0083)#140265110264856]
The second file:
[XmlCapitainsReadableMetadata(urn:cts:formulae:passau.heuwieser0073.lat001)#140265110265584, XmlCapitainsReadableMetadata(urn:cts:formulae:passau.heuwieser0073.lat005)#140265110265136, XmlCapitainsCollectionMetadata(urn:cts:formulae:passau.heuwieser0083)#140265110263904]
So Passau 83 in the first collection is represented by the object XmlCapitainsCollectionMetadata(urn:cts:formulae:passau.heuwieser0083)#140265110264856
and in the second collection by the object XmlCapitainsCollectionMetadata(urn:cts:formulae:passau.heuwieser0083)#140265110263904
. Both of these object, however, have the same id
, i.e., 'urn:cts:formulae:passau.heuwieser0083'.
Then, when the resolver dispatches the two collection with this code: https://github.com/Capitains/MyCapytain/blob/c93b86cb1ddbab106320a6f0b211e1a67c917022/MyCapytain/resolvers/capitains/local.py#L206-L219
it correctly adds the children to the original collection, since this collection was dispatched first, but when it comes to the second collection, the object from the first collection is what is called with the line https://github.com/Capitains/MyCapytain/blob/c93b86cb1ddbab106320a6f0b211e1a67c917022/MyCapytain/resolvers/capitains/local.py#L213 and that is then updated with the children from the second collection. The object from the second collection, however, is never updated with its children. So that means that the second collection is never followed all the way to its readable descendants.
Regarding our conversation earlier today @PonteIneptique , I don't think that having dictionaries and then parsing later will solve this problem. Once you parse an XmlCapitainsCollectionMetadata
, it is the object that is set as the member and not the identifier. So the descendants need to be added to that object, not just to the identifier.
Perhaps the way to go would be to replace the actual objects from the second collection with the objects from the first, but I am not sure what the best way to do this is. Or even if it is possible with the system as it now is.
One way to deal with that is that the Collection object parse method should have a dictionary which would provide already parsed elements, that way, it would prevent parsing a second time.
Hmm. How would I do that without making XmlCapitainsCollectionMetadata
, which is where the parsing of the XML happens, dependent on the resolver? XmlCapitainsCollectionMetadata
doesn't know whether anything else has been parsed. And I don't know if it should.
Make it an optional parameter that is used only if it is present (or in this case, created if it is not)
def parse(xml, parsed=None):
if parsed is None:
parsed = {}
# Then use it and transmit it to further parsing, this way, if you end up with a cycle (children connected to grand parents), it does not reparse !
Note that the same way the resolver could be optional. And I think it would be fine, because it might actually simplify a lot of things to make those dependants. Except for hooktest, things are always looked at through the lens of a resolver. And the resolver might be the best way to handle relationships between objects !
I think I like this second option (of handling relationships through the resolver) better. That would mean that XmlCapitainsCollectionMetadata
would only carry metadata at the collection level and wouldn't know anything about its children, or at least not until the resolver injected that information into the collection. I think it is a good idea to keep the information about a collections descendants on the collection object itself. But I think it is a good idea to add these things to the collection through the resolver and not during the parsing of the collection itself.
I will start to work in this direction and see what I can come up with.
Note that I do not think they are both contradictory, I think they might even work better together :)
I have come up with a solution. Let me know what you think @PonteIneptique , especially from a performance standpoint. I will write unit tests over the next couple of days to make sure that it does what it is supposed to do.
I have reactivated the ability within XmlCapitainsCollectionMetadata.parse
that the collection be connected to its children. This still works with the resolver since the resolver will update the dictionary of children and overwrite those children with the ones that should be used within the resolver environment.
Question @PonteIneptique : Should a readable text be listed twice under readableDescendants
if it occurs in two different collections, e.g., in its original collection and in a collected collection that pulls members from this original collection? This deals with the case already illustrated above.
Question @PonteIneptique : Should a readable text be listed twice under
readableDescendants
if it occurs in two different collections, e.g., in its original collection and in a collected collection that pulls members from this original collection? This deals with the case already illustrated above.
Nope, readableDescendants are thought to be unique, because of the way CTS worked. This is not documented though. It should be an ordered set in some way :)
From a Python standpoint, I don't think you can have a set
of CapitainsReadableMetadata
objects since CapitainsReadableMetadata
objects are not hashable. It shouldn't be difficult to work around this, however.
And what about plain descendants
? Can those be repeated in that case above?
Same situations. Regarding non-duplicability, that's why dictionaries are useful x[id] = Coll
-> x.values()
will be a set of collection ;)
Quick question @PonteIneptique : In MyCapytain 2, when retrieving a textual node, there was the idea of the canonical text, meaning that if you sent a work-level URN to __getText__
, it would return the text from one of the editions of that work.
The question is whether we want this functionality for version 3.0. Right now, if you request the text from an object that is not readable
, __getText__
throws and error saying the object is not readable. Should it, instead, grab a random readableDescendant
of the object? Since we no longer have the concept of edition
vs. translation
encoded in the object itself, we can't request an edition.
My feeling is we should keep it throwing an error and let the implementers deal with that as they see fit, e.g., catch the error and return a text of their choosing, instead of hard-coding behavior in there.
OK, so the get_label
problem was not solved, at least not in Python 3.5. Check out the code below to see the problem:
https://github.com/Capitains/MyCapytain/blob/585510d0bc1cc0035508b9d54059a144a7386ca9/tests/resolvers/guidelines_v3/test_local.py#L593-L659
The language label is actually present on the Literal
object value in the graph, as is shown by this output:
for predicate, obj in node.metadata.graph[node.metadata.asNode()]:
try:
print(obj, 'LANGUAGE', obj.language)
except:
continue
OUTPUT
Salzburger Urkundenbuch (Ed. Hauthaler); Codex A Nummer 100 LANGUAGE deu
Hauthaler, Willibald LANGUAGE None
a) Der Edle Vodalhard (Odalhard) übergibt dem Erzbischof 7 Huben am Ergoltsbach (Zufluss der kleinen Laber n. Landshut) mit Vorbehalt von 3 Joch in jeder Zelch und einer Hofstatt auf der Westseite als Hantgemâl (Stammgut), wofür er zu ewigem Eigenthum für sich und seine Nachkommen erhält, was er bisher zu Weidenbach (w. Ampfing) als Eigen innegehabt hat, nur ausgenommen die Kirche, den Kirchhof und eine Baustätte. b) Altorf übergibt dem Erzbischof am gleichen Tage 5 Hubcn zu Ergoltsbach und erhält dafür Volagangesperch (seit 13. Jahrh. Neumarkt a/Rott), wieder ausgenommen die Kirche, den Kirchhof und eine Baustätte. LANGUAGE deu
Salzburger Urkundenbuch (Ed. Hauthaler); Codex A Nummer 100 LANGUAGE deu
lat LANGUAGE None
LANGUAGE deu
Salzburger Urkundenbuch (Ed. Hauthaler); Codex A Nummer 100 LANGUAGE deu
LANGUAGE None
1910 LANGUAGE None
Salzburger Urkundenbuch; Codex A Nummer 100, in: Willibald Hauthaler, Salzburger Urkundenbuch Bd. 2, Salzburg 1910, S. 162-163. LANGUAGE None
OK. So I think (again) this was user error in that I had labelled the title in the __capitains__.xml
file as 'deu' when it should have been 'eng'. Hopefully the tests will pass now.
I am not sure what is going on now. The test works in 3.6 but not in 3.5. And they both install the same version of LXML.
The problem was that, as of Python 3.6, a dictionary iterator was returned in the insertion order. This was not the case in 3.5. What this meant was that in 3.6, since the top-level collections were always inserted in the id_to_coll
dictionary first, they would be dispatched first. And once they were dispatched as children of the default
collection, their children would be added to them instead of being dispatched at the top level. In 3.5, however, since the dict did not iterate in insertion order, any random descendant collection could be dispatched before the top-level collection came up for dispatching.
The solution I have implemented here is better because it will always dispatch the collections that have no parents first, thus meaning that any collection that has a parent will not be dispatched but, instead, added to its parent.
This last change was made to deal with the fact that between Python 3.5 and 3.6 dictionaries changed to be sorted by insertion order. This is the second time I have had to do this. If we are going to continue to support Python 3.5, we may want to start using OrderedDict so that we can always depend on dictionary order.
I think we could drop < 3.6
That would make my life much easier! These problems have made me worry a bit that some functions may rely on dictionary order too much. Perhaps it is something that is not such a problem with dictionaries ordered according to insertion order. Just something to keep an eye on as you work through the PR.
I.e., keep in mind where a function depends on dictionary order to do the right thing and perhaps in these cases add some sort of sorting to the dictionary iterator to make sure that the iterator goes off in the correct order.
Good morning ! Btw, regarding type checking, I found this resource (MyPy) https://realpython.com/python-type-checking/#the-mypy-project that we might want to use :)
Some updated information on how much slower the parsing of a collection is with this new code as opposed to the CTS-based code.
When parsing a large collection (like our collection of almost 5000 texts), the vast majority of the time in the process is spent dispatching the collection (around 97-98% for both CTS and V3). When we look at the results of the line profiler for the V3 code for _dispatch
, we see the following results:
Total time: 530.619 s
File: /home/matt/MyCapytain/MyCapytain/resolvers/capitains/local.py
Function: _dispatch at line 208
Line # Hits Time Per Hit % Time Line Contents
==============================================================
208 def _dispatch(self, collection, directory):
209 """ Run the dispatcher over a textgroup.
210
211 :param collection: Collection object that needs to be dispatched
212 :param directory: Directory in which the Collection was found
213 """
214 4766 265166483.0 55637.1 50.0 if collection.id in self.dispatcher.collection:
215 4743 265173596.0 55908.4 50.0 coll = self.dispatcher.collection[collection.id]
216 4743 276352.0 58.3 0.1 coll.update(collection)
217 else:
218 23 2254.0 98.0 0.0 self.dispatcher.dispatch(collection, path=directory)
It spent almost 9 minutes in the _dispatch
function and it looks like the really problematic part is the access to self.dispatcher
. Just calling a key from self.dispatcher
and assigning it to a variable took up 50% of the time, which means that this is also why the other 50% of the time was spent
And the problem here is that every time you want to access an item in the collection, you call the descendants
function, which goes through every descendant every single time. Take a look at the following profiler results:
Total time: 275.269 s
File: /home/matt/MyCapytain/MyCapytain/resources/prototypes/metadata.py
Function: __getitem__ at line 211
Line # Hits Time Per Hit % Time Line Contents
==============================================================
211 def __getitem__(self, key):
212 """ Retrieve an item by its ID in the tree of a collection
213
214 :param key: Key of the object to delete
215 :return: Collection identified by the item
216 """
217 14641 28566.0 2.0 0.0 if key == self.id:
218 return self
219 19936 37983.0 1.9 0.0 for obj in self.members:
220 15193 16357.0 1.1 0.0 if key == obj.id:
221 9898 3524.0 0.4 0.0 return obj
222 21456397 250995195.0 11.7 91.2 for obj in self.descendants:
223 21456397 23904589.0 1.1 8.7 if obj.id == key:
224 4743 282552.0 59.6 0.1 return obj
225 raise UnknownCollection("%s is not part of this object" % key)
In order to return 4743 values, self.descendants
gets called over 21 million times. That averages over 4500 calls per value returned. And even though it doesn't take that long to run (11.7 microseconds), that number of calls simply weighs the system down. I think we need to figure out a more efficient way to walk through the descendants of an object than what we have now, which is:
member_dict = {member.id: member for member in self.members}
member_dict.update({submember.id: submember for member in self.members for submember in member.descendants})
return list(member_dict.values())
This function needs to be recursive as written since it is not clear how deep it should go.
Perhaps the best way would be to make a self._descendants
attribute on each object? And perhaps this could be calculated during the calls to self._add_member
?
I am interested to hear your thoughts @PonteIneptique
Actually thinking more about this, the easiest answer may be just to make self.descendants
return a dictionary. Then we wouldn't need to loop through a list and check every single value. I will see what I can get to work.
Maybe the dispatcher should really reuse the collection dict of its resolver ? This would avoid changing descendants as well ?
I'll be free tomorrow morning for a chat if you want
I think we should change descendants
(as well as members
, readableDescendants
and any other iterator over the members of a collection) to return a dictionary anyway. Because we use the results of these iterators for two things. Either we want to iterate over all of the objects, in which case we could use dict.items() in place of the list, or we want to find a specific item in the iterator, usually on that item's id, in which case we get code like we have above in the __getitem__
function where we have to iterate through the whole list of descendants and test the id of every one just to find a single object. If we could rewrite that portion of __getitem__
to return self.descendants[obj.id]
, it would save a lot of loops and a lot of time.
I think I should have time to talk tomorrow morning. But we have our meeting right now. I will let you know afterwards what time would work.
Changing descendants
so that it returns a dictionary did not work. It actually slowed things down significantly.
I have made almost all of the changes here. The only one that I would have liked to have made that I didn't was refactoring parent
to parents
. I think that the @parent.setter
may have caused some problems here, though I am not 100% sure. In any case, maybe this is something that we can put off until we move the relationships
to the resolver.
I'll let you squash and merge, so that this stays your commit in merging. I do not feel like the full history is important here.
But be sure to squash
At this point, it works with
__capitains__.xml
files in the mold of those in the guidelines (https://github.com/Capitains/guidelines/blob/master/data/martial/epigrammata/__capitains__.xml and https://github.com/Capitains/guidelines/blob/master/data/martial/__capitains__.xml). It does not go any farther than parsing the children named in<members>
, i.e., it does not follow the links to other metadata files. This still needs work, so this PR is here to give concrete code for discussion.