Closed AllenDowney closed 12 years ago
I have added a pass to tree_cleaner to replace labels. It is mostly working, but I am stuck on one thing:
backward xrefs see the updated labels; forward references still have the old ones.
Thinking...
This is a problem I have all the time: we use colons in our BibTeX tags! So I postprocess the XML to remove colons from every id and every linkref, replacing them with '_'. I haven't seen the problem with the dot in labels yet, but I should modify my code to take care of that as well.
Do you have your postprocessing code handy?
I am currently attempting it in tree_cleaner, but I think that might be a mistake. Might be easiest to get them all in a second pass.
Allen
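A minimal sketch of what such a second pass could look like, using only the standard library's xml.etree.ElementTree rather than plasTeX or lxml. The attribute names in LABEL_ATTRS, the helper names sanitize_label and second_pass, and the sample document are assumptions made for illustration; the real output may carry labels in other attributes.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: labels live in these attributes. Adjust for the real output.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'
LABEL_ATTRS = ('id', XML_ID, 'linkend')


def sanitize_label(label):
    """Replace ':' and '.' with '_' (hypothetical helper)."""
    return re.sub(r'[:.]', '_', label)


def second_pass(root):
    """Walk the finished tree once and rewrite every label attribute.

    Targets and references go through the same function, so forward
    and backward references end up with identical labels.
    """
    for elem in root.iter():
        for attr in LABEL_ATTRS:
            value = elem.get(attr)
            if value is not None:
                elem.set(attr, sanitize_label(value))
    return root


# A forward reference: the xref appears before the section it points to.
doc = ET.fromstring(
    '<book>'
    '<para>See <xref linkend="sec:later.on"/>.</para>'
    '<section id="sec:later.on"><title>Later</title></section>'
    '</book>'
)
second_pass(doc)
```

Because the pass runs after the whole tree exists, it does not matter whether a reference was written before or after its target.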
In reply to Tim Arnold, Mon, Aug 13, 2012 at 5:18 PM.
I use lxml in postprocessing. Here is the method, along with some commentary in the code:

    # NOTE: `xns` and `ans` are module-level names that are not shown here
    # (presumably a namespace map with a 'd' prefix and a namespace prefix
    # for the new id attribute); `self.drop_tag` is another method.
    def clean_bibliography(self, tree):
        '''
        Most of our bibtex keys have colons and that is illegal in links.
        This method finds all cite links and removes the colon. It also
        removes every colon from the id of all the items in a
        bibliographylist.

        Because there are references throughout a book to the same biblio
        entries (and biblios are done by chapter), we prepend the name of
        the chapter to each biblio anchor and linkend so we don't get
        duplicate anchors.

        Also the plastex natbib.py will return the anchor as a
        filename#anchor and that's illegal in docbook, we just want the
        bit that comes after the # sign.
        '''
        # './/' searches all descendants and works on elements and trees alike.
        for elem in tree.findall('.//d:link', namespaces=xns):
            if not elem.attrib.get('linkend'):
                continue
            parent = elem.getparent()
            if (parent.get('remap') and parent.get('remap').startswith('cite')) \
               or (elem.get('remap') and elem.get('remap').startswith('cite')):
                text = elem.get('linkend')
                text = text.replace(':', '')
                if text.count('#'):
                    # Keep only the part after the '#' sign.
                    i = text.index('#')
                    elem.set('linkend', '%s%s' % (self.name, text[i+1:]))
                else:
                    elem.set('linkend', '%s%s' % (self.name, text))

        for item in tree.findall('.//d:section', namespaces=xns):
            if item.attrib.get('role') != 'bibliography':
                continue
            e = item.findall('d:itemizedlist/d:para', namespaces=xns)
            if e:
                self.drop_tag(e[0])
            for elem in item.iterdescendants():
                id_text = elem.get('id')
                if id_text:
                    # Move the cleaned, chapter-prefixed id to the namespaced
                    # attribute and drop the plain one.
                    elem.set("%sid" % ans, '%s%s' % (self.name,
                                                     id_text.replace(':', '')))
                    del elem.attrib['id']
        return tree
If you decide to go this way, I'll bundle up the code (there's a missing global in the code above and a call to another method).
Ok, I think I've got this. It turned out to be ugly because the labels get stored in the Document.Context, so cleaning them out of the tree is not enough. I had to clean them on the way into the Context as well.
See commit 5eff13092173c44c82bcad8ff176c9329f139f2a (in the oreilly repo)
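To illustrate why cleaning on the way in matters, here is a toy model. It is not plasTeX's Context API and not the code in the commit; LabelContext and its methods are hypothetical names. It only shows the general idea that if labels are normalized both when they are defined and when they are referenced, forward and backward references agree.

```python
import re


def sanitize_label(label):
    """Replace ':' and '.' with '_' (hypothetical helper)."""
    return re.sub(r'[:.]', '_', label)


class LabelContext:
    """Toy label store; a stand-in for illustration, not plasTeX's Context."""

    def __init__(self):
        self._targets = {}

    def define(self, label, target):
        # Clean the label on the way in, so the stored key is XML-friendly.
        self._targets[sanitize_label(label)] = target

    def reference(self, label):
        # Clean references the same way, whether or not the target exists yet.
        return sanitize_label(label)

    def resolve(self, label):
        # Sanitizing is idempotent, so cleaned and raw labels both resolve.
        return self._targets.get(sanitize_label(label))


ctx = LabelContext()
forward = ctx.reference('fig:plot.1')    # seen before the label is defined
ctx.define('fig:plot.1', 'Figure 1')
backward = ctx.reference('fig:plot.1')   # seen after the label is defined
```

In this sketch both references produce the same key, and that key resolves to the target regardless of the order in which things were seen.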
Some labels that are legal in LaTeX are not legal in XML. For example, they are not allowed to have : or . in the label.
_ is ok.
So we either need to ban XML-unfriendly labels or get plasTeX to convert them.
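The two options can be sketched side by side. This is only an illustration under the assumption that ':' and '.' are the only characters to worry about; check_label and convert_labels are hypothetical names, not plasTeX functions. The converter also refuses to map two different labels to the same result, since replacing both characters with '_' can create collisions.

```python
import re

# Assumption: ':' and '.' are the only problem characters.
UNFRIENDLY = re.compile(r'[:.]')


def check_label(label):
    """'Ban' option: reject a label that contains ':' or '.'."""
    bad = sorted(set(UNFRIENDLY.findall(label)))
    if bad:
        raise ValueError('label %r contains %s' % (label, ' '.join(bad)))
    return label


def convert_labels(labels):
    """'Convert' option: map each label to an XML-friendly form.

    Raises ValueError if two distinct labels would become identical.
    """
    mapping = {}
    owner = {}  # cleaned label -> original label that produced it
    for label in labels:
        clean = UNFRIENDLY.sub('_', label)
        if clean in owner and owner[clean] != label:
            raise ValueError('%r and %r both map to %r'
                             % (owner[clean], label, clean))
        owner[clean] = label
        mapping[label] = clean
    return mapping


mapping = convert_labels(['sec:intro', 'fig.1', 'ok_label'])
```

Banning is simpler but pushes the work onto authors; converting keeps existing LaTeX sources usable at the cost of the collision check.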