plasTeX needs to convert labels that are not legal in XML

AllenDowney commented 12 years ago

Some labels that are legal in LaTeX are not legal in XML. For example, they are not allowed to have : or . in the label.

_ is ok.

So we either need to ban XML-unfriendly labels or get plasTeX to convert them.
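A minimal sketch of the "convert them" option, assuming the rule of thumb above (':' and '.' are out, '_' is fine). The name `xml_safe_label` and the exact character set are assumptions for illustration, not anything plasTeX provides; the set is deliberately conservative and keeps only ASCII letters, digits, and underscores.

```python
import re

# Assumption: anything other than ASCII letters, digits, and '_' is treated
# as unsafe. This is stricter than XML itself requires.
_UNSAFE = re.compile(r'[^A-Za-z0-9_]')


def xml_safe_label(label):
    """Return a version of a LaTeX label that avoids XML-unfriendly characters."""
    cleaned = _UNSAFE.sub('_', label)
    # An XML name may not start with a digit, so prefix an underscore if needed.
    if not cleaned or cleaned[0].isdigit():
        cleaned = '_' + cleaned
    return cleaned


print(xml_safe_label('fig:intro.1'))
```

One caveat with this kind of mapping: distinct labels such as `fig:a` and `fig.a` collapse to the same cleaned label, so a real implementation would probably need a collision check.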

AllenDowney commented 12 years ago

I have added a pass to tree_cleaner to replace labels. It is mostly working, but I am stuck on one thing:

Backward xrefs see the updated labels, but forward references still have the old ones.

Thinking...

tiarno commented 12 years ago

This is a problem I have all the time--we use colons in our BibTeX tags! So I postprocess the XML to remove colons from every id and every linkref, replacing with '_'. I haven't seen the problem with the dot in labels yet; I should modify my code to take care of that as well.

AllenDowney commented 12 years ago

Do you have your postprocessing code handy?

I am currently attempting it in tree_cleaner, but I think that might be a mistake. Might be easiest to get them all in a second pass.

Allen

On Mon, Aug 13, 2012 at 5:18 PM, Tim Arnold notifications@github.com wrote:

This is a problem I have all the time--we use colons in our BibTeX tags! So I postprocess the XML to remove colons from every id and every linkref, replacing with '_'. I haven't seen the problem with the dot in labels yet; I should modify my code to take care of that as well.

Reply to this email directly or view it on GitHub: https://github.com/AllenDowney/plastex-oreilly/issues/20#issuecomment-7708397

tiarno commented 12 years ago

I use lxml in postprocessing. Here is the method, along with some commentary in the code:


    def clean_bibliography(self, tree):
        '''
           Most of our BibTeX keys contain colons, which are illegal in links.
           This method finds all cite links and removes the colons. It also
           removes the colons from the id of every item in a bibliography list.

           Because there are references throughout a book to the same biblio
           entries (and biblios are done by chapter), we prepend the name of
           the chapter to each biblio anchor and linkend so we don't get
           duplicate anchors.

           Also, the plasTeX natbib.py returns the anchor as
           filename#anchor, which is illegal in DocBook; we just want the
           bit that comes after the # sign.
        '''
        # xns, ans and self.drop_tag() are defined elsewhere and not shown.
        # './/' searches all descendants and works on trees and elements alike.
        for elem in tree.findall('.//d:link', namespaces=xns):
            if not elem.get('linkend'):
                continue
            parent = elem.getparent()
            parent_remap = parent.get('remap') or ''
            elem_remap = elem.get('remap') or ''
            if parent_remap.startswith('cite') or elem_remap.startswith('cite'):
                text = elem.get('linkend').replace(':', '')
                # Keep only the part after the first '#', if there is one.
                anchor = text.split('#', 1)[-1]
                elem.set('linkend', '%s%s' % (self.name, anchor))

        for item in tree.findall('.//d:section', namespaces=xns):
            if item.get('role') != 'bibliography':
                continue
            e = item.findall('d:itemizedlist/d:para', namespaces=xns)
            if e:
                self.drop_tag(e[0])
            for elem in item.iterdescendants():
                id_text = elem.get('id')
                if id_text:
                    # Move the cleaned id to the attribute named by ans + 'id'.
                    elem.set('%sid' % ans, '%s%s' % (self.name,
                                                     id_text.replace(':', '')))
                    del elem.attrib['id']
        return tree

tiarno commented 12 years ago

If you decide to go this way, I'll bundle up the code (there's a missing global in the code above and a call to another method).
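For comparison, here is a stripped-down sketch of the same second-pass idea that also handles dots. It is not the code offered above: it swaps lxml for the standard library's `xml.etree.ElementTree`, ignores namespaces and `xml:id`, and the names `clean_label` and `clean_ids_and_links` plus the sample document are made up for illustration. The point it shows is that running one cleaning function over every `id` and every `linkend` in the finished tree keeps forward and backward references in agreement.

```python
import xml.etree.ElementTree as ET


def clean_label(text):
    # Hypothetical rule: replace ':' and '.' with '_'.
    return text.replace(':', '_').replace('.', '_')


def clean_ids_and_links(root, attrs=('id', 'linkend')):
    """Rewrite every id and linkend attribute in one pass over the tree."""
    for elem in root.iter():
        for name in attrs:
            value = elem.get(name)
            if value is not None:
                elem.set(name, clean_label(value))
    return root


# A forward reference: the link appears before the element it points to.
doc = ET.fromstring(
    '<book>'
    '<para>See <link linkend="fig:later.1">the figure</link>.</para>'
    '<figure id="fig:later.1"/>'
    '</book>'
)
clean_ids_and_links(doc)
print(ET.tostring(doc, encoding='unicode'))
```

Because the pass runs after the whole tree exists, it does not matter whether a reference comes before or after its target.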

AllenDowney commented 12 years ago

Ok, I think I've got this. It turned out to be ugly because the labels get stored in the Document.Context, so cleaning them out of the tree is not enough. I had to clean them on the way into the Context as well.
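The general idea of cleaning labels where they are stored, and not only in the tree, can be illustrated with a toy registry. This is not plasTeX's Context API and not the code in the commit; `LabelRegistry`, `add_label`, `add_ref`, and `clean_label` are hypothetical names used only to show why forward references stop seeing stale labels once both entry points apply the same cleaning.

```python
def clean_label(label):
    # Hypothetical rule: replace ':' and '.' with '_'.
    return label.replace(':', '_').replace('.', '_')


class LabelRegistry:
    """Toy stand-in for a label store; NOT plasTeX's actual Context."""

    def __init__(self):
        self.labels = {}   # cleaned label -> target node
        self.pending = {}  # cleaned label -> references waiting for a target

    def add_label(self, label, node):
        key = clean_label(label)
        self.labels[key] = node
        # Resolve any forward references that arrived before this label.
        for ref in self.pending.pop(key, []):
            ref['target'] = key

    def add_ref(self, label, ref):
        key = clean_label(label)
        if key in self.labels:
            ref['target'] = key  # backward reference: label already known
        else:
            self.pending.setdefault(key, []).append(ref)


registry = LabelRegistry()
forward, backward = {}, {}
registry.add_ref('fig:later', forward)        # reference seen before the label
registry.add_label('fig:later', 'figure-node')
registry.add_ref('fig:later', backward)       # reference seen after the label
print(forward, backward)
```

Both references end up with the same cleaned target because the label is normalized on the way in, whichever order the label and reference arrive in.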

See commit 5eff13092173c44c82bcad8ff176c9329f139f2a (in the oreilly repo)
