eregs / regulations-site

Web interface for viewing U.S. federal regulations and other regulatory information
https://regulations.atf.gov/
Creative Commons Zero v1.0 Universal
20 stars 46 forks source link

External citations do not work with Python 3 #514

Closed vrajmohan closed 6 years ago

vrajmohan commented 6 years ago

Query parameters of the form &section=foo seem to be prematurely and incorrectly interpreted as the section entity reference § and end up as §ion=foo! E.g.,

vrajmohan commented 6 years ago

@cmc333333 correctly identified the cause as https://github.com/eregs/regulations-site/blob/master/regulations/generator/layers/layers_applier.py#L32. This does not happen on Python 2 but happens in Python3 as a result of https://docs.python.org/3.6/library/html.entities.html#html.entities.html5.

Note that the trailing semicolon is included in the name (e.g. 'gt;'), however some of the names are accepted by the standard even without the semicolon: in this case the name is present with and without the ';'

vrajmohan commented 6 years ago

A possible useful observation - in https://www.fec.gov/regulations/111-1/2018-annual-111#111-1, the 1st citation (52 U.S.C. 30101) is affected while the 2nd one (26 U.S.C. 9001) is not.

vrajmohan commented 6 years ago

The 1st citation is affected but not the 2nd one because the unescaping is invoked multiple times. It is invoked for each replacement, not just on the replacement text, but on the entire node text at https://github.com/eregs/regulations-site/blob/master/regulations/generator/layers/layers_applier.py#L49 and https://github.com/eregs/regulations-site/blob/master/regulations/generator/layers/layers_applier.py#L58.

When I removed the calls to self.unescape_text(), the FEC regulations looked fine at a cursory glance and the external citations worked properly.

It would be nice to understand what is meant by the comment https://github.com/eregs/regulations-site/blob/master/regulations/generator/layers/layers_applier.py#L33.

vrajmohan commented 6 years ago

Thanks for fixing this quickly!

cmc333333 commented 6 years ago

Thanks for finding it @vrajmohan ! We had to review a bunch of history going back to CFPB (https://github.com/cfpb/regulations-site/pull/69), but the problem really stemmed from #42, in which we sped up layer replacement, but forgot to remove the unescape logic from the previous implementation. Props to @tadhg-ohiggins for the fix!