facelessuser / soupsieve

A modern CSS selector implementation for BeautifulSoup
https://facelessuser.github.io/soupsieve/
MIT License

TypeError when namespaces contain the key "self" #216

Closed Leyard closed 3 years ago

Leyard commented 3 years ago

I was using bs4/soupsieve to parse some XML files from the SEC website. Here is my MWE:

import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1031235/000156459017010923/self-20170331.xml"
r = requests.get(url)

soup = BeautifulSoup(r.content, "xml")
print(soup.select("identifier"))

Parsing worked smoothly, but calling select with a CSS selector raised a strange TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-12dc7c20e133> in <module>
----> 1 soup.select("identifier")

~/opt/anaconda3/lib/python3.8/site-packages/bs4/element.py in select(self, selector, namespaces, limit, **kwargs)
   1867             )
   1868
-> 1869         results = soupsieve.select(selector, self, namespaces, limit, **kwargs)
   1870
   1871         # We do this because it's more consistent and because

~/opt/anaconda3/lib/python3.8/site-packages/soupsieve/__init__.py in select(select, tag, namespaces, limit, flags, **kwargs)
     96     """Select the specified tags."""
     97
---> 98     return compile(select, namespaces, flags, **kwargs).select(tag, limit)
     99
    100

~/opt/anaconda3/lib/python3.8/site-packages/soupsieve/__init__.py in compile(pattern, namespaces, flags, **kwargs)
     45
     46     if namespaces is not None:
---> 47         namespaces = ct.Namespaces(**namespaces)
     48
     49     custom = kwargs.get('custom')

TypeError: __init__() got multiple values for argument 'self'

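It turned out that this specific XML file declares a namespace with the prefix "self". bs4 passes the document's namespaces on to soupsieve, which unpacks them as keyword arguments on line 47 of soupsieve/__init__.py, so the "self" key collides with the method's own self parameter and raises the TypeError.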

In [11]: soup._namespaces
Out[11]:
{'xml': 'http://www.w3.org/XML/1998/namespace',
 'utr': 'http://www.xbrl.org/2009/utr',
 'iso4217': 'http://www.xbrl.org/2003/iso4217',
 'self': 'http://globalselfstorageinc.com/20170331',
 'xbrll': 'http://www.xbrl.org/2003/linkbase',
 'xlink': 'http://www.w3.org/1999/xlink',
 'nonnum': 'http://www.xbrl.org/dtr/type/non-numeric',
 'num': 'http://www.xbrl.org/dtr/type/numeric',
 'xbrldt': 'http://xbrl.org/2005/xbrldt',
 'us-types': 'http://fasb.org/us-types/2016-01-31',
 'us-gaap': 'http://fasb.org/us-gaap/2016-01-31',
 'dei': 'http://xbrl.sec.gov/dei/2014-01-31',
 'country': 'http://xbrl.sec.gov/country/2016-01-31',
 'currency': 'http://xbrl.sec.gov/currency/2016-01-31',
 'exch': 'http://xbrl.sec.gov/exch/2016-01-31',
 'invest': 'http://xbrl.sec.gov/invest/2013-01-31',
 'stpr': 'http://xbrl.sec.gov/stpr/2011-01-31',
 'sic': 'http://xbrl.sec.gov/sic/2011-01-31',
 'naics': 'http://xbrl.sec.gov/naics/2011-01-31',
 'xbrldi': 'http://xbrl.org/2006/xbrldi',
 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
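The collision can be reproduced without bs4 or soupsieve. The sketch below is only an illustration: the `Namespaces` class here is a hypothetical stand-in that accepts keyword arguments, not soupsieve's actual class, and the dictionary is a two-entry excerpt of the one above.

```python
class Namespaces:
    """Hypothetical stand-in for a mapping type built from keyword arguments."""

    def __init__(self, **kwargs):
        self._d = dict(kwargs)


# Two entries taken from the namespace dictionary shown above.
namespaces = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    "self": "http://globalselfstorageinc.com/20170331",
}

error_message = None
try:
    # Unpacking turns the "self" key into a keyword argument named self,
    # which clashes with the instance that Python already binds to self.
    Namespaces(**namespaces)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message varies slightly between Python versions, but it reports multiple values for the argument 'self', as in the traceback above.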

I'm not sure whether this counts as a bug in soupsieve or whether I should handle it on my side. Feel free to suggest a solution.
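One possible stopgap on the caller's side, for versions that still unpack the namespaces as keyword arguments, is to pass select an explicit namespace dictionary with the offending prefix removed. This is a sketch under assumptions: `drop_prefixes` is a hypothetical helper, not part of bs4 or soupsieve, and dropping the prefix means selectors that use `self|...` would no longer resolve.

```python
def drop_prefixes(namespaces, reserved=("self",)):
    """Return a copy of the namespace mapping without the reserved prefixes.

    Hypothetical helper for illustration; it does not modify the input.
    """
    return {prefix: uri for prefix, uri in namespaces.items() if prefix not in reserved}


# Excerpt of the dictionary reported by soup._namespaces above.
doc_namespaces = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    "self": "http://globalselfstorageinc.com/20170331",
    "xlink": "http://www.w3.org/1999/xlink",
}

safe_namespaces = drop_prefixes(doc_namespaces)
print(sorted(safe_namespaces))

# Assumed usage with the soup object from the MWE:
# soup.select("identifier", namespaces=drop_prefixes(soup._namespaces))
```

Since the selector in the MWE does not use a namespace prefix, losing the `self` entry should not affect that particular query.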

facelessuser commented 3 years ago

Yup, this is a bug. I have a fix in #217. I'm not quite sure why I was using kwargs for this. I absolutely don't need it, especially if it can conflict with self.
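The general idea of avoiding kwargs can be sketched as follows. This is an illustration only and not necessarily the change made in #217: the `Namespaces` class below is a hypothetical stand-in that takes the whole mapping as a single positional argument, so no key can clash with a parameter name.

```python
class Namespaces:
    """Hypothetical stand-in that receives the mapping as one positional argument."""

    def __init__(self, mapping):
        # Copy the mapping so later changes by the caller do not leak in.
        self._d = dict(mapping)

    def __getitem__(self, prefix):
        return self._d[prefix]


namespaces = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    "self": "http://globalselfstorageinc.com/20170331",
}

# The dictionary is passed as-is, so "self" is just another key.
ns = Namespaces(namespaces)
print(ns["self"])
```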

After the fix your example script runs fine:

$soupsieve git:(master) ✗ python3 bug.py
[<identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier 
scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>, <identifier scheme="http://www.sec.gov/CIK">0001031235</identifier>]

facelessuser commented 3 years ago

Thanks for the bug report! I've tagged a new release 2.2.1. It should be available shortly.