daneads / pypatent

Search for and retrieve US Patent and Trademark Office Patent Data
GNU General Public License v3.0

AttributeError: 'NoneType' object has no attribute 'find_next' #6

Open random1717 opened 5 years ago

random1717 commented 5 years ago

Error when running your example:

pypatent.Search('TTL/(tennis AND (racquet OR racket))')

AttributeError                            Traceback (most recent call last)
<ipython-input-2-a7c0dc5b3207> in <module>
----> 1 pypatent.Search('TTL/(tennis AND (racquet OR racket))')

/usr/local/lib/python3.7/site-packages/pypatent/__init__.py in __init__(self, string, results_limit, get_patent_details, pn, isd, ttl, abst, aclm, spec, ccl, cpc, cpcl, icl, apn, apd, apt, govt, fmid, parn, rlap, rlfd, prir, prad, pct, ptad, pt3d, pppd, reis, rpaf, afff, afft, in_, ic, is_, icn, aanm, aaci, aast, aaco, aaat, lrep, an, ac, as_, acn, exp, exa, ref, fref, oref, cofc, reex, ptab, sec, ilrn, ilrd, ilpd, ilfd)
    245         r = requests.get(url, headers=Constants.request_header).text
    246         s = BeautifulSoup(r, 'html.parser')
--> 247         total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())
    248 
    249         patents = self.get_patents_from_results_url(url, limit=results_limit)

AttributeError: 'NoneType' object has no attribute 'find_next'

codypilot commented 5 years ago

Just ran into this issue as well. The problem lies within the URL formatting, specifically line 232's replace method which changes spaces to hyphens. An easy fix is to remove that replace method and ensure that multi-word terms have escaped quotes, such as: pypatent.Search(an="\"hoffmann la roche\"", spec="diagnostics", results_limit=1).as_list()
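As a rough illustration of the point about URL formatting, here is a minimal sketch that builds the search URL with the standard library instead of rewriting spaces by hand. The helper name `build_search_url` is hypothetical and not part of pypatent; the endpoint and parameter names are copied from the search URL quoted elsewhere in this thread, and the `AN/` field prefix in the second call is an assumption about the query syntax.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are copied from the URL quoted in this thread.
BASE_URL = "http://patft.uspto.gov/netacgi/nph-Parser"


def build_search_url(query: str, page_size: int = 50) -> str:
    """Hypothetical helper: build a search URL without altering the query text."""
    params = [
        ("Sect1", "PTO2"),
        ("Sect2", "HITOFF"),
        ("u", "/netahtml/PTO/search-adv.htm"),
        ("r", "0"),
        ("p", "1"),
        ("f", "S"),
        ("l", str(page_size)),
        ("Query", query),
        ("d", "PTXT"),
    ]
    # urlencode() uses quote_plus() by default: spaces become "+", and
    # characters such as "/", "(", ")" and '"' are percent-encoded.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("TTL/(tennis AND (racquet OR racket))"))
# Assumption: "AN/" is the field prefix behind the an= keyword argument.
print(build_search_url('AN/"hoffmann la roche"'))
```

With this approach a multi-word term keeps its spaces (encoded as `+`) and its quotes (encoded as `%22`), so nothing is turned into hyphens.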

daneads commented 5 years ago

This is related to the issue I've been having as well. The problem is: Javascript is now enforced on the search site.

If you look at the failing requests and print the text of the results page, you will see this:

There is no page content, which thus throws an error.

Looking for a workaround.
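Until a workaround exists, the failure could at least be made easier to diagnose. The following is a minimal sketch, not pypatent's code: it swaps the BeautifulSoup lookup for a plain regular expression over the tag-stripped text and raises a descriptive error when the "out of N" counter is missing. The names `ResultsPageError` and `parse_total_results` are made up for this example.

```python
import re


class ResultsPageError(RuntimeError):
    """Raised when the results page lacks the expected 'out of N' hit counter."""


def parse_total_results(html: str) -> int:
    # Replace tags with spaces so markup between "out of" and the number
    # (for example a <strong> element) does not get in the way.
    text = re.sub(r"<[^>]+>", " ", html)
    match = re.search(r"out of\s+(\d+)", text)
    if match is None:
        # This is the situation that currently surfaces as
        # "'NoneType' object has no attribute 'find_next'".
        raise ResultsPageError(
            "no 'out of N' counter found in a response of %d characters"
            % len(html)
        )
    return int(match.group(1))


print(parse_total_results("Hits 1 through 50 out of <strong>378</strong>"))
```

A caller can then catch `ResultsPageError` and log the raw response rather than crashing with an `AttributeError`.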

codypilot commented 5 years ago

Selenium may be a good alternative but it'd certainly be slower/have more overhead

amotl commented 5 years ago

Hi there,

thanks @daneads for conceiving and maintaining this great library. I'm looking forward to using it from PatZilla, which might also spark your interest.

Introduction

Today, when trying to find an answer to https://github.com/ip-tools/uspto-opendata-python/issues/2, I gave pypatent a try and had the same issue:

>>> import pypatent
>>> pypatent.Search('TTL/(tennis AND (racquet OR racket))')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/dev/sources/uspto-pbd/.venv3/lib/python3.7/site-packages/pypatent/__init__.py", line 247, in __init__
    total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())
AttributeError: 'NoneType' object has no attribute 'find_next'

Investigation

After investigating a bit, I found the response body of the request to http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&p=1&f=S&l=50&Query=TTL%2F%28tennis+AND+%28racquet+OR+racket%29%29&d=PTXT to be valid HTML without any Javascript obfuscation and - as it does contain the phrase "Hits 1 through 50 out of 378" - it actually should be parseable.

I verified this detail by requesting the URL using non-Javascript capable clients like curl and HTTPie.

Runtime error

However, I can confirm the code

s = BeautifulSoup(r, 'html.parser')
total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())

currently still fails on that response.

With kind regards, Andreas.

Outlook

P.S.: If Javascript obfuscation like /TSPD/08a752ce24ab200072a9cd92ec33dd5eff668cb1017860a8b5fb68de1351a3b1958ef77169637fb8?type=7 is still an issue, please let me know, as I might be able to come up with more detailed information about the specific obfuscation mechanism that might be in use there. Been there, seen that... ;]

Background:

The problem is: Javascript is now enforced on the search site.

This is obviously not always the case. It may look that way, but the respective Javascript obfuscation is in fact optional and depends on the country the request originates from.

amotl commented 5 years ago

Just wanted to let you know that running this code on the Python REPL prompt works perfectly fine for me

>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup

>>> r = requests.get('http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&p=1&f=S&l=50&Query=TTL%2F%28tennis+AND+%28racquet+OR+racket%29%29&d=PTXT', headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'})
>>> s = BeautifulSoup(r.text, 'html.parser')
>>> int(s.find(string=re.compile('out of')).find_next().text.strip())
378

while

>>> import pypatent
>>> pypatent.Search('TTL/(tennis AND (racquet OR racket))')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amo/dev/elmyra/sources/uspto-pbd/.venv3/lib/python3.7/site-packages/pypatent/__init__.py", line 247, in __init__
    total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())
AttributeError: 'NoneType' object has no attribute 'find_next'

still fails.

Bummer. Currently, I'm clueless about the root cause of this as I was expecting to essentially run the same code through both variants here.
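One way to narrow this down might be to compare the URL pypatent actually requests with the hand-written one, parameter by parameter. The sketch below assumes the library's URL has been captured somehow (for example by logging it), which is not shown here. The helper `diff_query_params` is hypothetical, and the two shortened URLs are made up; the hyphenated one only illustrates the space-to-hyphen replacement mentioned earlier in this thread.

```python
from urllib.parse import parse_qsl, urlsplit


def diff_query_params(url_a: str, url_b: str) -> dict:
    """Return {name: (value_in_a, value_in_b)} for parameters that differ."""
    a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up, shortened URLs for illustration only.
handwritten = "http://example.invalid/search?Query=tennis+AND+racket&d=PTXT"
from_library = "http://example.invalid/search?Query=tennis-AND-racket&d=PTXT"
print(diff_query_params(handwritten, from_library))
```

If the URLs turn out to be identical, the request headers would be the next thing to compare.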

daneads commented 5 years ago

@codypilot I'd say the best route now is to implement some sort of headless browser via Selenium. It is a pain and tricky to install, but it would get around this JS issue.

amotl commented 5 years ago

Dear @daneads,

thanks for following up on this. I have some thoughts about this I would like to share with you.

Investigating the problem further

Do you still hit the *wall the USPTO apparently has employed recently? I still experience flawless direct access from Germany. To investigate this further, may I humbly ask you to run a curl command like the one outlined at [1] and tell me about its output and the country your request might have originated from?

the respective Javascript obfuscation is in fact optional and depends on the origin (country) where the request has been issued from.

Been there, seen that

I have been there already with other resources published by organizations from the field of intellectual property and found out many details about the protection mechanism, which reveals itself through a script tag like this one:

<script type="text/javascript" src="/TSPD/08a752ce24ab200072a9cd92ec33dd5eff668cb1017860a8b5fb68de1351a3b1958ef77169637fb8?type=7"></script>
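A response could be checked for this marker before any parsing is attempted. The sketch below is only a heuristic based on the single script tag shown above; real challenge pages may look different, and the function name `looks_like_js_challenge` is made up for this example.

```python
import re

# Heuristic: a <script> element whose src points below /TSPD/.
TSPD_SCRIPT = re.compile(
    r"<script[^>]+src=[\"']/TSPD/[^\"']*[\"']",
    re.IGNORECASE,
)


def looks_like_js_challenge(html: str) -> bool:
    """Return True if the page references a /TSPD/ script (assumed marker)."""
    return TSPD_SCRIPT.search(html) is not None


challenge = (
    '<script type="text/javascript" '
    'src="/TSPD/08a752ce24ab200072a9cd92ec33dd5eff668cb1017860a8b5fb68de1351a3b1958ef77169637fb8?type=7">'
    "</script>"
)
print(looks_like_js_challenge(challenge))
print(looks_like_js_challenge("Hits 1 through 50 out of <strong>378</strong>"))
```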

Solution

to implement some sort of headless browser via Selenium

Right. When hitting that wall recently elsewhere and analyzing some of its details, I figured that would be the only viable solution. Coming from that, there's a Python implementation based on Marionette in my toolbox now which might be about 95% finished already. Please let me know if you would be interested in that to be added to pypatent.

With kind regards, Andreas.

[1] https://gist.github.com/amotl/bc99f3a3b7cd77c19475f74cfcbee999
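Since a headless browser is slower and has more overhead, one possible design is to try the plain HTTP client first and fall back to the browser only when the response looks blocked. The sketch below shows just that control flow. The two fetchers are stand-ins returning canned strings, `fetch_with_fallback` and `default_is_blocked` are hypothetical names, and an actual Selenium or Marionette fetcher is outside the scope of this example.

```python
from typing import Callable, Iterable


def default_is_blocked(html: str) -> bool:
    # Assumption: an empty body or a "/TSPD/" reference means we were blocked.
    return not html.strip() or "/TSPD/" in html


def fetch_with_fallback(
    url: str,
    fetchers: Iterable[Callable[[str], str]],
    is_blocked: Callable[[str], bool] = default_is_blocked,
) -> str:
    """Try each fetcher in order and return the first usable page."""
    for fetch in fetchers:
        html = fetch(url)
        if not is_blocked(html):
            return html
    raise RuntimeError("all fetchers returned a blocked or empty page for " + url)


# Stand-ins for real clients; they return canned responses.
def plain_http(url: str) -> str:
    return '<script src="/TSPD/abc?type=7"></script>'


def headless_browser(url: str) -> str:
    return "Hits 1 through 50 out of <strong>378</strong>"


print(fetch_with_fallback("http://example.invalid/search", [plain_http, headless_browser]))
```

Users who are not affected by the obfuscation would never pay the cost of starting a browser with this ordering.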

random1717 commented 5 years ago

Any update on this one?