annishereuniverse opened this issue 1 year ago
Hi @annishereuniverse,
You might want to use `json.dumps` in place of the line `print(json['data']) # DO something with your data`. If you want to export to a file, try `json.dump(data, file)` instead.
You can replace the last line like this and update the `scroll` method to take a `file` param:

```python
# start recursive scrolling
with open('output.json', 'w') as f:
    scroll(scroll_id=None, file=f)
```
Please use GitHub highlighted code blocks for better code readability.
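As a minimal runnable sketch of the `json.dump` suggestion (the sample `data` records and the `output.json` filename below are placeholders, not anything returned by the Lens API):

```python
import json

# Placeholder records standing in for response.json()['data']
data = [{"lens_id": "000-000-000-000-000", "doc_key": "EXAMPLE"}]

# Write the records to a file instead of printing them
with open('output.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

# Read them back to confirm the round trip
with open('output.json') as f:
    print(json.load(f) == data)  # True
```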
Hi @rosharma9, many thanks for your reply!
I ran into an issue when using `json.dump`. I guess it is because I used `json` as a variable name and that confuses Python, since I got this error:
`AttributeError: 'dict' object has no attribute 'dump'`
I tried to change the variable, but most of my code comes from the Lens.org sample and I have no idea how to change it. Could you please kindly suggest a way to fix it? BTW, I have already tried updating my Python and json versions, but it still doesn't work.
I also got an error for the following line:
`scroll(scroll_id=None, file=f)`
`TypeError: scroll() got an unexpected keyword argument 'file'`
With this, could you please let me know how to replace `file=f` in my code?
Here is what my entire code looks like (and thanks for pointing out the code blocks!):
```python
import requests
import time
import json
import itertools
import pandas as pd
import csv
import requests
import sys
import time
from pandas import json_normalize
import flatten_json as flatten

url = 'https://api.lens.org/patent/search'

# include fields
include = '''["biblio", "doc_key"]'''

# request body with scroll time of 1 minute
request_body = '''{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "class_cpc.symbol": "Y02E10/545"
                    }
                }
            ]
        }
    },
    "include": %s,
    "scroll": "1m"
}''' % include

headers = {'Authorization': 'secret', 'Content-Type': 'application/json'}

# Recursive function to scroll through paginated results
def scroll(scroll_id):
    if scroll_id is not None:
        global request_body
        request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)
    # make api request
    response = requests.post(url, data=request_body, headers=headers)
    # If rate-limited, wait for n seconds and retry the same scroll id
    # Since scroll time is 1 minute, it gives sufficient time to wait and proceed
    if response.status_code == requests.codes.too_many_requests:
        time.sleep(8)
        scroll(scroll_id)
    # If the response is not ok here, better to stop and debug
    elif response.status_code != requests.codes.ok:
        print(response.json())
    # If the response is ok, do something with it, take the new scroll id and iterate
    else:
        json = response.json()
        if json.get('results') is not None and json['results'] > 0:
            scroll_id = json['scroll_id']  # Extract the new scroll id from response
            json.dump('data', f, ensure_ascii=False, indent=4)  # DO something with your data
            scroll(scroll_id)

# start recursive scrolling
with open('output.json', 'w') as f:
    scroll(scroll_id=None, file=f)
```
Yes, you can either rename the variable `json` to something else, or alias the import as `import json as json_package` and use it as `json_package.dump(...)`.
Second, you need to add a `file` param to the scroll function, `def scroll(scroll_id, file)`, and use that file in your json dump: `json.dump(data, file)`.
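A minimal sketch combining both suggestions; the mock response dict below only stands in for `response.json()` and is not real API output:

```python
import json as json_package  # alias so a local variable named `json` cannot shadow the module

def scroll(scroll_id, file):
    # Mock payload standing in for response.json(); the real code would call the API here
    json = {"results": 1, "scroll_id": "abc", "data": [{"doc_key": "EXAMPLE"}]}
    if json.get('results') is not None and json['results'] > 0:
        # The aliased module still works even though `json` is now a plain dict
        json_package.dump(json['data'], file, ensure_ascii=False, indent=4)

with open('output.json', 'w') as f:
    scroll(scroll_id=None, file=f)
```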
@rosharma9 Thank you so much.
With great support from your guidance, I'm getting several steps forward in writing my code.
It's just that I don't know what to write inside `def scroll(scroll_id, file)`. I'm planning to do the following, but I got an error message:
```python
if scroll_id is not None:
    global request_body
    request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)
for entry in request_body:
    with open(file) as f:
        for line in f.readlines()
```
I would be really grateful if you could help me out... I am also wondering how to flatten the nested JSON, but I guess that wouldn't be relevant to `def scroll(scroll_id, file)`, right?
Many thanks in advance.
Here is my overall code for `def scroll(scroll_id, file)`; it gives the error message `TypeError: scroll() got an unexpected keyword argument 'file'`:
```python
# Recursive function to scroll through paginated results
def scroll(scroll_id):
    if scroll_id is not None:
        global request_body
        request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)
    for entry in request_body:
        with open(file) as f:
            for line in f.readlines()
    # make api request
    response = requests.post(url, data=request_body, headers=headers)
    # If rate-limited, wait for n seconds and retry the same scroll id
    # Since scroll time is 1 minute, it gives sufficient time to wait and proceed
    if response.status_code == requests.codes.too_many_requests:
        time.sleep(8)
        scroll(scroll_id)
    # If the response is not ok here, better to stop and debug
    elif response.status_code != requests.codes.ok:
        print(response.json())
    # If the response is ok, do something with it, take the new scroll id and iterate
    else:
        json = response.json()
        if json.get('results') is not None and json['results'] > 0:
            scroll_id = json['scroll_id']  # Extract the new scroll id from response
            json.dump('data', f, ensure_ascii=False, indent=4)  # DO something with your data
            scroll(scroll_id)
```
Here is a complete example for you. Let me know if it works; change the query part as required.
```python
import requests
import time
import json

url = 'https://api.lens.org/patent/search'

# include fields
include = '''["biblio", "doc_key"]'''

# request body with scroll time of 1 minute
request_body = '''{
    "query": {
        "terms": {
            "lens_id": ["031-156-664-516-153"]
        }
    },
    "include": %s,
    "scroll": "1m"
}''' % include

headers = {'Authorization': 'Bearer YOUR_TOKEN', 'Content-Type': 'application/json'}

# Recursive function to scroll through paginated results
def scroll(scroll_id, file):
    # Change the request_body to prepare for the next scroll api call
    # Make sure to append the include fields for a faster response
    if scroll_id is not None:
        global request_body
        request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)
    # make api request
    response = requests.post(url, data=request_body, headers=headers)
    # If rate-limited, wait for n seconds and retry the same scroll id
    # Since scroll time is 1 minute, it gives sufficient time to wait and proceed
    if response.status_code == requests.codes.too_many_requests:
        time.sleep(8)
        scroll(scroll_id)
    # If the response is not ok here, better to stop and debug
    elif response.status_code != requests.codes.ok:
        print(response.json())
    # If the response is ok, do something with it, take the new scroll id and iterate
    else:
        response = response.json()
        if response.get('results') is not None and response['results'] > 0:
            scroll_id = response['scroll_id']  # Extract the new scroll id from response
            json.dump(response['data'], file, ensure_ascii=False, indent=4)  # DO something with your data
            scroll(scroll_id, file)

# start recursive scrolling
with open('output.json', 'w') as f:
    scroll(scroll_id=None, file=f)
```
@rosharma9 Thank you very much for this! As I am still learning, the code example is really helpful. Much appreciated!
(1) :star:
At first, I ran the exact code you provided here and had no problem at all! I got a json file with 499 lines in it. :tada:
However, when I slightly modified the code to get my target patents, changing the term from `"lens_id": ["031-156-664-516-153"]` to `"class_cpc.symbol": ["Y02E10/545"]`, I got an error in the last few entries, as follows:
```
/usr/local/bin/python3 /Users/user/PycharmProjects/pythonProject7/main.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 49, in <module>
    scroll(scroll_id=None, file = f)
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 45, in scroll
    scroll(scroll_id, file)
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 45, in scroll
    scroll(scroll_id, file)
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 45, in scroll
    scroll(scroll_id, file)
  [Previous line repeated 9 more times]
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 35, in scroll
    scroll(scroll_id)
TypeError: scroll() missing 1 required positional argument: 'file'

Process finished with exit code 1
```
Even so, I still got a json file with 56814 lines from running this command.
I think the problem here is that a search using `"class_cpc.symbol": ["Y02E10/545"]` yields more results than the search for `"lens_id": ["031-156-664-516-153"]`. Because of this I got an error, but I am unsure what went wrong.
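If I read the traceback correctly, the failure comes from the rate-limit branch, where `scroll(scroll_id)` is called without the `file` argument; with more results, a 429 response eventually occurs and that call raises the TypeError. A minimal sketch of the suspected fix (the `retried` flag below only simulates a single rate-limited response and is not part of the real code):

```python
import time

def scroll(scroll_id, file, retried=False):
    if not retried:  # simulate one rate-limited (429) response
        time.sleep(0.01)
        scroll(scroll_id, file, retried=True)  # was: scroll(scroll_id) -> TypeError
    else:
        file.write('ok')  # stands in for json.dump(response['data'], file, ...)

with open('output.json', 'w') as f:
    scroll(None, f)
```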
(2) :star:
I have a follow-up question.
I am aware that I need to flatten the nested JSON data, or I might get a trailing data error and `json.decoder.JSONDecodeError`.
Using the json file with 499 lines that I got from example (1), I ran the command as follows. However, it fails, and I honestly have no clue how to do this.
```python
import csv
import pandas as pd
import json

data = json.loads(open('output.json').read())
print(len(data))  # got 1 !!! data is nested, should be flattened
print("Type:", type(data))  # got list

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        # If the nested key-value pair is of list type
        if type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

df = pd.DataFrame([flatten_json(data)])
print(len(df))  # still got 1
# flattening fails
```
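One observation about the helper above: it only recurses into lists, never into dicts, so a nested dict is stored as a single value and the frame stays at one wide row. A sketch of a version that walks both (the sample record is made up; real Lens fields will differ):

```python
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):    # recurse into nested dicts, joining keys with '_'
            for key, value in x.items():
                flatten(value, name + key + '_')
        elif isinstance(x, list):  # recurse into lists using positional keys
            for i, item in enumerate(x):
                flatten(item, name + str(i) + '_')
        else:                      # leaf value: store under the accumulated key
            out[name[:-1]] = x

    flatten(y)
    return out

nested = {"biblio": {"title": "Solar cell", "cpc": ["Y02E10/545"]}}
print(flatten_json(nested))
# {'biblio_title': 'Solar cell', 'biblio_cpc_0': 'Y02E10/545'}
```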
:exploding_head: And when I run the command `data = json.loads(open('output.json').read())` on the json file with 56814 lines that I got from question (1), it keeps sending back an error message:
```
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 5971 column 2 (char 216367)
```
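For context on that error: each scroll iteration calls `json.dump` on the same open file, so the output ends up holding several JSON documents back to back, and `json.loads` stops at the end of the first one and reports the rest as extra data. One way to read such a file is `json.JSONDecoder.raw_decode`, which parses one document at a time (the sample content below is made up):

```python
import json

# Simulate a file produced by several json.dump calls on the same handle
with open('output.json', 'w') as f:
    json.dump([{"doc_key": "A"}], f)
    json.dump([{"doc_key": "B"}], f)

def load_concatenated(path):
    """Parse a file holding several JSON documents back to back."""
    decoder = json.JSONDecoder()
    text = open(path).read()
    records, pos = [], 0
    while pos < len(text):
        doc, end = decoder.raw_decode(text, pos)  # parse one document, return its end offset
        records.extend(doc)
        pos = end
        while pos < len(text) and text[pos].isspace():
            pos += 1  # skip whitespace between documents
    return records

print(load_concatenated('output.json'))
# [{'doc_key': 'A'}, {'doc_key': 'B'}]
```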
I feel unable to solve this issue and would be really grateful if you could please give me a hint or an example.
:sunny: Again, I appreciate the invaluable advice and guidance you have given. Thank you for these helpful comments.
Hi all,
I have all my json data in my Python console and don't know how to proceed from there.
I think the challenge for me is exporting the API JSON response data.
After handling pagination, I can't seem to get "flatten the json data" working. It would be really helpful if you could provide some example code. I have tried multiple times but can't manage it; I feel like I'm almost there. Would you mind taking a look at my code:
```python
import requests
import time
import json
import itertools
import pandas as pd
import csv
import requests
import sys
import time
from pandas import json_normalize
import flatten_json as flatten

url = 'https://api.lens.org/patent/search'

include = '''["biblio", "doc_key"]'''
request_body = '''{
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "class_cpc.symbol": "Y02E10/70"
                    }
                }
            ]
        }
    },
    "include": %s,
    "scroll": "1m"
}''' % include
headers = {'Authorization': 'thisismysecret', 'Content-Type': 'application/json'}

def scroll(scroll_id):
    if scroll_id is not None:
        global request_body
        request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)
    response = requests.post(url, data=request_body, headers=headers)
    if response.status_code == requests.codes.too_many_requests:
        time.sleep(8)
        scroll(scroll_id)
    elif response.status_code != requests.codes.ok:
        print(response.json())
    else:
        json = response.json()
        if json.get('results') is not None and json['results'] > 0:
            scroll_id = json['scroll_id']  # Extract the new scroll id from response
            print(json['data'])  # DO something with your data
            scroll(scroll_id)

scroll(scroll_id=None)

data = []
for line in open('/Users/user/PycharmProjects/pythonProject1/sample.json', 'w'):
    data.append(json.loads(line))
df = json_normalize(data, meta_prefix=".", errors='ignore')
df[['lens_id', 'doc_key', 'biblio', 'application_reference', 'title',
    'classifications_cpc', 'references_cited']].head
```
*It may be the last ten lines that have problems. I just don't know how to flatten the nested fields. My selected fields are: 'lens_id', 'doc_key', 'biblio', 'application_reference', 'priority_claims', 'invention_title', 'classifications_cpc', 'references_cited'.
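A sketch of one way to flatten such records with `pandas.json_normalize`; the sample records and field names below are made up and will differ from the real Lens response:

```python
import pandas as pd

# Hypothetical records standing in for the API's response['data']
data = [
    {"lens_id": "000-1", "doc_key": "EX-1", "biblio": {"invention_title": [{"text": "Solar cell"}]}},
    {"lens_id": "000-2", "doc_key": "EX-2", "biblio": {"invention_title": [{"text": "PV module"}]}},
]

# sep='_' flattens nested dicts into columns such as biblio_invention_title;
# list-valued fields are kept as Python lists in their column
df = pd.json_normalize(data, sep='_')
print(df.columns.tolist())  # includes 'biblio_invention_title'
print(len(df))  # one row per record
```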
*By the way, Aaron, thank you for your help all the way!
The attached picture is just a mini manifesto illustrating how I struggled with having the json data printed in front of me but not managing to export it.