cambialens / lens-api-doc


How to export the json data from the Python console while flattening the nested fields #58

Open annishereuniverse opened 1 year ago

annishereuniverse commented 1 year ago

Hi all,

I've got all my json data in my Python console and don't know how to export it from there.

I think the challenge for me is to export the API JSON response data.

After handling pagination, I can't seem to get "flatten the json data" working. It would be really helpful if you could provide some example code. I've tried multiple times but can't manage to get it right, even though it feels like I'm almost there. Would you mind taking a look at my code:

import requests
import time
import json
import itertools
import pandas as pd
import csv
import requests
import sys
import time
from pandas import json_normalize
import flatten_json as flatten

url = 'https://api.lens.org/patent/search'

include = '''["biblio", "doc_key"]'''
request_body = '''{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "class_cpc.symbol": "Y02E10/70"
          }
        }
      ]
    }
  },
  "include": %s,
  "scroll": "1m"
}''' % include
headers = {'Authorization': 'thisismysecret', 'Content-Type': 'application/json'}

def scroll(scroll_id):
  if scroll_id is not None:
    global request_body
    request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)

  response = requests.post(url, data=request_body, headers=headers)

  if response.status_code == requests.codes.too_many_requests:
    time.sleep(8)
    scroll(scroll_id)
  elif response.status_code != requests.codes.ok:
    print(response.json())
  else:
    json = response.json()
    if json.get('results') is not None and json['results'] > 0:
        scroll_id = json['scroll_id'] # Extract the new scroll id from response
        print(json['data']) #DO something with your data
        scroll(scroll_id)

scroll(scroll_id=None)

data = []
for line in open('/Users/user/PycharmProjects/pythonProject1/sample.json', 'w'):
    data.append(json.loads(line))
df = json_normalize(data, meta_prefix=".", errors='ignore')
df[['lens_id', 'doc_key', 'biblio', 'application_reference', 'title',
    'classifications_cpc', 'references_cited']].head

*It may be that the last few lines are the problem. I just don't know how to flatten the nested fields. The fields I want to keep are: 'lens_id', 'doc_key', 'biblio', 'application_reference', 'priority_claims', 'invention_title', 'classifications_cpc', 'references_cited'

*By the way, Aaron, thank you for your help all the way!

The attached picture just illustrates how I've been struggling with this: having the json data printed in front of me but not managing to export it.

Screenshot 2022-10-29 at 22 40 29
rosharma9 commented 1 year ago

Hi @annishereuniverse, you might want to use json.dumps to replace the line print(json['data']) #DO something with your data. If you want to export to a file, try json.dump(data, file) instead. You can replace the last line like this and update the scroll method to take a file param:

# start recursive scrolling
with open('output.json', 'w') as f:
    scroll(scroll_id=None, file=f)

Please use GitHub highlighted code blocks for better code readability.
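For reference, here is a tiny standalone illustration of the difference between the two calls (a made-up record, not your actual Lens response):

import json

record = {"lens_id": "000-000-000-000-000", "title": "toy example"}  # made-up data, just for illustration

# json.dumps returns a JSON-formatted string, handy for printing or logging
print(json.dumps(record, ensure_ascii=False, indent=2))

# json.dump writes the JSON straight into an open file object
with open('single_record.json', 'w') as out:
    json.dump(record, out, ensure_ascii=False, indent=2)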

annishereuniverse commented 1 year ago

Hi @rosharma9 Many thanks for your reply!

I ran into an issue when using json.dump. I guess it's because I used "json" as a variable name and that confuses Python, as I got this error:

AttributeError: 'dict' object has no attribute 'dump'

I tried to rename the variable, but since most of my code comes from the Lens.org sample, I have no idea what to change. Could you please suggest a way to fix it? BTW, I've already tried updating my Python and json versions, but it still doesn't work.

I also got an error for the following line:

scroll(scroll_id=None, file=f)

#TypeError: scroll() got an unexpected keyword argument 'file'

Given this, could you please let me know how to handle "file=f" in my code?

Here is what my entire code looks like, and thanks for pointing out the code blocks!

import requests
import time
import json
import itertools
import pandas as pd
import csv
import requests
import sys
import time
from pandas import json_normalize
import flatten_json as flatten

url = 'https://api.lens.org/patent/search'
# include fields
include = '''["biblio", "doc_key"]'''
# request body with scroll time of 1 minute
request_body = '''{
  "query": {
     "bool": {
         "must": [
             {
                "match": {
                   "class_cpc.symbol": "Y02E10/545"
                } 
             } 
         ] 
     }
  },
"include": %s,
  "scroll": "1m"
}''' % include
headers = {'Authorization': 'secret', 'Content-Type': 'application/json'}

# Recursive function to scroll through paginated results
def scroll(scroll_id):
  if scroll_id is not None:
    global request_body
    request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)

  # make api request
  response = requests.post(url, data=request_body, headers=headers)

  # If rate-limited, wait for n seconds and proceed with the same scroll id
  # Since the scroll time is 1 minute, it will give sufficient time to wait and proceed
  if response.status_code == requests.codes.too_many_requests:
    time.sleep(8)
    scroll(scroll_id)
  # If the response is not ok here, better to stop here and debug it
  elif response.status_code != requests.codes.ok:
    print(response.json())
  # If the response is ok, do something with the response, take the new scroll id and iterate
  else:
    json = response.json()
    if json.get('results') is not None and json['results'] > 0:
        scroll_id = json['scroll_id'] # Extract the new scroll id from response
        json.dump('data', f, ensure_ascii=False, indent=4) #DO something with your data
        scroll(scroll_id)

# start recursive scrolling
with open('output.json', 'w') as f:
    scroll(scroll_id=None, file=f)
rosharma9 commented 1 year ago

Yes, you can either rename the variable json to something else or alias the import as import json as json_package and use it as json_package.dump(...). Second, you need to add a file param to the scroll function, def scroll(scroll_id, file), and use the file in your json dump: json.dump(data, file).
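Something along these lines, as a toy, self-contained sketch (the data and the write_page name are made up, just to show the alias and the file parameter):

import json as json_package  # alias the module so a variable named "json" cannot shadow it

# made-up stand-in for one page of parsed API results
page = {"results": 1, "scroll_id": "abc", "data": [{"lens_id": "dummy"}]}

def write_page(page, file):
    # "file" is the open file handle passed in from the with-block below
    json = page  # even with a local variable called json, the module is still reachable via the alias
    json_package.dump(json["data"], file, ensure_ascii=False, indent=4)

with open('demo_output.json', 'w') as f:
    write_page(page, f)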

annishereuniverse commented 1 year ago

@rosharma9 Thank you so much.

With your guidance I'm getting several steps closer to finishing my code.

It's just that I don't know what to write inside def scroll(scroll_id, file).

I was planning to do the following, but got an error message:

 if scroll_id is not None:
    global request_body
    request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)
    for entry in request_body:
    with open(file) as f:
            for line in f.readlines()

I would be really grateful if you could help me out... I'm also wondering how to flatten the nested JSON, but I guess that isn't relevant to def scroll(scroll_id, file), right?

Many thanks in advance.

Here is my overall code for def scroll(scroll_id, file), which gives the error message #TypeError: scroll() got an unexpected keyword argument 'file':

# Recursive function to scroll through paginated results
def scroll(scroll_id):
  if scroll_id is not None:
    global request_body
    request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)
   for entry in request_body:
    with open(file) as f:
            for line in f.readlines()

  # make api request
  response = requests.post(url, data=request_body, headers=headers)

  # If rate-limited, wait for n seconds and proceed with the same scroll id
  # Since the scroll time is 1 minute, it will give sufficient time to wait and proceed
  if response.status_code == requests.codes.too_many_requests:
    time.sleep(8)
    scroll(scroll_id)
  # If the response is not ok here, better to stop here and debug it
  elif response.status_code != requests.codes.ok:
    print(response.json())
  # If the response is ok, do something with the response, take the new scroll id and iterate
  else:
    json = response.json()
    if json.get('results') is not None and json['results'] > 0:
        scroll_id = json['scroll_id'] # Extract the new scroll id from response
        json.dump('data', f, ensure_ascii=False, indent=4) #DO something with your data
        scroll(scroll_id)
rosharma9 commented 1 year ago

Here is a complete example for you. Let me know if it works. Change the query part as required.

import requests
import time
import json
url = 'https://api.lens.org/patent/search'

# include fields
include = '''["biblio", "doc_key"]'''
# request body with scroll time of 1 minute
request_body = '''{
  "query": {
      "terms":  {
          "lens_id": ["031-156-664-516-153"]
      }
  },
  "include": %s,
  "scroll": "1m"
}''' % include
headers = {'Authorization': 'Bearer YOUR_TOKEN', 'Content-Type': 'application/json'}

# Recursive function to scroll through paginated results
def scroll(scroll_id, file):
  # Change the request_body to prepare for next scroll api call
  # Make sure to append the include fields to make faster response
  if scroll_id is not None:
    global request_body
    request_body = '''{"scroll_id": "%s", "include": %s}''' % (scroll_id, include)

  # make api request
  response = requests.post(url, data=request_body, headers=headers) 

  # If rate-limited, wait for n seconds and proceed with the same scroll id
  # Since the scroll time is 1 minute, it will give sufficient time to wait and proceed
  if response.status_code == requests.codes.too_many_requests:
    time.sleep(8)
    scroll(scroll_id)
  # If the response is not ok here, better to stop here and debug it
  elif response.status_code != requests.codes.ok:
    print(response.json())
  # If the response is ok, do something with the response, take the new scroll id and iterate
  else:
    response = response.json()
    if response.get('results') is not None and response['results'] > 0:
        scroll_id = response['scroll_id'] # Extract the new scroll id from response
        json.dump(response['data'], file, ensure_ascii=False, indent=4) #DO something with your data
        scroll(scroll_id, file)

# start recursive scrolling
with open('output.json', 'w') as f:
  scroll(scroll_id=None, file = f)
annishereuniverse commented 1 year ago

@rosharma9 Thank you very much for this! As I am still learning, the code example is really helpful. Much appreciated!

(1) :star:

At first, I ran the exact same code that you provided here and had no problem at all! I got a json file with 499 lines in it. :tada: However, when I slightly modified the code to get my target patents, changing the term from "lens_id": ["031-156-664-516-153"] to "class_cpc.symbol": ["Y02E10/545"], I got an error on the last few entries, as follows:

/usr/local/bin/python3 /Users/user/PycharmProjects/pythonProject7/main.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 49, in <module>
    scroll(scroll_id=None, file = f)
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 45, in scroll
    scroll(scroll_id, file)
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 45, in scroll
    scroll(scroll_id, file)
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 45, in scroll
    scroll(scroll_id, file)
  [Previous line repeated 9 more times]
  File "/Users/user/PycharmProjects/pythonProject7/main.py", line 35, in scroll
    scroll(scroll_id)

TypeError: scroll() missing 1 required positional argument: 'file'

Process finished with exit code 1

Even so, I still got a json file with 56814 lines from running this. I think the problem is that a search using "class_cpc.symbol": ["Y02E10/545"] yields many more results than the search for "lens_id": ["031-156-664-516-153"]. Because of this I hit the error, but I'm unsure of what exactly went wrong.
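Reading the traceback again, my guess is that the error comes from the rate-limited branch, where scroll(scroll_id) is called without the file argument. Here is a tiny made-up reproduction that gives me the same error, which makes me think that's it:

# minimal reproduction (not the real API code)
def scroll(scroll_id, file):
    if scroll_id == "again":
        scroll(scroll_id)  # this recursive call forgets the required file argument
    print("ok", scroll_id, file)

scroll("again", "output.json")
# TypeError: scroll() missing 1 required positional argument: 'file'

If that's right, I suppose changing that call to scroll(scroll_id, file) would fix it, but please correct me if I'm wrong.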

(2) :star: I have a follow-up question. I'm aware that I need to flatten the nested JSON data, or I might get a trailing data error and json.decoder.JSONDecodeError.

Using the json file with 499 lines that I got from example 1, I ran the commands below. However, it fails, and I honestly have no clue how to do this.

import csv
import pandas as pd
import json

data = json.loads(open('output.json').read())

print(len(data)) # got 1 !!! the data is nested and should be flattened

print("Type:", type(data)) #got list

def flatten_json(y):
    out = {}

    def flatten(x, name=''):

        # If the value is a list,
        # recurse into each element
        if type(x) is list:

            i = 0

            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

df = pd.DataFrame([flatten_json(data)])

print(len(df)) #still got 1

# flattening fails
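I also notice my flatten helper only recurses into lists and never into dicts, so nested dicts like biblio are not actually broken apart. This is my guess at what I might need instead (adapted from examples I found, so I'm not sure it fits the Lens structure):

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # recurse into nested dicts, joining keys with "_"
            for key, value in x.items():
                flatten(value, name + key + '_')
        elif isinstance(x, list):
            # recurse into lists, using the index as part of the key
            for i, item in enumerate(x):
                flatten(item, name + str(i) + '_')
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

# quick check with made-up nested data
print(flatten_json({"biblio": {"invention_title": [{"text": "demo"}]}}))
# -> {'biblio_invention_title_0_text': 'demo'}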

:exploding_head: And when I run the command data = json.loads(open('output.json').read()) on the json file with 56814 lines that I got from question 1, it kept sending back this error message:

raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 5971 column 2 (char 216367)
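If I had to guess, the 56814-line file contains several JSON arrays written back to back, one per scroll call, so json.loads stops after the first one and complains about the extra data. This sketch is how I might try reading them one at a time (I'm not sure it's the right approach):

import json

decoder = json.JSONDecoder()

with open('output.json') as f:
    text = f.read()

records = []
pos = 0
while pos < len(text):
    # skip whitespace between the concatenated JSON documents
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        break
    # raw_decode parses one JSON document and returns where it ended
    obj, pos = decoder.raw_decode(text, pos)
    records.extend(obj)  # each document should be a list of patent records

print(len(records))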

I feel unable to solve this issue on my own and would be really grateful if you could give me a hint or an example.

:sunny: Again, I appreciate your invaluable advice and guidance. Thank you for these helpful comments.