westurner opened this issue 5 years ago
I am trying to pull data from https://opendata.nhsbsa.net/dataset/english-prescribing-dataset-epd-with-snomed-code/resource/374ee7ac-fd8e-4c3f-b7a9-6ea27cc16d63. The site provides an API to pull the data, which is a large dataset of roughly 17 million records. Below is the code I am running:
```python
import pandas as pd
import requests

page_size = 70000  # rows per request; the original misleadingly named this "offset"
for i in range(0, 17_000_000, page_size):
    # offset=i (not i+1): the API's offset counts rows to skip, so i+1 skips row 0
    url = ('https://opendata.nhsbsa.net/api/3/action/datastore_search'
           f'?offset={i}&resource_id=EPD_SNOMED_202109&limit={page_size}')
    r = requests.get(url).json()
    df = pd.DataFrame(r['result']['records'])
    # write the header only for the first page, then append
    df.to_csv('data_pull.csv', mode='a', header=(i == 0), index=False)
```
The above code takes more than 3 hours to run and also produces duplicate rows, even though there are no duplicates in the source data.

Please provide a better answer to the question below:

https://stackoverflow.com/questions/70209859/web-scaping-in-python-for-large-data-set-from-api

Suggestions for a better library or process to do this are needed.
Is there a good reference, or a ckanapi function, for reading datasets from a CKAN instance into pandas and/or dask and/or xarray?
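I don't believe ckanapi ships a DataFrame reader, but a paginated `datastore_search` loop through `ckanapi.RemoteCKAN` is a minimal sketch of one. `datastore_to_df` is a hypothetical helper name, and `sort='_id'` assumes the datastore's autogenerated `_id` column; without an explicit sort, the database may return pages in an unstable order, which would explain the duplicates reported above.

```python
import ckanapi
import pandas as pd

def datastore_to_df(ckan_url, resource_id, page_size=10_000):
    """Page through a CKAN datastore resource and return one DataFrame."""
    ckan = ckanapi.RemoteCKAN(ckan_url)
    frames = []
    offset = 0
    while True:
        # sort='_id' pins a stable row order across pages; without it the
        # database is free to return rows in any order between requests,
        # which can duplicate or drop rows across page boundaries.
        result = ckan.action.datastore_search(
            resource_id=resource_id, limit=page_size,
            offset=offset, sort='_id')
        records = result['records']
        if not records:
            break
        frames.append(pd.DataFrame(records))
        offset += page_size
    return pd.concat(frames, ignore_index=True)

df = datastore_to_df('https://opendata.nhsbsa.net', 'EPD_SNOMED_202109')
```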
### Pandas

```python
pandas.read_json("https://url.to/dataset.json")
```
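Note that `read_json` on a `datastore_search` URL would parse the whole response envelope rather than a table, since the rows are nested under `result.records`. A sketch that flattens the nested records (the resource id and limit are taken from the question above, for illustration only):

```python
import pandas as pd
import requests

# datastore_search wraps rows in result.records, so the records need to be
# extracted before building a flat table.
url = ("https://opendata.nhsbsa.net/api/3/action/datastore_search"
       "?resource_id=EPD_SNOMED_202109&limit=100")
resp = requests.get(url).json()
df = pd.json_normalize(resp["result"]["records"])
```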
### Dask

```python
dask.dataframe.read_json("https://url.to/dataset.json")
```
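For a table of this size, reading the resource's CSV download in parallel partitions may be simpler than paging the JSON API. A sketch, where the URL is a placeholder for the real download link:

```python
import dask.dataframe as dd

# blocksize controls partition size. Splitting a remote file this way
# assumes the server supports HTTP range requests; if not, download the
# file locally first and read it from disk.
ddf = dd.read_csv("https://url.to/dataset.csv", blocksize="64MB")
print(len(ddf))  # triggers the parallel read
```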
### xarray
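As far as I know, xarray has no CKAN reader; the usual route would be pandas first, then `xarray.Dataset.from_dataframe` once the table is in memory. A sketch with a toy frame standing in for the EPD records:

```python
import pandas as pd
import xarray as xr

# Toy two-row frame; the column names are placeholders, not real EPD fields.
df = pd.DataFrame({"practice": ["A", "B"], "items": [10, 20]})
ds = xr.Dataset.from_dataframe(df.set_index("practice"))
```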
### Caching
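One option (a suggestion, not something ckanapi provides) is the third-party requests-cache package, which transparently caches GET responses so a long, flaky pull does not re-download pages that already succeeded. A sketch:

```python
import requests
import requests_cache

# Caches GET responses in a local SQLite file ("ckan_cache.sqlite");
# cached entries expire after a day.
requests_cache.install_cache("ckan_cache", expire_after=86400)

r = requests.get(
    "https://opendata.nhsbsa.net/api/3/action/datastore_search"
    "?resource_id=EPD_SNOMED_202109&limit=100")
```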