westurner opened this issue 5 years ago
I am trying to pull data from https://opendata.nhsbsa.net/dataset/english-prescribing-dataset-epd-with-snomed-code/resource/374ee7ac-fd8e-4c3f-b7a9-6ea27cc16d63. The site provides an API to pull the data, which is a large dataset of roughly 17 million records. Below is the code I am running:
```python
import pandas as pd
import requests

page_size = 70000  # rows per request; the original misleadingly named this "offset"
for i in range(0, 17_000_000, page_size):
    # offset=i (not i+1): the API's offset counts rows to skip, so i+1 skips row 0
    url = ('https://opendata.nhsbsa.net/api/3/action/datastore_search'
           f'?offset={i}&resource_id=EPD_SNOMED_202109&limit={page_size}')
    r = requests.get(url).json()
    df = pd.DataFrame(r['result']['records'])
    # write the header only for the first page, then append
    df.to_csv('data_pull.csv', mode='a', header=(i == 0), index=False)
```
The above code takes more than 3 hours to run and also produces duplicate rows, even though there are no duplicates in the source data.

Please provide a better answer to the question below:

https://stackoverflow.com/questions/70209859/web-scaping-in-python-for-large-data-set-from-api

Suggestions for a better library or process to do this are needed.
Is there a good reference, or a ckanapi function, for reading datasets from a CKAN instance into pandas and/or dask and/or xarray?
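I don't believe ckanapi ships a DataFrame reader, but a paginated `datastore_search` loop through `ckanapi.RemoteCKAN` is a minimal sketch of one. `datastore_to_df` is a hypothetical helper name, and `sort='_id'` assumes the datastore's autogenerated `_id` column; without an explicit sort, the database may return pages in an unstable order, which would explain the duplicates reported above.

```python
import ckanapi
import pandas as pd

def datastore_to_df(ckan_url, resource_id, page_size=10_000):
    """Page through a CKAN datastore resource and return one DataFrame."""
    ckan = ckanapi.RemoteCKAN(ckan_url)
    frames = []
    offset = 0
    while True:
        # sort='_id' pins a stable row order across pages; without it the
        # database is free to return rows in any order between requests,
        # which can duplicate or drop rows across page boundaries.
        result = ckan.action.datastore_search(
            resource_id=resource_id, limit=page_size,
            offset=offset, sort='_id')
        records = result['records']
        if not records:
            break
        frames.append(pd.DataFrame(records))
        offset += page_size
    return pd.concat(frames, ignore_index=True)

df = datastore_to_df('https://opendata.nhsbsa.net', 'EPD_SNOMED_202109')
```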
### Pandas

```python
pandas.read_json("https://url.to/dataset.json")
```
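Note that `read_json` on a `datastore_search` URL would parse the whole response envelope rather than a table, since the rows are nested under `result.records`. A sketch that flattens the nested records (the resource id and limit are taken from the question above, for illustration only):

```python
import pandas as pd
import requests

# datastore_search wraps rows in result.records, so the records need to be
# extracted before building a flat table.
url = ("https://opendata.nhsbsa.net/api/3/action/datastore_search"
       "?resource_id=EPD_SNOMED_202109&limit=100")
resp = requests.get(url).json()
df = pd.json_normalize(resp["result"]["records"])
```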
### Dask

```python
dask.dataframe.read_json("https://url.to/dataset.json")
```
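For a table of this size, reading the resource's CSV download in parallel partitions may be simpler than paging the JSON API. A sketch, where the URL is a placeholder for the real download link:

```python
import dask.dataframe as dd

# blocksize controls partition size. Splitting a remote file this way
# assumes the server supports HTTP range requests; if not, download the
# file locally first and read it from disk.
ddf = dd.read_csv("https://url.to/dataset.csv", blocksize="64MB")
print(len(ddf))  # triggers the parallel read
```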
### xarray
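As far as I know, xarray has no CKAN reader; the usual route would be pandas first, then `xarray.Dataset.from_dataframe` once the table is in memory. A sketch with a toy frame standing in for the EPD records:

```python
import pandas as pd
import xarray as xr

# Toy two-row frame; the column names are placeholders, not real EPD fields.
df = pd.DataFrame({"practice": ["A", "B"], "items": [10, 20]})
ds = xr.Dataset.from_dataframe(df.set_index("practice"))
```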
### Caching
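One option (a suggestion, not something ckanapi provides) is the third-party requests-cache package, which transparently caches GET responses so a long, flaky pull does not re-download pages that already succeeded. A sketch:

```python
import requests
import requests_cache

# Caches GET responses in a local SQLite file ("ckan_cache.sqlite");
# cached entries expire after a day.
requests_cache.install_cache("ckan_cache", expire_after=86400)

r = requests.get(
    "https://opendata.nhsbsa.net/api/3/action/datastore_search"
    "?resource_id=EPD_SNOMED_202109&limit=100")
```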