FDA / openfda

openFDA is an FDA project to provide open APIs, raw data downloads, documentation and examples, and a developer community for an important collection of FDA public datasets.
https://open.fda.gov
Creative Commons Zero v1.0 Universal

where can I find the underlying data store for this code project? #88

Closed: davidem00 closed this issue 6 years ago

davidem00 commented 6 years ago

Does this project just obtain the current JSON downloads from the live openFDA api, and then do something new with them?

I saw another issue that makes mention of some "underlying" XML data. Where does that reside?

violetcrestedwren commented 6 years ago

Hi @davidem00, this project creates the JSON-formatted data and serves it on the web through an API service (see https://api.fda.gov).
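For reference, a request against the hosted API can be put together roughly like this. It is only a sketch: the helper name `build_query_url` is made up for illustration, and the search expression is an assumed example rather than something taken from this repository.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.fda.gov"


def build_query_url(endpoint, search=None, limit=1):
    """Build a query URL for an openFDA endpoint such as "drug/label".

    Hypothetical helper for illustration; it only assembles the URL
    and does not perform any network request.
    """
    params = {"limit": limit}
    if search:
        params["search"] = search
    # urlencode percent-escapes characters such as ":" in the search value.
    return f"{API_ROOT}/{endpoint}.json?{urlencode(params)}"


# Assumed example search expression; adjust the field and value as needed.
url = build_query_url("drug/label", search="openfda.brand_name:aspirin", limit=1)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client to see the JSON that this project produces.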

The underlying data is pulled from a handful of web sources, which you can find in the various pipeline folders. Take one of the simpler examples, the Device Premarket Approval pipeline: the data source is on line 32, a web-hosted zip file managed by the FDA. The pipeline scrapes the website to download the zip file (lines 34-43), unzips and cleans the data (lines 45-59), maps the data and translates it into JSON (lines 61-83), and harmonizes it with the other device endpoints (lines 85-113). Finally, lines 116-122 load the data into Elasticsearch. All of these tasks are managed by Luigi, which is invoked on lines 125-126.
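To make the unzip, clean, and map-to-JSON stages more concrete, here is a rough standard-library sketch. It is not the pipeline's actual code: the file name `pma.txt`, the pipe delimiter, the column names, and all function names are assumptions for illustration. The download is replaced by an in-memory sample archive, and the harmonization, Elasticsearch loading, and Luigi orchestration steps are left out.

```python
import csv
import io
import json
import zipfile


def make_sample_zip():
    """Stand-in for the downloaded archive (file name and columns are assumed)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("pma.txt", "PMANUMBER|APPLICANT\nP000001| Example Corp \n")
    return buf.getvalue()


def extract_text(zip_bytes):
    """Unzip step: read the first member of the archive as text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]
        return zf.read(name).decode("utf-8")


def to_records(text):
    """Clean/map step: parse delimited rows, lowercase keys, strip values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [{k.lower(): v.strip() for k, v in row.items()} for row in reader]


def to_json_lines(records):
    """Translate step: one JSON document per record."""
    return [json.dumps(r, sort_keys=True) for r in records]


records = to_records(extract_text(make_sample_zip()))
for line in to_json_lines(records):
    print(line)
```

In the real pipeline each of these stages is a separate Luigi task, so a stage only reruns when its output is missing.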

Let me know if you have any further questions; I'll leave this issue open for a few days.

davidem00 commented 6 years ago

Ok, thanks. I'm specifically interested in data from the output of the /label endpoint.

Since I don't see a single pipeline for that, I'm guessing it's assembled from multiple sources to be presented in the form it is.

Which set of pipelines corresponds to that endpoint? (I don't need Events at this stage.)

While I realize it is peeking under the hood, I'm trying to see if understanding the underlying data will give me a better sense of how to parse out the results of the API that are clumped together into single values.

dkrylovsb commented 6 years ago

The pipeline that corresponds to the Drug Label endpoint is here: https://github.com/FDA/openfda/blob/master/openfda/spl/pipeline.py