Kotzly / DS4H_Course

MIT License

Downloading DATASUS data #1

Closed Kotzly closed 3 years ago

Kotzly commented 3 years ago

DATASUS provides information that can support objective analyses of the health situation, evidence-based decision making, and the development of health action programs.

DATASUS is an online platform that provides access to public health data from Brazil. The goal of this issue is to explain how to access, download and pre-process the database files.

Kotzly commented 3 years ago

Online access

It is possible to access the Monitoring Panels to create tables and visualizations that are almost analysis-ready. These tables contain the selected information and in most cases have different options for aggregating information.

Another option is TABNET, where the information is also available online, with many options for aggregating, selecting and filtering data. The following images show where to access this information, and an example of how the data is presented.

image

image

TABNET data can be downloaded as .csv, or as .tab to be loaded by TABWIN.
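A downloaded .csv export can also be read directly in Python. The following is only a minimal sketch under assumptions that are not stated in this issue: that the export is semicolon-separated, that a couple of title lines come before the header row, and that real files are Latin-1 encoded. The sample text, the column names and the `read_tabnet_csv` helper are hypothetical, so check your own file and adjust the parameters.

```python
import csv
import io

# Hypothetical sample that mimics a TABNET .csv export (made-up numbers).
SAMPLE = (
    "Example title line\n"
    "Example subtitle line\n"
    '"Region";"2017";"2018"\n'
    '"North";10;12\n'
    '"South";20;25\n'
)


def read_tabnet_csv(text_stream, title_lines=2, delimiter=";"):
    """Return the data rows of a TABNET-style export as a list of dicts.

    title_lines and delimiter are assumptions about the export layout.
    """
    for _ in range(title_lines):
        next(text_stream)  # skip the title lines above the header row
    return list(csv.DictReader(text_stream, delimiter=delimiter))


# For a real file (assumed encoding):
#   with open(path, encoding="latin-1", newline="") as f:
#       rows = read_tabnet_csv(f)
rows = read_tabnet_csv(io.StringIO(SAMPLE))
for row in rows:
    print(row["Region"], row["2018"])
```

The same choices map onto `pandas.read_csv` through its `sep`, `skiprows` and `encoding` parameters if you prefer a dataframe.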

Downloading

This data can also be downloaded in a rawer format, at this link or by accessing the TabWin option on the DataSUS website. This will load the File Transfer page. On this page you can download all types of data, their documentation (with data dictionaries) and more.

Downloading TABWIN

In this example we won't use TABWIN itself, but we will use the dbf2dbc.exe program that comes with it. These are the steps to download it:

Follow the same steps, but select "Documentação" instead of "Programas" to download TABWIN's documentation.

Downloading the data

On this same page you can download the data:

Example: image

Using dbf2dbc

The files come in .dbc format, which appears to be a compressed database format. First we need to decompress these files. This can be done using the dbf2dbc.exe program, which comes with TABWIN or can be downloaded here. This program will decompress each .dbc file to the .dbf format, which can be loaded with TABWIN or with Python using the simpledbf package. More information about the dbf2dbc tool can be found here.

To use the dbf2dbc tool, first unzip the data that you downloaded to a folder. This folder can contain one or many .dbc files. Let's suppose this folder is C:\Users\Joao\Documents\data. Now open a command line (cmd or PowerShell), go to the folder where dbf2dbc.exe is located and run the command:

dbf2dbc.exe "C:\Users\Joao\Documents\data\*.dbc" "C:\Users\Joao\Documents\data"

This will decompress the .dbc files to .dbf files and save them in the same folder. You can change the second argument to save the files elsewhere.
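If you would rather trigger the conversion from a script, the command above can be wrapped with Python's subprocess module. This is only a sketch: the `build_dbf2dbc_command` and `run_dbf2dbc` helpers are hypothetical names, and it assumes that dbf2dbc.exe accepts the same two arguments shown above and expands the `*.dbc` wildcard itself. Actually running it requires Windows and the executable on disk.

```python
import subprocess
from pathlib import PureWindowsPath


def build_dbf2dbc_command(exe_path, data_dir, out_dir=None):
    """Build the argument list for dbf2dbc.exe (same shape as the command above)."""
    data_dir = PureWindowsPath(data_dir)
    # By default, write the .dbf files next to the .dbc files.
    out_dir = PureWindowsPath(out_dir) if out_dir else data_dir
    return [str(exe_path), str(data_dir / "*.dbc"), str(out_dir)]


def run_dbf2dbc(exe_path, data_dir, out_dir=None):
    """Run the converter and raise CalledProcessError if it exits with an error."""
    command = build_dbf2dbc_command(exe_path, data_dir, out_dir)
    return subprocess.run(command, check=True)


# Only builds and prints the command; call run_dbf2dbc(...) to execute it.
command = build_dbf2dbc_command("dbf2dbc.exe", r"C:\Users\Joao\Documents\data")
print(command)
```

Passing the arguments as a list avoids having to quote paths that contain spaces.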

The .dbf files can be loaded within TABWIN, or you can load them into Python using the simpledbf package. The following sample code is in Python: it loads the C:/Users/Joao/data/DNSP2018.dbf file, converts it to a pandas dataframe, and also saves it in csv format in the same folder.

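import pandas as pd  # must be installed: simpledbf uses it to build the dataframe
from simpledbf import Dbf5

# Path to one of the .dbf files produced by dbf2dbc
filepath = "C:/Users/Joao/data/DNSP2018.dbf"
dbf = Dbf5(filepath)

# Read all records into a pandas dataframe
df = dbf.to_dataframe()

# Save the dataframe as csv with pandas, so the .dbf file is only read once
df.to_csv("C:/Users/Joao/data/DNSP2018.csv", index=False)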
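
When the folder holds many .dbf files, the same idea can be applied in a loop. The sketch below is an illustration rather than a tested recipe: `csv_path_for` and `convert_folder` are hypothetical helper names, simpledbf is imported inside the function so the rest of the script runs even where it is not installed, and the `*.dbf` pattern may need adjusting if your files use an uppercase extension.

```python
from pathlib import Path


def csv_path_for(dbf_path):
    """Return the .csv path that sits next to the given .dbf file."""
    return Path(dbf_path).with_suffix(".csv")


def convert_folder(folder):
    """Convert every .dbf file in folder to .csv and return the paths written."""
    from simpledbf import Dbf5  # third-party package, imported lazily

    written = []
    for dbf_path in sorted(Path(folder).glob("*.dbf")):
        # A fresh Dbf5 object is created for each file.
        df = Dbf5(str(dbf_path)).to_dataframe()
        out_path = csv_path_for(dbf_path)
        df.to_csv(out_path, index=False)
        written.append(out_path)
    return written


# Shows the output name derived for one input file.
print(csv_path_for("data/DNSP2018.dbf"))
```

Calling `convert_folder("C:/Users/Joao/data")` would then write one .csv per .dbf in that folder.
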
Kotzly commented 3 years ago

More documentation and information about TABNET and TABWIN can be found here.