
Data loaders
MIT License

Data loaders for UA SRC

This repo contains the code for processing data for the UA Superfund Research Center (UA SRC) project. This includes data for Garden Roots, as well as data collected for the collaborative projects between UA and UC San Diego and between UA and the Colorado School of Mines.

CyVerse path for data: /iplant/home/rwalls/ua-src-data

Note: the data on CyVerse are only available to project personnel.

Data Processing SOP

  1. Gather raw data
    • Store data exactly as originally downloaded in the shared CyVerse folder, under the use case name, in a ‘raw-data’ subfolder.
    • Add an entry for each file to the readme in each raw-data folder. It must include a link to the data source and describe any manipulations done after download (e.g., converting from an Excel file to CSV).
    • Include a data dictionary defining variables if needed
  2. Preprocess data
    • Make data “tidy” by converting to CSV, with a single table per sheet and a single header row.
    • Each dataset is processed separately using the to_scrutinizer.py script in the corresponding directory of this repo (a preprocessing sketch appears after this list).
    • Standardize column headers and map them to ontology templates. The output is a CSV file labeled "scrutinizer.csv". Store on CyVerse under ‘pre-processed’.
  3. Load data into relational MySQL DB (Central Scrutinizer)
  4. From scrutinizer, push data to Mongo DB (steps 3 and 4 are sketched together after this list)
    • Used to feed the preliminary API
    • This step will go away once the pipeline is running, replaced by step 5
  5. Run data through Ontology Data Pipeline
    • The triplifier converts data to graph format (TTL files)
    • Output enhanced datasets to CSV using SPARQL queries (see the SPARQL export sketch after this list); store on CyVerse
    • Output data into repository format (JSON for each dataset)
    • Output data to DB format (Mongo, relational, or Elastic)
  6. Serve data from the DB via an API (a minimal API sketch appears after this list)
    • Portals pull data using the API
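
The preprocessing in step 2 is dataset-specific, but the general shape is the same. Below is a minimal sketch of what a to_scrutinizer.py script might do, assuming pandas; the column names, header mapping, and long-format reshape are hypothetical placeholders, and each real script encodes its own mapping.

```python
# A minimal sketch of step-2 preprocessing, assuming pandas.
# HEADER_MAP and the column names are hypothetical; the real mapping
# lives in each dataset's to_scrutinizer.py.
import pandas as pd

# Hypothetical mapping from raw headers to standardized, ontology-aligned ones.
HEADER_MAP = {
    "Sample ID": "sample_id",
    "As (ppb)": "arsenic_concentration",
    "Collection Date": "date_collected",
}

def to_scrutinizer(raw_xlsx: str, out_csv: str = "scrutinizer.csv") -> None:
    """Convert one raw Excel sheet to a tidy, single-header-row CSV."""
    df = pd.read_excel(raw_xlsx, sheet_name=0, header=0)  # one table per sheet
    df = df.rename(columns=HEADER_MAP)                    # standardize headers
    df = df.dropna(how="all")                             # drop fully empty rows
    # Reshape to one measurement per row (a guess at the scrutinizer layout).
    tidy = df.melt(id_vars=["sample_id"], var_name="variable", value_name="value")
    tidy.to_csv(out_csv, index=False)

if __name__ == "__main__":
    to_scrutinizer("raw-data/example_dataset.xlsx")
```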
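Steps 3 and 4 move the same records into two stores. The following hedged sketch loads scrutinizer.csv into MySQL and mirrors it into MongoDB; the table name, columns, credentials, and database names are all assumptions, not the project's actual configuration.

```python
# A hedged sketch of steps 3-4: load the scrutinizer CSV into MySQL
# (Central Scrutinizer), then mirror the records into MongoDB to feed
# the preliminary API. Connection details and schema are hypothetical.
import csv
import pymysql
from pymongo import MongoClient

def load_and_push(csv_path: str = "scrutinizer.csv") -> None:
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Step 3: insert into the relational DB (hypothetical table/columns).
    conn = pymysql.connect(host="localhost", user="scrutinizer",
                           password="...", database="scrutinizer")
    with conn.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO measurements (sample_id, variable, value) "
                "VALUES (%s, %s, %s)",
                (row["sample_id"], row["variable"], row["value"]),
            )
    conn.commit()
    conn.close()

    # Step 4: mirror the same records into MongoDB for the preliminary API.
    client = MongoClient("mongodb://localhost:27017")
    client["ua_src"]["measurements"].insert_many(rows)

if __name__ == "__main__":
    load_and_push()
```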
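Step 5's enhanced-dataset export can be illustrated with rdflib: load a TTL file produced by the triplifier, run a SPARQL query, and write the bindings to CSV. The file name, query, and predicate IRIs below are hypothetical stand-ins for the pipeline's actual ontology terms.

```python
# A minimal sketch of the SPARQL-to-CSV export in step 5, assuming rdflib.
# The predicate IRIs are hypothetical; real triples come from the
# Ontology Data Pipeline's triplifier.
import csv
from rdflib import Graph

QUERY = """
SELECT ?sample ?value
WHERE {
    ?obs <http://example.org/hasSample> ?sample ;
         <http://example.org/hasValue> ?value .
}
"""

def export_enhanced(ttl_path: str, out_csv: str) -> None:
    g = Graph()
    g.parse(ttl_path, format="turtle")   # load triplifier output
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sample", "value"])
        for sample, value in g.query(QUERY):
            writer.writerow([str(sample), str(value)])

if __name__ == "__main__":
    export_enhanced("dataset.ttl", "enhanced.csv")
```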
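For step 6, the sketch below shows one way an API could serve records from the Mongo collection so the portals can pull them. The framework (Flask), endpoint path, and collection name are assumptions; the project's actual API may differ.

```python
# A hedged sketch of step 6: a tiny API serving records from the DB.
# Framework, route, and collection name are assumptions.
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
collection = MongoClient("mongodb://localhost:27017")["ua_src"]["measurements"]

@app.route("/measurements/<sample_id>")
def get_measurements(sample_id: str):
    # Exclude Mongo's internal _id so the records serialize cleanly to JSON.
    docs = list(collection.find({"sample_id": sample_id}, {"_id": 0}))
    return jsonify(docs)

if __name__ == "__main__":
    app.run(debug=True)
```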