dat-ecosystem / dat

:floppy_disk: peer-to-peer sharing & live synchronization of files via command line
https://dat.foundation

Astronomy use case: Trillian #172

Closed demitri closed 8 years ago

demitri commented 10 years ago

Trillian Data Needs

This document will describe the data needs in detail for the Trillian project. Trillian is an attempt to address the difficulty (or outright inability) of easily analyzing the hundreds of terabytes of publicly available astronomical data.

Introduction

Trillian is designed to be a computing engine for astronomical data, consisting of two basic parts. The first is the computational aspect, where users will create astrophysical models (in practice, Python code) that describe a particular object — a type of star, galaxy, etc. — which will then be applied to all available data. The result is a likelihood value assigned to each object analyzed based on how well the data matches the model. The second component is a distributed data network. No astronomy department or institution has the disk space to store the amount of data available, and even if they did, the bookkeeping and organization are well beyond the time and capabilities of most astronomers. This document will focus on describing the latter component and the nature of astronomical data.
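
As a rough illustration only (not an actual Trillian interface; the class, field names, and numbers below are made up), a model might be a small Python object that maps one object's multi-wavelength observations to a log-likelihood:

```python
# Hypothetical sketch of a user-supplied Trillian "model".
import math

class WhiteDwarfModel:
    """Toy model: scores how consistent an object's photometry is with an
    assumed white-dwarf colour. Purely illustrative."""

    expected_color = 0.1   # assumed g-r colour of the model population
    sigma = 0.2            # assumed measurement scatter

    def evaluate(self, observations):
        """Return a log-likelihood for one object.

        `observations` is assumed to be a dict of survey measurements,
        e.g. {"sdss_g": 18.2, "sdss_r": 18.1, "wise_w1": 17.5}.
        """
        color = observations["sdss_g"] - observations["sdss_r"]
        norm = math.log(self.sigma * math.sqrt(2 * math.pi))
        # Gaussian log-likelihood of the observed colour under the model
        return -0.5 * ((color - self.expected_color) / self.sigma) ** 2 - norm
```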

Astronomy Data is Multi-Wavelength

People are familiar with the idea of looking at something to study it, but if you are using only your eyes you see just a very narrow part of the electromagnetic spectrum. I appear very different when someone looks at me with their eyes (visible), with a pair of night-vision glasses (infrared), or takes an x-ray photo of me. My iPhone still works if I completely cover it with my body because I am transparent to both radio and WiFi signals. This is how astronomers understand the universe; we observe things in multiple wavelengths. Dust in the galaxy blocks optical light, but the longer wavelength infrared light passes right through, allowing us to see beyond. Further, the wavelength corresponds to temperature – an object emitting light at short wavelengths (e.g. gamma rays, x-rays) is far hotter than one emitting long wavelengths (e.g. radio waves). By studying objects in different wavelengths, we are actually studying different physical processes.

Telescopes, satellites, or other astronomy detectors typically operate in a single wavelength or a very narrow range (compared to the spectrum). Consequently, data releases from a survey cover one or a few wavelengths. To fully understand a particular object (e.g. a star, a planet, a galaxy), one wants to collect as many observations covering as many wavelengths as possible. Currently, this means going to several web sites where the data is available, each with a very different interface (IF there is a web interface and not just files!), each with a different structure. Manually collating these observations is tedious and time consuming, and doing this for hundreds of thousands of objects is nearly impossible. This is the problem we want to solve.

Astronomical Data Formats

Astronomy data is typically found in one of two formats: flat (ASCII) files or FITS format. If I give an astronomer an image taken from a telescope, alone it’s almost worthless. She would need to know the position on the sky, the exposure time, the location of the telescope, the instrument used, the wavelength, etc. Rather than keep this metadata in a separate file from the image data, it’s kept in a header associated with the image in the same file. This header is simply a collection of key-value pairs. Together, the image and the header form a header data unit (HDU). The data may take the form of an image or a table (up to 999 columns). Finally, a FITS file may contain any number of HDUs.
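
For reference, this is roughly what reading those HDUs looks like with the astropy library (not part of the original post; the file name and keywords below are only examples):

```python
# Inspecting a FITS file's header/data units (HDUs) with astropy.
from astropy.io import fits

with fits.open("frame-r-001000-1-0027.fits") as hdul:
    hdul.info()                      # list every HDU in the file
    header = hdul[0].header          # key-value metadata for the primary HDU
    print(header.get("EXPTIME"))     # e.g. exposure time, if present
    print(header.get("TELESCOP"))    # telescope name, if present
    image = hdul[0].data             # the image pixels as a NumPy array
```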

A data release from a survey will either be a large collection of ASCII table files or FITS files, and there is no standard or common convention for the number of HDUs in a file, although a small number of header keywords are standardized. This is necessary due to the complexity of the data, but it makes organizing the data more difficult. Data releases are typically the result of many years of observation and analysis. The larger surveys (e.g. SDSS) will create a new data release every few years, and this completely supersedes the last one (though it’s useful to at least keep identifiers that people use from older releases). Some surveys’ releases are incremental (e.g. Hubble), where the intervening time was spent observing completely new objects. It is uncommon for surveys to release data in small, frequent doses; astronomical data can, to first order, be treated as large, unchanging data sets.

How Trillian Will Organize Data

Data releases, as noted above, primarily cover one or a few wavelengths. If one wants to combine as many observed wavelengths of a single object as possible, it’s not efficient to store the data that way. There is a system called HEALPix that divides a sphere into equal-sized pixels. This is preferred over something like a longitude/latitude system, where the area covered by one degree in longitude varies with latitude. Trillian will take a single HEALPix pixel and collect all available information located at that position on the sky from each data set. Let’s call this a Pixel (with a capital ‘P’) for now.
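
For example, mapping a sky position to a HEALPix pixel index is a one-liner with the healpy library; the resolution (NSIDE) below is an arbitrary choice for illustration, not an actual Trillian parameter:

```python
# Map an (RA, Dec) position to the HEALPix pixel that would hold its data.
import healpy as hp

nside = 64                                     # 12 * nside**2 equal-area pixels over the sky
ra, dec = 150.1, 2.2                           # degrees
pixel = hp.ang2pix(nside, ra, dec, lonlat=True, nest=True)
print(pixel)                                   # every survey's data at this position
                                               # would be grouped under this index
```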

The data storage in Trillian will be distributed; this is how we will scale. For example, imagine we have a central server at OSU in Ohio. A server at NYU in New York has 10TB to offer. Trillian will determine how many Pixels can be stored in that space, assemble them, and place them there; this is now a storage node. A PostgreSQL database on the central server keeps track of all the Pixels. Each node will also have a PostgreSQL server to manage the data there.
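
A minimal sketch of that central bookkeeping, assuming PostgreSQL and made-up table and column names (this is not an actual Trillian schema):

```python
# Two illustrative tables: storage nodes, and which node holds each Pixel.
import psycopg2

schema = """
CREATE TABLE IF NOT EXISTS node (
    id          serial PRIMARY KEY,
    hostname    text NOT NULL,        -- e.g. the NYU storage node
    capacity_gb integer NOT NULL
);
CREATE TABLE IF NOT EXISTS pixel (
    healpix_id bigint PRIMARY KEY,    -- HEALPix index of this Pixel
    node_id    integer NOT NULL REFERENCES node (id),
    size_gb    real                   -- space the assembled Pixel occupies
);
"""

conn = psycopg2.connect("dbname=trillian")     # hypothetical database name
with conn, conn.cursor() as cur:
    cur.execute(schema)
```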

Data Access

The central server will want to retrieve all of the information available for a given object in the sky. Based on the position, it will know which Pixel the data is in and which node stores it. We want to get this data to feed to a program to analyze it (apply the model). There are two scenarios: either the storage node is also a compute node and the model will be sent to the node, or the compute node is elsewhere and will have to retrieve the data. We would like to implement an API such that a message can be sent to the storage node to have it retrieve the data. It shouldn’t matter to the system whether the data is on a remote node or local – it will just be a call to the API, where the location is just a parameter.
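
A sketch of that location-transparent call; the endpoint, host name, and function name are hypothetical, not an existing API:

```python
# Ask the central server for everything known about a sky position.
import requests

def fetch_observations(ra, dec):
    """The caller never needs to know which storage node holds the Pixel;
    the central server resolves position -> Pixel -> node and proxies or
    redirects the request."""
    resp = requests.get(
        "https://trillian.example.org/api/observations",   # hypothetical endpoint
        params={"ra": ra, "dec": dec},
    )
    resp.raise_for_status()
    return resp.json()

data = fetch_observations(150.1, 2.2)
```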

Some of the data will be in a tabular format, which is probably best kept in a database to allow for complex queries. However, some data will be in the FITS format. Some of that can be fully loaded into a database (the schema of course being more than a single table). However, there is no benefit to loading images into a database – they cannot be searched on. The API, though, may need to open an image, extract some array of pixels, and return that. There will then be a need for translators that know the particular details of the file and data format and can extract the data on demand.
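
One way such a translator could pull a pixel array out of a FITS image on demand, using astropy (illustrative only, not Trillian code):

```python
# Extract a small cutout around a sky position from a FITS image.
from astropy.io import fits
from astropy.wcs import WCS
from astropy.nddata import Cutout2D
from astropy.coordinates import SkyCoord
import astropy.units as u

def extract_cutout(path, ra, dec, size_arcmin=2.0):
    with fits.open(path) as hdul:
        hdu = hdul[0]
        wcs = WCS(hdu.header)                    # sky <-> pixel mapping from the header
        position = SkyCoord(ra, dec, unit="deg")
        cutout = Cutout2D(hdu.data, position, size_arcmin * u.arcmin,
                          wcs=wcs, copy=True)    # copy so the array survives file close
    return cutout.data                           # small NumPy array to hand back via the API
```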

Additional information

thadguidry commented 10 years ago

"To fully understand a particular object (e.g. a star, a planet, a galaxy), one wants to collect as many observations covering as many wavelengths as possible. "

How can dat help solve this? Automatic collection from various sites with various data file formats sounds like a coordination problem at the initial stages. Getting observations in a standard format, and getting community / publishers agreement is a challenge. How do you propose to solve that first step, despite dat?

demitri commented 10 years ago

What we’re exploring is how DAT might be a good platform for data access.

Automatic collection from various sites with various data file formats sounds like a coordination problem at the initial stages.

True, and I don't expect DAT to solve that problem. The first step is just getting data from as many sources as possible into one place (or at least making it appear that it's all in one place). I don't expect the data collection to be automatic; astronomy data releases are sporadic and reasonably unique. They can be organized into a coherent collection, but that will be manual. Considering this is “relatively” rare (less than annually per major survey, sometimes only once ever), this is not unreasonable. I expect that from DAT’s point of view the data will be available, structured, and unchanging.

Getting observations in a standard format, and getting community / publishers agreement is a challenge.

I don’t expect the producers of the data to deliver it in any particular format; I’m happy for Trillian to take what we get and structure it ourselves.

Where I see DAT playing a role is in data access. The compute engine will request all observations of, say, a particular galaxy, which will require some kind of transport protocol/middleware/API/something. If this is built for the compute engine, there's no reason why it can't be exposed directly to the public as well. As new observations are added to Trillian, the data can be "updated".

thadguidry commented 10 years ago

Where / what websites do astronomers use to publish / push their existing data in different wavelengths now? What are these other sites and surveys beyond what you gave in your example data sets?

demitri commented 10 years ago

There are no sites that astronomers regularly push their data to. A data release is making the data available; it's up to the astronomer to download it and do something with it. "Making the data available" can mean providing a web form or even just ASCII or FITS files.

There are sites that try to pull data together, but they are not complete and there are difficulties in working with them. For example, if you want to look up thousands (or more) of objects, the interfaces don't really work. Again, they provide files (or simple tables), and it's up to the astronomer to do anything else with them. Some to look at:

Those are major, well-funded sites, and for the most part the exceptions in terms of data interfaces. But there's a long tail of satellite and ground-based catalogs available, including ones from other countries, e.g. GALEX, Herschel, Chandra, Tycho-2, Spitzer, AKARI, UCAC4, GLIMPSE, SMOG, Cyg-X, Vela-Carina, Planck, ... dozens and dozens. Some cover the whole sky; most do not.

The aim of Trillian is to gather the most scientifically useful catalogs together (usually the largest!) to compute models against. This is something astronomers can't really do against this much data on an all-sky basis now.

max-mapper commented 10 years ago

edit: accidentally hit Submit halfway through writing this, so ignore the initial email

Summary of our Google Hangout earlier:

Initial raw data sources/collections/scans:

SDSS

Would probably do the initial prototype using a single "stripe" instead of the entire sky

WISE

http://irsa.ipac.caltech.edu/ibe/cutouts.html
http://wise2.ipac.caltech.edu/docs/release/allsky/expsup/sec2_2a.html
http://irsadist.ipac.caltech.edu/wise-allwise/
http://irsa.ipac.caltech.edu/ibe/data/wise/merge/merge_p1bm_frm/
http://irsa.ipac.caltech.edu/ibe/data/wise/merge/merge_p1bm_frm/0a/00720a/001/00720a001-art-w1-D.tbl

2MASS

Probably the least complex of the 3 initial raw data sources

Overview

The Trillian project wants to build a query engine that can take in some query parameters, e.g. position, spectrum properties (e.g. color) and receive back data that is either

This query engine will be backed by a PostgreSQL db that will need to be populated with metadata from all of the different raw data sources.

A sub-component of the query engine will be a "cutout" service that operates on FITS files. It will take the input parameters plus a list of FITS files that contain matching data, and combine that group of FITS files into a single FITS file (a 'cutout'). This will mostly involve spawning a CLI tool and passing multiple FITS filenames to it along with other arguments.
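
A minimal sketch of that spawning step, where `fitscombine` is a placeholder name rather than a real CLI tool:

```python
# Spawn a (hypothetical) command-line tool over the matched FITS files.
import subprocess

def make_cutout(matched_files, output_path, ra, dec, size):
    cmd = ["fitscombine",                 # placeholder tool name
           "--ra", str(ra), "--dec", str(dec),
           "--size", str(size),
           "--output", output_path,
           *matched_files]
    subprocess.run(cmd, check=True)       # raises if the tool exits non-zero
    return output_path
```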

The main goal of using dat in this system would be to normalize/standardize the raw data access layer.

Where dat can help

Indexing raw data and serving binary data to Trillian

We should make a dat database for each collection. We would write a data indexer/importer script that imports all of the files into dat as remote blobs (meaning the files themselves aren't stored in dat, only a link to the remote file). If the raw data is available over a protocol other than HTTP we can configure these dat dbs to use the appropriate blob store backend module to correctly access the remote blobs.
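
A hedged sketch of the indexer/importer idea: emit one metadata row per remote file, pointing at the blob rather than copying it. The row fields and output file name here are assumptions, and how the rows are then loaded into dat (CLI vs. API) is left open:

```python
# Build newline-delimited JSON metadata rows that reference remote blobs.
import json

BASE = "http://irsa.ipac.caltech.edu/ibe/data/wise/merge/merge_p1bm_frm/"

def rows_for_listing(relative_paths):
    for rel in relative_paths:
        yield {
            "key": rel,                 # unique key for the row
            "blob_url": BASE + rel,     # remote location; the file is not copied
            "survey": "WISE",
        }

with open("wise-rows.ndjson", "w") as out:
    for row in rows_for_listing(["0a/00720a/001/00720a001-art-w1-D.tbl"]):
        out.write(json.dumps(row) + "\n")
```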

We should probably first start with a subset "stripe" of SDSS.

Feeding data into the Trillian PostgreSQL DB

We should figure out the best way to keep the PostgreSQL DB up to date with the latest data from dat. This might be a sort of 'push' mechanism that automatically inserts data into PSQL from dat, or some other option. More investigation is needed here, but the goal would be to remove the need to write custom data import scripts for the PostgreSQL DB and instead just have it be 'subscribed' or 'linked' to a dat db. That way when we add new raw data collections we just have to import them into dat and then the PostgreSQL DB will automatically be able to start using them.
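
One possible shape for that 'push', assuming changed rows can be exported from dat as newline-delimited JSON and that a simple catalog table already exists on the PostgreSQL side (both are assumptions, not a worked-out design):

```python
# Upsert exported rows into PostgreSQL so the DB tracks the dat collection.
import json
import psycopg2

conn = psycopg2.connect("dbname=trillian")        # hypothetical database name
with conn, conn.cursor() as cur, open("changes.ndjson") as changes:
    for line in changes:
        row = json.loads(line)
        cur.execute(
            """
            INSERT INTO source_catalog (key, survey, blob_url)
            VALUES (%s, %s, %s)
            ON CONFLICT (key) DO UPDATE SET blob_url = EXCLUDED.blob_url
            """,
            (row["key"], row["survey"], row["blob_url"]),
        )
```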

Open questions

Other random links:

https://github.com/itpmngt/FITS
https://github.com/astrojs/fitsjs
https://github.com/datproject/gasket
https://github.com/maxogden/dat/issues/172
http://irsa.ipac.caltech.edu/ibe/data/wise/merge/merge_p1bm_frm/0a/00720a/001/00720a001-art-w1-D.tbl
https://code.google.com/p/q3c/
http://api.sdss3.org/spectrum?id=boss.3840.55574.029.v5_4_45&format=json
bzip2 -dc wise-allsky-cat-part01.bz2 | sed 's/|$//' | psql --command "COPY wise.psc FROM stdin WITH DELIMITER '|' NULL AS ''"
https://gist.github.com/mafintosh/0248b745e927c4102351
http://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec2_1a.html
http://irsadist.ipac.caltech.edu/wise-allwise/wise-allwise-cat-schema.txt

max-mapper commented 10 years ago

@demitri Hey Demitri!

Just wanted to let you know we have a new member of our team, @ywyw, who will be working with me on normalizing the FITS data from 2MASS, WISE and SDSS. We are getting started this week and will probably have questions for you soon!

joehand commented 8 years ago

This issue was moved to datproject/discussions#49