flatironinstitute / flathub

A simple elasticsearch frontend for serving astrophysical simulation catalog data
http://astrosims.flatironinstitute.org/
Apache License 2.0

Convenient way to dump the entire catalogue #124

Closed DebajyotiS closed 6 months ago

DebajyotiS commented 6 months ago

Hello all, I am trying to download the Gaia DR3 release onto my cluster using flathub, but I keep running into memory issues. Here's my Python code snippet.

import flathub
import pandas as pd
from time import perf_counter  # wall-clock timer; process_time would not count time spent waiting on the network

gaiadr3 = flathub.Catalog("gaiadr3")
count = gaiadr3.count()
print(f"[--] Downloading {count} objects")

start = perf_counter()
dat = gaiadr3.numpy(fields=[
    "source_id", "ra", "ra_error", "dec", "dec_error", "l", "b",
    "parallax", "parallax_error", "pmra", "pmra_error", "pmdec", "pmdec_error",
    "phot_g_mean_mag", "phot_bp_mean_mag", "phot_rp_mean_mag",
    "bp_rp", "bp_g", "g_rp", "radial_velocity", "radial_velocity_error",
])
total_time = perf_counter() - start
print(f"[--] Data downloaded in {total_time:.2f} seconds")

pd.DataFrame(dat).to_hdf("gaiadr3.h5", key="df")

This runs into:

[--] Downloading 1811709771 objects
Traceback (most recent call last):
  File "/srv/beegfs/scratch/users/s/senguptd/gaiadump/download.py", line 8, in <module>

  File "/home/users/s/senguptd/.local/lib/python3.10/site-packages/flathub-2.0-py3.10.egg/flathub/client.py", line 379, in numpy
  File "/opt/conda/lib/python3.10/site-packages/numpy/lib/npyio.py", line 432, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "/opt/conda/lib/python3.10/site-packages/numpy/lib/format.py", line 790, in read_array
    array = numpy.fromfile(fp, dtype=dtype, count=count)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 196. GiB for an array with shape (1811709771,) and data type [('source_id', '<i8'), ('ra', '<f8'), ('ra_error', '<f4'), ('dec', '<f8'), ('dec_error', '<f4'), ('l', '<f8'), ('b', '<f8'), ('parallax', '<f8'), ('parallax_error', '<f4'), ('pmra', '<f8'), ('pmra_error', '<f4'), ('pmdec', '<f8'), ('pmdec_error', '<f4'), ('phot_g_mean_mag', '<f4'), ('phot_bp_mean_mag', '<f4'), ('phot_rp_mean_mag', '<f4'), ('bp_rp', '<f4'), ('bp_g', '<f4'), ('g_rp', '<f4'), ('radial_velocity', '<f4'), ('radial_velocity_error', '<f4')]

Is there a way to do this in chunks while making sure I get all of the 1.8+ billion sources?
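Whether the server can deliver the catalog chunk-wise is the open question in this issue, but the client-side bookkeeping is straightforward: append each chunk to one output file instead of holding all 1.8 billion rows in memory. A minimal sketch, assuming some iterable of NumPy record arrays as the chunk source (CSV is used here to keep the sketch dependency-free; with PyTables installed, the same loop works with `pandas.HDFStore.append` to build a single HDF5 table):

```python
import numpy as np
import pandas as pd

def append_chunks(chunks, path):
    """Write an iterable of NumPy record arrays to `path` one chunk at a time.

    Peak memory is roughly one chunk, regardless of total row count.
    """
    first = True
    for chunk in chunks:
        # Write the header only once, then append subsequent chunks.
        pd.DataFrame(chunk).to_csv(
            path, mode="w" if first else "a", header=first, index=False
        )
        first = False
```

The chunk source here is a stand-in for whatever chunked download the API might offer; nothing in flathub's current Python client is assumed.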

dylex commented 6 months ago

Most of the catalogs have direct download links to the original sources on their about pages. For Gaia DR3, you can find full download information here: https://sdsc-users.flatironinstitute.org/~gaia/dr3/README.txt. This will be much more efficient than pulling all the data out of the database.

However, the memory issue is just a limitation of the python client, which downloads into a numpy array in memory. You can download directly to a file through the API if you really want to, but it will be very slow.
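For the direct-download route, a file from that mirror can be streamed to disk in fixed-size pieces so peak memory stays bounded. A minimal stdlib-only sketch; the README URL above lists the actual data files, and the file name in the usage comment is a placeholder, not a real path:

```python
from urllib.request import urlopen

def stream_to_file(url, dest, chunk_bytes=16 * 1024 * 1024):
    """Copy `url` to `dest` in `chunk_bytes` pieces (default 16 MiB).

    Peak memory is roughly one chunk, independent of the file's size.
    """
    with urlopen(url) as resp, open(dest, "wb") as out:
        while True:
            chunk = resp.read(chunk_bytes)
            if not chunk:
                break
            out.write(chunk)

# Usage (placeholder file name -- take real names from the README listing):
# stream_to_file("https://sdsc-users.flatironinstitute.org/~gaia/dr3/<file>",
#                "gaia_part.dat")
```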