Adding new data to an asdf file [question]

Vital-Fernandez commented 2 years ago

Hello everyone. I am writting a library to measure lines (https://lime-stable.readthedocs.io/) and I would like to add support for asdf files.

The library stores the data in a pandas dataframe and then stores them as a .fits, .txt, .xlsx or .pdf as requested by the user.

Taking the .fits file template for asdf files. I convert the dataframe to a numpy record array specifying the dtypes for every column. For example:

import asdf
import numpy as np
import pandas as pd

columns_dtypes = {'wavelength': '<f8', 'intg_flux': '<f8', 'intg_err': '<f8', 'gauss_flux': '<f8', 'gauss_err': '<f8',
                'eqw': '<f8', 'eqw_err': '<f8', 'ion': '<U50', 'latex_label': '<U100', 'profile_label': '<U120',
                'pixel_mask': '<U100', 'w1': '<f8', 'w2': '<f8', 'w3': '<f8', 'w4': '<f8', 'w5': '<f8', 'w6': '<f8',
                'w_i': '<f8', 'w_f': '<f8', 'peak_wave': '<f8', 'peak_flux': '<f8', 'cont': '<f8', 'std_cont': '<f8',
                'm_cont': '<f8', 'n_cont': '<f8', 'snr_line': '<f8', 'snr_cont': '<f8', 'z_line': '<f8', 'amp': '<f8',
                'center': '<f8', 'sigma': '<f8', 'amp_err': '<f8', 'center_err': '<f8', 'sigma_err': '<f8', 'v_r': '<f8',
                'v_r_err': '<f8', 'sigma_vel': '<f8', 'sigma_vel_err': '<f8', 'sigma_thermal': '<f8', 'sigma_instr': '<f8',
                'pixel_vel': '<f8', 'FWHM_intg': '<f8', 'FWHM_g': '<f8', 'FWZI': '<f8', 'v_med': '<f8', 'v_50': '<f8',
                'v_5': '<f8', 'v_10': '<f8', 'v_90': '<f8', 'v_95': '<f8', 'chisqr': '<f8', 'redchi': '<f8',
                'aic': '<f8', 'bic': '<f8', 'observations': '<U50', 'comments': '<U50'}

# Reading the log from a .txt file
log_txt_file = './gp121903_linelog.txt'
log_df = pd.read_csv(log_txt_file, delim_whitespace=True, header=0, index_col=0)

# Saving the log to a .asdf file
log_asdf_file = './gp121903_linelog.asdf'
tree = {'log0': log_df.to_records(index=True, column_dtypes=columns_dtypes, index_dtypes='<U50')}
af = asdf.AsdfFile(tree)
af.write_to(log_asdf_file)

Doing the opposite operation, the asdf reproduces the initial dataframe:

# Reading the log in a .asdf file
ext = 'log0'
with asdf.open(log_asdf_file) as af:
    logRA = af[ext]
    logASDF = pd.DataFrame.from_records(logRA, columns=logRA.dtype.names)
    logASDF.set_index('index', inplace=True)

print(logASDF.equals(log_df))

Finally, I would like to append new tables to the asdf file. For example, producing an asdf file with multiple tables for the spaxels in and IFU datacube or an asdf containing the measurements from several logs.

I did not understand the difference between the external and internal formats and

I have been stucked in the .update function:

# Adding several extensions
for i in [0.0, 1.0, 2.0, 3.0]:
    ext = f'log{i:.0f}'
    log_df['intg_flux'] = i
    tree = {ext: log_df.to_records(index=True, column_dtypes=columns_dtypes, index_dtypes='<U50')}

    with asdf.open(log_asdf_file, 'rw') as af:
        af.tree.update(tree)
        af.update()

But this structure is failing to write the new tree:

OSError: Can not update, since associated file is read-only. Make sure that the AsdfFile was opened with mode='rw' and the underlying file handle is writable.

I wonder if you could suggest me the right procedure to append data to an asdf file.

Moreover, any advice or correction to this "pandas dataframe -> asdf file " workflow is welcomed.

gp121903_linelog.txt

WilliamJamieson commented 2 years ago

First to address the error, your .update code is almost correct. However, if you look at the asdf.open documentation notice that mode is not the second argument, instead uri is. Thus you simply need to to adjust your example to:

# Adding several extensions
for i in [0.0, 1.0, 2.0, 3.0]:
    ext = f'log{i:.0f}'
    log_df['intg_flux'] = i
    tree = {ext: log_df.to_records(index=True, column_dtypes=columns_dtypes, index_dtypes='<U50')}

    with asdf.open(log_asdf_file, mode='rw') as af:
        af.tree.update(tree)
        af.update()

This should work just fine.

Now for more general comments, you may want to write a schema and converter for the table "objects" that you are working with as this better leverages the more general powers of asdf. This is especially true if within your library you are using your own objects to handle the information (I have not looked deeply into your library to know how it works). Please feel free to ask further questions if you want to go this route. Regardless, your desire to write pandas dataframes to asdf files does serve as a use case for why asdf should provide direct support for pandas (specifically dataframes) via a converter/schema interface (something I have been thinking should be added for some time now). This will likely take the form of a secondary extension package similar to asdf-astropy which will be installed separately from asdf itself.

I am tagging @perrygreenfield to suggest the "right/suggested" procedure for updating an existing asdf file.

Vital-Fernandez commented 2 years ago

Thank you very much @WilliamJamieson and @perrygreenfield for all this work and for taking the time to answer my questions.

Your solution worked perfectly.

In my library, I am not creating any special objects to store the data. I am just using dataframes for easy indexing the measurements.

However, to store the data "physically", I would rather follow the guidelines for the files format. For example, if the user wants a .fits with the measurements from several spectra, each new table updates the .fits via this workflow:

Pandas Dataframe -> Record array -> Columns -> Binary HDU table -> Append the BinTableHDU¶ to the file with a given extension

I think that record arrays aren't a good input for asdf (they don't work with compressions). So if you have any suggestion on the fastest/smallest-file workflow to update an asdf file with multi-dtype tables in asdf files, I would love to have your insight.

Thanks again.

perrygreenfield commented 2 years ago

Appending data may be simple or not so simple depending on the details. For the case where one is adding new arrays or tables, appending is in principle a low cost I/O operation, so far as there is sufficient space in the yaml header for the additional entry pointing to the added binary block(s). The solution to avoiding a massive rewrite of the binary blocks is to build in extra padded space into the yaml header to anticipate further additions. And that made me realize we don't yet have a mechanism for padding the header! Thanks for inadvertently bring this to our attention. It definitely is needed.

CagtayFabry commented 2 years ago

Does the update workflow also work on 'exploded' asdf files? That could prevent any changes to the binary files unless changed.

WilliamJamieson commented 2 years ago

Does the update workflow also work on 'exploded' asdf files? That could prevent any changes to the binary files unless changed.

This should work; however, I have not tested if this actually works. If you try this and it fails for some reason then that is a definite bug.

asdf-format / asdf

Adding new data to an asdf file [question] #1136