astrodbtoolkit / astrodb_utils

https://astrodb-utils.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Use `sqlalchemy` classes for ingesting instead of dictionaries #43

Open kelle opened 7 months ago

kelle commented 7 months ago

Takes more advantage of the schema and is more ...direct. @dr-rodriguez to provide an example.

dr-rodriguez commented 7 months ago

Here is an example from the documentation (see https://astrodbkit2.readthedocs.io/en/latest/#using-the-sqlalchemy-orm):

from astrodbkit2.astrodb import Database
from schema import Sources

# connection_string is any SQLAlchemy connection string, e.g. "sqlite:///mydb.sqlite"
db = Database(connection_string)

# Create a new object and insert it into the database
s = Sources(source="V4046 Sgr", ra=273.54, dec=-32.79)
with db.session as session:
    session.add(s)
    session.commit()

This works for any table, not just Sources, and we can also create a list of these objects and add them all in one go.
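
For the list case, a minimal sketch using SQLAlchemy's standard session API (the second source's values are illustrative):

sources = [
    Sources(source="V4046 Sgr", ra=273.54, dec=-32.79),
    Sources(source="TW Hya", ra=165.46, dec=-34.70),  # illustrative values
]
with db.session as session:
    session.add_all(sources)  # insert the whole list in one call
    session.commit()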

kelle commented 7 months ago

If I wanted to add more values, would I do:

s.comment = "string comment"

dr-rodriguez commented 7 months ago

I'm actually not sure; I expect so, but it's worth trying out.

dr-rodriguez commented 7 months ago

Confirmed that things like s.comments = "this is a comment" work, so one can build the object iteratively.

dr-rodriguez commented 6 months ago

I've been thinking about this some more and would like to propose we create a function in astrodb_utils of the type:

def ingest_object(db, DBTable, data_dict):
    # Build the ORM object from the dictionary and insert it into the database
    obj = DBTable(**data_dict)
    with db.session as session:
        session.add(obj)
        session.commit()

In this function, db is the AstrodbKit2 Database object, DBTable is a SQLAlchemy ORM table (e.g., Sources in from schema import Sources), and data_dict is a dictionary containing the information to store, with keys that match the column names.

There are a lot of benefits to this approach:

  1. This is completely table agnostic. If you want to load a new source: ingest_object(db, Sources, data_dict); if you want to ingest new photometry: ingest_object(db, Photometry, data_dict). New spectra? New gravities? It all works the same way (see the sketch after this list).
  2. Validation happens in several places: not only do the dictionary keys have to match, but any validation/transformation in the schema python file gets run, in addition to any database checks on ingest. This can help catch more issues earlier.
  3. The table schema becomes even more flexible. If users want their own database, but with a Sources table that uses Galactic l, b instead of Equatorial ra, dec, they can do so and this will still work.
  4. There is no need to write ingest functions for each table; this one takes care of all of them. All we need to provide is helper functions for any data transformations needed in data_dict, plus good documentation so users know which ones they can use and what an ingest pipeline can look like. In fact, I would argue against creating "blackbox" functions like ingest_sources and instead guide users on how to build a script that calls this ingest_object function.
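
As a concrete sketch of what calling this for different tables might look like (the dictionary contents and the Photometry column names here are illustrative):

from schema import Sources, Photometry  # ORM classes from the schema module

# Same function, different tables: only the ORM class and the dictionary change
ingest_object(db, Sources, {"source": "V4046 Sgr", "ra": 273.54, "dec": -32.79})
ingest_object(db, Photometry, {"source": "V4046 Sgr", "band": "2MASS.Ks", "magnitude": 8.1})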

I strongly recommend we consider this, as it will make creating and maintaining the Toolkit far easier. There's still a place for helper functions, but it does mean the Toolkit moves toward guidelines and tutorials rather than lots of specific code.

Thoughts @kelle @arjunsavel @Will-Cooper?

kelle commented 6 months ago

I'll need to try it out!

One of the most important things is whether the error messages it throws are useful. I would say that's one of the most valuable aspects of the ingest functions: they do a bunch of checks and provide useful feedback.

dr-rodriguez commented 6 months ago

That is a fair point, and I think that's where some of the schema validation functions can come in handy. That said, one can also have functions that check certain aspects of data_dict prior to the actual ingest, similar to what our older functions do.
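
For example, a pre-ingest check could be as simple as the following sketch (the function name and the specific rules are purely illustrative):

def check_source_dict(data_dict):
    # Hypothetical pre-ingest checks; real rules would mirror what the old ingest functions verify
    missing = {"source", "ra", "dec"} - data_dict.keys()
    if missing:
        raise ValueError(f"data_dict is missing required keys: {sorted(missing)}")
    if data_dict["ra"] is not None and not 0 <= data_dict["ra"] < 360:
        raise ValueError(f"ra out of range: {data_dict['ra']}")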

Will-Cooper commented 6 months ago

I do like that approach, especially if decorated with something like np.vectorize so that we can write the logic for a single object but still pass an entire list of objects. Would we then just write a long list like ra = data_dict.get('ra', None)?

> Confirmed that things like s.comments = "this is a comment" work, so one can build the object iteratively.

We could handle things like this with property setters, with nice error messages returned; e.g., I have code like:

    # (assumes `from typing import Optional` and `import astropy.units as u` at module level)
    @staticmethod
    def __assertquantity__(value: Optional[u.Quantity], optional: bool):
        if optional and value is not None and not isinstance(value, u.Quantity):
            raise AttributeError(f'Value {value} must be numeric or a quantity (or None)')
        elif not optional and not isinstance(value, (u.CompositeUnit, u.Unit, u.IrreducibleUnit)):
            print(value, type(value))  # debug output before raising
            raise AttributeError(f'Value {value} must be a unit')
        return True

    @property
    def wunit(self) -> u.Unit:
        return self._wunit

    @wunit.setter
    def wunit(self, value):
        # __assertquantity__ raises on failure, so this check is a second line of defense
        if not self.__assertquantity__(value, False):
            raise AttributeError('wunit must be an astropy unit')
        self._wunit = value

dr-rodriguez commented 6 months ago

No need for fetching the individual columns; that gets unpacked as part of **data_dict. If a required column is not provided, SQLAlchemy will error out. And as for setters/getters, that's also already built into the ORM; see, for example, the validation we have in https://github.com/astrodbtoolkit/astrodb-template-db/blob/main/schema/schema_template.py#L179. We can expand that as needed, and while it's called validation, it can also serve to do transforms (e.g., to extract the value of a Quantity object in a particular unit).
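
For reference, a minimal self-contained sketch of that ORM-level validation pattern (the column names and the range check are illustrative, not the template schema's exact code):

from sqlalchemy import Column, Float, String
from sqlalchemy.orm import declarative_base, validates

Base = declarative_base()

class Sources(Base):
    __tablename__ = "Sources"
    source = Column(String, primary_key=True)
    ra = Column(Float)

    @validates("ra")
    def validate_ra(self, key, value):
        # Runs on attribute assignment, so bad values are caught
        # before the row ever reaches the database
        if value is not None and not 0 <= value < 360:
            raise ValueError(f"{key} out of range: {value}")
        return value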

As for passing a list, we could vectorize it, but SQLAlchemy already has logic to handle inserting lists. For example, we could rewrite the main logic as:

data_list = []
for row in data_dict_list:
    datum = DBTable(**row)
    data_list.append(datum)

with db.session as session:
    session.add_all(data_list)
    session.commit()

One of the interesting things about this approach is that, if you are doing this manually, data_list can contain a mix of objects (some Sources, some Photometry, some Spectra) and you can still use session.add_all to ingest them all in one call. I think writing it up this way is beyond the scope of what we want to do, but we can provide guidelines if people want to approach it that way.
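
A rough sketch of that mixed-type ingest, for anyone who does want to approach it that way (again, the Photometry column names are illustrative):

from schema import Sources, Photometry  # ORM classes from the schema module

# One list holding rows destined for different tables
data_list = [
    Sources(source="V4046 Sgr", ra=273.54, dec=-32.79),
    Photometry(source="V4046 Sgr", band="2MASS.Ks", magnitude=8.1),  # illustrative columns
]

with db.session as session:
    session.add_all(data_list)  # a single call ingests into both tables
    session.commit()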