COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0
58 stars 25 forks source link

Refactor index_experiment to take only a list of files to index. #234

Closed aidanheerdegen closed 3 years ago

aidanheerdegen commented 3 years ago

Closes #233

Logic about which files to filter and prune is moved up to build_index.

Added prune_files function which is called from prune_experiment and build_index where required.

Made a dedicated file_files function with option to specify a pattern to match when globbing.

aidanheerdegen commented 3 years ago

Hmm, most of the tests are passing when I run them locally. Have to take a look tomorrow.

aidanheerdegen commented 3 years ago

I can reproduce these errors by upgrading to sqlalchemy=1.4.1 (current unstable conda env has 1.3.22). So it seems like there has been a change in behaviour.

aidanheerdegen commented 3 years ago

Maybe there is something in here that rings a bell @angus-g ?

https://docs.sqlalchemy.org/en/14/changelog/migration_14.html

Sounds like it is a pretty big upgrade. Could just pin testing to 1.3 until it is sorted ...

angus-g commented 3 years ago

Yeah, let's pin to 1.3 until we work out the changes needed for 1.4!

aidanheerdegen commented 3 years ago

I need to add some tests, but wanted to get some feedback @angus-g. Is this a useful/good way to proceed?

angus-g commented 3 years ago

Is this a useful/good way to proceed?

I think so! Still maybe a few too many flags on build_index? I'm not sure if we need update, prune or delete: it seems like updating is probably something you always want to do (and/or it's covered by force), and I doubt anybody used the other two, plus they're better covered by dedicated functions imo.

aidanheerdegen commented 3 years ago

Still maybe a few too many flags on build_index? I'm not sure if we need update, prune or delete: it seems like updating is probably something you always want to do (and/or it's covered by force), and I doubt anybody used the other two, plus they're better covered by dedicated functions imo.

I agree update seems sort of redundant. Struggling to think of a use case where you'd call build_index with update=False. So I'd be up for removing it. As you say, not sure anyone uses it.

prune also seems like a no-brainer in order to keep the DB in a non-broken state. In a way delete is almost better as an option for prune. As in yes always prune by default, and that can either be set to missing or delete. Not sure which is a better default.

aidanheerdegen commented 3 years ago

I've updated one of the tests to test the force option. This fails because of the UNIQUE constraint:

>       cursor.execute(statement, parameters)
206
E       sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: ncfiles.experiment_id, ncfiles.ncfile
207
E       [SQL: INSERT INTO ncfiles (index_time, ncfile, present, experiment_id, time_start, time_end, frequency) VALUES (?, ?, ?, ?, ?, ?, ?)]
208
E       [parameters: ('2021-03-19 04:43:03.276985', 'output000/test.nc', 0, 1, None, None, None)]
209
E       (Background on this error at: http://sqlalche.me/e/13/gkpj)

I took a look but it wasn't obvious to me how to fix this. Do you have any hints @angus-g?

angus-g commented 3 years ago

A sledgehammer approach could be to add the UniqueMixin like Keyword or CFVariable, and then update all the indexed files with as_unique.

I think it'd be easier and less fiddly to just query for the file in the "update all variables to be unique" loop, and delete existing matches though, since the UniqueMixin probably isn't quite set up for a top-level object like that.

aidanheerdegen commented 3 years ago

Ugh. Now build_index is getting bigger and more unwieldy. Happy to hear if there is a more elegant way to do the deletion of existing NCFile records @angus-g.

aidanheerdegen commented 3 years ago

Thanks for the review and sqlalchemy magics @angus-g

aidanheerdegen commented 3 years ago

I agree. I thought about fixing https://github.com/COSIMA/cosima-cookbook/issues/238 as well, as I think it messes a bit with the final logic used in this change, but decided to leave that to another day.

navidcy commented 3 years ago

I agree update seems sort of redundant. Struggling to think of a use case where you'd call build_index with update=False. So I'd be up for removing it. As you say, not sure anyone uses it.

oh, it was used lots. :) Especially after being part of the cosima-recipes tutorial about creating custom db.