Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License
4 stars 1 forks source link

Revised how dataframes are stored for efficiency #931

Closed hepcat72 closed 3 months ago

hepcat72 commented 3 months ago

Summary Change Description

I realized that a number of methods were converting back and forth between dataframes and dicts and I realized that it would all be much easier if I save the production of actual dataframes until after the samples (etc) have been added.

So I converted all instances of dfs_dict from a dict of pandas dataframes to a dict of pandas' style dicts, e.g. {"columnname": {0: "row 1 value", 1: "row 2 value"}}.

This prevents the necessity of various methods converting back and forth between dataframes and dicts. It also addresses issues of places where I was temporarily using lists instead of pandas' version of lists (dicts keyed by sequential integeras starting from 0).

Affected Issues/Pull Requests

Review Notes

See comments in-line.

Checklist

This pull request will be merged once the following requirements are met. The author and/or reviewers should uncheck any unmet requirements:

lparsons commented 3 months ago

Also, it's not clear why this is stacked on top of the all_missing_treatments_exception branch and not merging directly into main. It seems tangential, but not a blocking issue.

hepcat72 commented 3 months ago

This does seem to remove some calls to to_dict() though it's not clear to me why things were DataFrames in the first place and what functionality was gained/lost.

In the later PR(s), you'll see that I'm using pandas' ability to export a dataframe .to_excel(). There may very well be a better way to do it, but I'm just going with what I know. At this point, I realized that the conversion to dataframe was simply premature. It's easier to work with the dicts.

One downside is that many of the function names are now rather misleading and the data structure is less well defined. DataFrame was easy to check for and understand how to use, but the structure now is less easy to ensure it's correct.

I think they are changed in later PRs. I'll check to make sure that they all get renamed. I think they are, but if there are any that weren't, you're right, it's misleading.

Also, it's not clear why this is stacked on top of the all_missing_treatments_exception branch and not merging directly into main. It seems tangential, but not a blocking issue.

I think what you mean to say is that it should branch off the branch before all_missing_treatments_exception, since this code depends on that code. However, I do use the treatments exception in later PRs, so it is dependent. This was my way of killing multiple birds with one stone, and the design effort on the existing exceptions aided my handling of the missing sample exceptions.

lparsons commented 3 months ago

This does seem to remove some calls to to_dict() though it's not clear to me why things were DataFrames in the first place and what functionality was gained/lost.

In the later PR(s), you'll see that I'm using pandas' ability to export a dataframe .to_excel(). There may very well be a better way to do it, but I'm just going with what I know. At this point, I realized that the conversion to dataframe was simply premature. It's easier to work with the dicts.

Makes total sense 👍🏻

One downside is that many of the function names are now rather misleading and the data structure is less well defined. DataFrame was easy to check for and understand how to use, but the structure now is less easy to ensure it's correct.

I think they are changed in later PRs. I'll check to make sure that they all get renamed. I think they are, but if there are any that weren't, you're right, it's misleading.

OK, that's good. It would be nice to make those changes here, however, before merging, instead of relying on a separate PR that isn't related to this minor refactor.

hepcat72 commented 3 months ago

Sorry, I forgot and already merged. Too many concurrent conversations, lol. I suspect they're all changed. 🤞🏻

hepcat72 commented 3 months ago

Ok. phew They are changed... with one caveat: I intentionally left get_dataframe_template(). My thinking was that they are a template for a dataframe, which explains their structure (using dict keys that are sequential integers). If you don't use integers, they don't play well with excel files that are read into dataframes (and dicts).