AllenNeuralDynamics / aind-data-schema-models

Data models used in aind-data-schema
MIT License

50 generate subset groups automatically from csv files #58

Closed dbirman closed 1 month ago

dbirman commented 3 months ago

This PR adds an option to generate a subset group from a CSV column. Previously, a subset was defined by hand:


MouseAnatomicalStructure.EMG_MUSCLES = one_of_instance([
    MouseAnatomicalStructure.DELTOID,
    MouseAnatomicalStructure.PECTORALIS_MAJOR,
    MouseAnatomicalStructure.TRICEPS_BRACHII,
    MouseAnatomicalStructure.BICEPS_BRACHII,
    MouseAnatomicalStructure.PARS_SCAPULARIS_OF_DELTOID,
    MouseAnatomicalStructure.EXTENSOR_CARPI_RADIALIS_LONGUS,
    MouseAnatomicalStructure.EXTENSOR_DIGITORUM_COMMUNIS,
    MouseAnatomicalStructure.EXTENSOR_DIGITORUM_LATERALIS,
    MouseAnatomicalStructure.EXTENSOR_CARPI_ULNARIS,
    MouseAnatomicalStructure.FLEXOR_CARPI_RADIALIS,
    MouseAnatomicalStructure.FLEXOR_CARPI_ULNARIS,
    MouseAnatomicalStructure.FLEXOR_DIGITORUM_PROFUNDUS,
])

With this PR, the same subset is generated by:

mouse_objects = read_csv(str(files("aind_data_schema_models.models").joinpath("mouse_dev_anat_ontology.csv")))
MouseAnatomicalStructure.EMG_MUSCLES = subset_from_column(MouseAnatomicalStructure, mouse_objects, "EMG_MUSCLES")

This requires adding a new EMG_MUSCLES column to the CSV file and putting a "1" in each row that should be included.
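To make the column-flag idea concrete, here is a minimal, self-contained sketch of how a flagged column could be turned into a subset. It is not the PR's implementation of subset_from_column: the helper subset_from_rows, the "name" column, the inline CSV text, and the simple upper-case/underscore naming rule are all assumptions for illustration, and the one_of_instance wrapping is left out.

```python
import csv
import io
from types import SimpleNamespace

# Hypothetical stand-in for the ontology CSV; the real file's columns may differ.
CSV_TEXT = """name,EMG_MUSCLES
deltoid,1
pectoralis major,1
femur,
triceps brachii,1
"""


def attribute_name(name):
    """Assumed naming rule: upper-case the name and join words with underscores."""
    return name.strip().upper().replace(" ", "_")


def subset_from_rows(model, rows, column):
    """Return the model attributes for every row flagged with "1" in `column`."""
    return [
        getattr(model, attribute_name(row["name"]))
        for row in rows
        if (row.get(column) or "").strip() == "1"
    ]


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Toy model: one attribute per CSV row, holding the original name as its value.
model = SimpleNamespace(**{attribute_name(r["name"]): r["name"] for r in rows})

emg_muscles = subset_from_rows(model, rows, "EMG_MUSCLES")
print(emg_muscles)
```

Because membership lives in the CSV, adding a muscle to the subset in this sketch means editing one cell rather than editing the Python list.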

We could go one step further and auto-generate a subset for every CSV column whose name is written in SCREAM_CASE (all caps with underscores)?
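A rough sketch of what that auto-detection might look like, assuming subset columns are identified purely by an all-caps, underscore-separated header and that rows are flagged with "1". The function names, the regular expression, and the sample rows are hypothetical and not part of the package.

```python
import re

# Assumed convention: a subset column is all caps, optionally with digits,
# with words separated by single underscores (e.g. EMG_MUSCLES).
SCREAMING_CASE = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*")


def subset_columns(fieldnames):
    """Return the header names that look like subset columns."""
    return [name for name in fieldnames if SCREAMING_CASE.fullmatch(name)]


def flagged_names_by_column(rows, name_key="name"):
    """Map each detected subset column to the names of the rows flagged "1"."""
    if not rows:
        return {}
    return {
        column: [
            row[name_key]
            for row in rows
            if str(row.get(column, "")).strip() == "1"
        ]
        for column in subset_columns(rows[0].keys())
    }


# Hypothetical rows, shaped like the output of csv.DictReader.
rows = [
    {"name": "deltoid", "EMG_MUSCLES": "1", "FORELIMB_BONES": ""},
    {"name": "humerus", "EMG_MUSCLES": "", "FORELIMB_BONES": "1"},
]
subsets = flagged_names_by_column(rows)
print(subsets)
```

One caveat with a pattern-only rule: an ordinary all-caps header such as "ID" would also match, so a real implementation would probably need an explicit allow-list or a column prefix.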

saskiad commented 2 months ago

So this means that to make new subsets we add columns to the CSV file, and if we want to add new items to a list (e.g. more EMG options) we update the CSV? I'm a little reluctant about this, as I think keeping the CSV static will be more robust. I can imagine that we have one list of EMG muscles, but another group has a different list of body parts they want to use, and we then end up with different versions of the CSV floating around.

dbirman commented 2 months ago

I see that downside for sure. The upside I was thinking about is that it's not very user-friendly to generate the subset lists right now, since the model.ATTRIBUTE values aren't visible when you're writing code. When I wrote the EMG list I was basically tabbing back and forth between the CSV and the code file and manually turning the names into the capital + underscore versions that get attached to the model. I believe the tests fail if you make a typo, but the CSV-column approach would be robust to that.
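For illustration, the manual name conversion described above could be scripted roughly as follows. This is only a sketch: the conversion rule (upper-case, non-alphanumeric runs become underscores) is an assumption about how attribute names are derived, and to_attribute_name and render_subset are hypothetical helpers, not functions in the package.

```python
import re


def to_attribute_name(name):
    """Assumed rule: non-alphanumeric runs become "_", then upper-case."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").upper()


def render_subset(model_name, subset_name, names):
    """Render the hand-written one_of_instance([...]) block as source text."""
    lines = [f"{model_name}.{subset_name} = one_of_instance(["]
    lines += [f"    {model_name}.{to_attribute_name(n)}," for n in names]
    lines.append("])")
    return "\n".join(lines)


# Example names are assumed to match how the muscles appear in the CSV.
snippet = render_subset(
    "MouseAnatomicalStructure",
    "EMG_MUSCLES",
    ["deltoid", "pectoralis major", "pars scapularis of deltoid"],
)
print(snippet)
```

This keeps the CSV static and only automates the typing, at the cost of having to regenerate the pasted list whenever the names change.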

dbirman commented 2 months ago

2024/09/05 - discussed w/ Saskia, pulling in David to discuss issues around very large CSV files (100k+ rows)