catalystneuro / leifer_lab_to_nwb

Conversion scripts for the Leifer lab. Includes the publication Neural signal propagation atlas of Caenorhabditis elegans (Nature, 2023).
BSD 3-Clause "New" or "Revised" License

Subject Info YAML proposals #16

Closed CodyCBakerPhD closed 3 days ago

CodyCBakerPhD commented 1 month ago

@emosb Here are some proposals for YAML structure from our meeting today

I think we reached agreement that you should send one initial file with all subject info for the functional connectivity project; then, for current projects, there would be multiple files, one for each day of experiments (exactly like your current logbook, but more specifically structured)

a)

1:  # This is the global 'subject_id', or perhaps a 'relative' one per day; as in, first subject of the day = 1
  age_in_days: 5
  ...
  date_of_birth: 5/31/2024  

This has the advantage that the subject ID is mapped in a dictionary for easy look-up. For example, if I load the YAML content in Python into a variable named subject_info, then subject_info[1] gives me all the information related to the subject with ID equal to 1
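As a rough sketch of that look-up: the dictionary below is a hand-written stand-in for what a YAML loader (for example PyYAML's `yaml.safe_load`, which is an assumption here, not something settled in this thread) would be expected to return for structure (a). The fields hidden behind `...` in the proposal are left out rather than guessed.

```python
# Hypothetical stand-in for the loaded YAML content of structure (a).
# The top-level key is the subject ID; fields elided by '...' are omitted.
subject_info = {
    1: {"age_in_days": 5, "date_of_birth": "5/31/2024"},
}

# Direct look-up by subject ID; no iteration over entries is needed.
subject = subject_info[1]
print(subject["age_in_days"])

# A missing ID can be handled without an exception by using .get().
missing = subject_info.get(2)
print(missing is None)
```

One thing to keep in mind with this layout is that the ID only exists as the key, so each inner dictionary on its own does not say which subject it belongs to.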

b)

1:  # This is the global 'subject_id', or perhaps a 'relative' one per day; as in, first subject of the day = 1
  subject_id: 1
  age_in_days: 5
  ...
  date_of_birth: 5/31/2024  

This is the same as (a) except the subject_id is also included in the 'info' of that subject

This has the advantage that the value of 1 is explicitly seen to be the subject_id (in case there was any question about that)

But it has the disadvantage that the two fields could get 'out of sync' with each other, producing a nonsensical entry such as

123:  # This is the global 'subject_id', or perhaps a 'relative' one per day; as in, first subject of the day = 1
  subject_id: 456
  age_in_days: 5
  ...
  date_of_birth: 5/31/2024  
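An 'out of sync' entry like this is easy to catch automatically. The following is a minimal sketch, assuming the YAML has already been loaded into a dictionary shaped like structure (b); the helper name `find_mismatched_ids` is hypothetical, and the elided `...` fields are omitted.

```python
def find_mismatched_ids(subject_info):
    """Return (key, subject_id) pairs where the top-level key and the
    nested 'subject_id' field disagree (or the field is missing)."""
    return [
        (key, info.get("subject_id"))
        for key, info in subject_info.items()
        if info.get("subject_id") != key
    ]


# Hypothetical stand-in for loaded YAML content of structure (b):
# one consistent entry and the nonsensical one from the example above.
subject_info = {
    1: {"subject_id": 1, "age_in_days": 5, "date_of_birth": "5/31/2024"},
    123: {"subject_id": 456, "age_in_days": 5, "date_of_birth": "5/31/2024"},
}

mismatches = find_mismatched_ids(subject_info)
print(mismatches)  # [(123, 456)]
```

Running a check like this when the file is read turns the redundancy into a safeguard against hand-entry mistakes rather than a source of silent errors.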

c)

- subject_id: 1 
  age_in_days: 5
  ...
  date_of_birth: 5/31/2024  

This is the one proposed in our meeting; it is the 'simplest' since it is just a container (specifically a list), but it has the slight disadvantage that anyone who wants to look up a specific subject would have to iterate over the entries of the list and test whether the subject_id field equals the value they are looking for

But all information would be kept bundled together as a single dictionary and a single entry in the list
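For comparison, a look-up on the list form might be sketched as below. The list is a hand-written stand-in for loaded YAML content of structure (c); the second entry and the helper name `find_subject` are purely illustrative and not taken from any real logbook.

```python
def find_subject(subjects, subject_id):
    """Scan the list and return the first entry whose 'subject_id'
    matches, or None if there is no such entry."""
    for subject in subjects:
        if subject["subject_id"] == subject_id:
            return subject
    return None


# Hypothetical stand-in for loaded YAML content of structure (c).
subjects = [
    {"subject_id": 1, "age_in_days": 5, "date_of_birth": "5/31/2024"},
    {"subject_id": 2, "age_in_days": 4, "date_of_birth": "6/1/2024"},  # illustrative
]

found = find_subject(subjects, 2)
print(found["age_in_days"])

# If many look-ups are needed, the list can be re-keyed once into a
# dictionary, which recovers the direct look-up of structures (a) and (b).
by_id = {subject["subject_id"]: subject for subject in subjects}
print(by_id[1]["date_of_birth"])
```

So the list form does not rule out fast look-ups; it just moves the indexing step from the file into the code that reads it.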

I have a slight personal preference for (b), but which one to choose comes down to what you would prefer, since you are the one working with it the most

And let me know if you have any other ideas for alternate structures

emosb commented 3 weeks ago

I agree that (b) is preferable, and I like the idea of having at least one redundant field as a check for this project, especially since this is being done by hand.

As it applies to current projects, I like the idea of counting up from 0 (or 1, depending on indexing preference) within each day/session (e.g. the first recording of the day is 1, the second of the day is 2, etc.). Once the list of recordings to include is finalized, an overall YAML file like this one could be generated, where the global subject # indicates the project-specific IDs and the subject_id info field is the internal tag. This info field would then act as a double-check that we are pointing to the right pumpprobe folder (the ith folder from the given date). Again, in this case the subject_id info field is not useful to anyone outside the lab, but it might allow a lab member to locate and verify a subject's data more quickly.

CodyCBakerPhD commented 3 days ago

It is decided - (b) it is then!

Once the file is done on your end, just send it off my way and I'll start incorporating it into the conversion