dataset_names determined in two different ways

acomrie commented 2 years ago

It appears that dataset_names is determined in two slightly different ways, but then the one way is used to access the other. Most of the time the two ways yield the same result, but when they do not, a bug arises.

nwb_file_builder.py's __extract_datasets() calls self.data_scanner.extract_data_from_date_folder(), which goes on to use __extract_experiments() which then uses __extract_datasets() in data_scanner.py. That final __extract_datasets() determines dataset_names with one approach. Then, separately, in nwb_file_builder.py, dataset_names is determined with another approach, using self.data_scanner.get_all_epochs(). It seems like these two spots should use the same rather than two slightly different approaches to determine dataset_names. That is because later, in nwb_file_builder.py, we do the following list comprehension: [self.data_scanner.data[animal_name][date][dataset] for dataset in self.dataset_name] If the two approaches have yielded distinct dataset_names, then the dataset key will not have any value.

The two approaches seem to yield distinct dataset_nameswhen a filename has an optional field following the sleep/run descriptor. For example, instead of _date_animal_epochnumberdescription.fileending, if the filename is _date_animal_epochnumber_descriptionoptionaltag.fileending, then the dataset_names may have either the form _epochnumberoptionaltag or _epochnumberdescription with the two appraoches, respectively. Instead, we want both methods to yield a string of the form _epochnumberdescription.

I can rename files to avoid having this optionaltag at the end of the filename if needed, but it would be useful to be able to add optional info into the filename, while still satisfying our documentation's naming conventions (per regular expression: ^(\d*)_([a-zA-Z0-9]*)_(\d*)_{0,1}(\w*)\.{0,1}(.*)\.([a-zA-Z0-9]*)$). Moreover, it is still odd for us to determine dataset_names in two different ways and then use them as if they are the exact same. Is there a reason it is the way it is that I am missing? Would it be okay to change it so that both get_all_epochs() and __extract_datasets() determine the dataset_names the same way? I can work on PR if someone who is more familiar with this code thinks it seems okay, but I am not super familiar with this code so wanted to see what folks think first. Thanks much for any thoughts.

Example error -

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_3397871/450084155.py in <module>
     37                               video_path=video_path,
     38                               trodes_rec_export_args = trodes_rec_export_args)
---> 39         content = builder.build_nwb()
     40         print(content) #just to look
     41         # Automatically delete preprocessing files? Not for this guy given old pipeline use still

~/Src/rec_to_nwb/rec_to_nwb/processing/builder/raw_to_nwb_builder.py in build_nwb(self, run_preprocessing, process_mda_valid_time, process_mda_invalid_time, process_pos_valid_time, process_pos_invalid_time)
    224             process_mda_invalid_time=process_mda_invalid_time,
    225             process_pos_valid_time=process_pos_valid_time,
--> 226             process_pos_invalid_time=process_pos_invalid_time)
    227         logger.info('Done...\n')
    228 

~/Src/rec_to_nwb/rec_to_nwb/processing/builder/raw_to_nwb_builder.py in __build_nwb_file(self, process_mda_valid_time, process_mda_invalid_time, process_pos_valid_time, process_pos_invalid_time)
    237         for date in self.dates:
    238             logger.info('Date: {}'.format(date))
--> 239             nwb_builder = self.get_nwb_builder(date)
    240             content = nwb_builder.build()
    241             nwb_builder.write(content)

~/Src/rec_to_nwb/rec_to_nwb/processing/builder/raw_to_nwb_builder.py in get_nwb_builder(self, date)
    273             reconfig_header=self.__get_header_path(),
    274             # reconfig_header=self.__is_rec_config_valid()
--> 275             **old_dataset_kwargs
    276         )
    277 

~/Src/rec_to_nwb/rec_to_nwb/processing/tools/beartype/beartype.py in func_beartyped(__beartype_func, *args, **kwargs)

~/Src/rec_to_nwb/rec_to_nwb/processing/builder/nwb_file_builder.py in __init__(self, data_path, animal_name, date, nwb_metadata, process_dio, process_mda, process_analog, process_pos_timestamps, preprocessing_path, video_path, output_file, reconfig_header, is_old_dataset, session_start_time)
    218         validation_registrator.validate()
    219 
--> 220         self.__extract_datasets(animal_name, date)
    221 
    222         self.corrupted_data_manager = CorruptedDataManager(self.metadata)

~/Src/rec_to_nwb/rec_to_nwb/processing/builder/nwb_file_builder.py in __extract_datasets(self, animal_name, date)
    339         self.data_scanner.extract_data_from_date_folder(date)
    340         self.datasets = [self.data_scanner.data[animal_name]
--> 341                          [date][dataset] for dataset in self.dataset_names]
    342 
    343     def build(self):

~/Src/rec_to_nwb/rec_to_nwb/processing/builder/nwb_file_builder.py in <listcomp>(.0)
    339         self.data_scanner.extract_data_from_date_folder(date)
    340         self.datasets = [self.data_scanner.data[animal_name]
--> 341                          [date][dataset] for dataset in self.dataset_names]
    342 
    343     def build(self):

KeyError: '01_sleep'

And then if we take a look here, self.data_scanner.data[animal_name][date] evaluates to

{'sleep_p1': <rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2762e27490>, 'sleep_p2': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2762e27990>, 'sleep_p3': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2762e27f90>, 'sleep_p4': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2762e27610>, 'sleep_p5': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2762f3f790>, 'lineartrack_p1': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f276471dc10>, 'lineartrack_p2': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2763a846d0>, 'lineartrack_p3': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2763a84c50>, 'lineartrack_p4': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2763a84390>, 'lineartrack_p5': 
<rec_to_nwb.processing.tools.dataset.Dataset object at 0x7f2762f3e510>}

So it is looking for '01_sleep' correctly, but the thing it is looking for is currently incorrectly called 'sleep_p1'

lfrank commented 2 years ago

I can't think of any reason that would cause a problem, so please do try to fix it.

acomrie commented 2 years ago

I have attempted this, and unfortunately filenames are parsed and inferred in a bunch of different places, and the different places do so slightly differently (rather than all with a few core fxns). I don't think it will be smart to half-fix it with my current changes, and it would be better to either leave it as is and update our lab file naming documentation, or to overhaul some of the rec to nwb code to be more internally consistent in terms of filenaming expectations. I'll go with the former given time considerations at the moment.

acomrie commented 2 years ago

fyi in case people come across this in the future, the updated documentation from google drive now starts w/:

Basic file name format basic regular expression to match file (python re format): ^(\d)_([a-zA-Z0-9])(\d*){0,1}(\w).{0,1}(.).([a-zA-Z0-9]*)$

6 groups: date, animal name, epoch #, label (optional), continuation extension (optional), file extension

date: although reg ex only matches digits, the date should be of format %YYYY%MM%DD animal name: alphanumeric epoch: digits (every epoch 2 digits ##, multiple epochs in a file by series to digit pairs 010203) label: alphanumeric (optional) continuation extension: anything (recommend separator as period '.' but can be any whitespace non-alphanumeric. If separator is period, it is excluded from the match, if it is anything else it is included in the match. For camera/position data used to signify file continuation e.g. 01, 02, 03....) file extension: alphanumeric (e.g. rec, stateScript, h264)

20220506 UPDATE: our current rec_to_nwb code does NOT use the regular expression described above Instead use this format: date_animalname_epochnumber_label.fileextension With date as digits in format %YYYY%MM%DD, animalname as alphanumeric and the SAME as the animal directory name, epoch number as 2 digits (for ex: 01, 02, ... 13, 14...), and the final field before the .fileextension should describe the epoch type alphanumerically without any more underscores or periods (for ex: sleep, run, sleep5, r5, linear9...).

LorenFrankLab / rec_to_nwb

dataset_names determined in two different ways #22