BNL-DAQ-LDRD / NeuralCompression


Add new 3d dataset API and a `gitignore` file #3

Closed pphuangyi closed 3 years ago

pphuangyi commented 3 years ago

Hi everyone,

I added the new 3D dataset API and a `.gitignore` file.

Please comment to help me improve!

Best, Yi

pphuangyi commented 3 years ago

Hi Ray,

I made the following changes

  1. tpc_dataset.py: a dataset API that takes the split filename as its only input;
  2. data_splitter.py: given the path to the data files, generates the split files;
  3. data_loader.py: a dataset-independent data loader function;
    • This function optionally subsamples the dataset: when the lengths argument is given, either as a single integer or as a sequence of integers, the resulting data loader(s) will have at most that many examples.
    • Given one split file, we can get multiple data loaders. So if we also want a validation dataset, we can pass the train split file and specify lengths = [length_train, length_valid], and the function will return two data loaders with length_train training examples and length_valid validation examples. One possible improvement is to use all the examples and split the dataset by ratios instead of absolute lengths.
    • The function is almost dataset-independent; it does require the dataset API to take only one input (the split file), though.
    • The function is still somewhat bulky. Please let me know how I can reorganize it.
    • A test function using the TPC dataset API is included; please uncomment it and run it.
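The lengths-bounded loader behavior described above could be sketched roughly as follows. This is a minimal illustration, not the actual data_loader.py: `get_dataloaders` and its parameters are hypothetical stand-ins, and a toy `TensorDataset` replaces the TPC dataset API (which would take a split file instead).

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

def get_dataloaders(dataset, lengths=None, batch_size=4, shuffle=True, seed=None):
    """Return one DataLoader per entry in `lengths`, each bounded in size.

    `lengths` may be None (one loader over the full dataset), a single int,
    or a sequence of ints (e.g. [length_train, length_valid]).
    """
    if lengths is None:
        return [DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)]
    if isinstance(lengths, int):
        lengths = [lengths]
    # Permute all indices once, then carve out disjoint index ranges so that
    # e.g. train and validation subsets never overlap.
    gen = torch.Generator()
    if seed is not None:
        gen.manual_seed(seed)
    perm = torch.randperm(len(dataset), generator=gen).tolist()
    loaders, start = [], 0
    for length in lengths:
        subset = Subset(dataset, perm[start:start + length])
        loaders.append(DataLoader(subset, batch_size=batch_size, shuffle=shuffle))
        start += length
    return loaders

# Toy dataset standing in for the TPC dataset API
data = TensorDataset(torch.arange(100).float())
train_loader, valid_loader = get_dataloaders(data, lengths=[60, 20], seed=0)
```

Because the subsets are slices of one permutation, the returned loaders are guaranteed to draw on disjoint examples.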

Thank you so much!

pphuangyi commented 3 years ago

Hi Ray, Here are the updates:

  1. dataset_utils.py: Modified the filenames;
  2. tpc_dataloader.py: Handled seed=None, added assertions for the existence of manifest files, and corrected a few other bugs;
  3. test_tpc_dataloader.py: Created a test module that tests get_tpc_dataloaders in tpc_dataloader.py, and added a README file.
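The manifest-existence assertions mentioned in item 2 might look something like the sketch below. This is only an illustration of the idea; `check_manifest_files`, the split names, and the `{split}.txt` naming scheme are all hypothetical, not the actual tpc_dataloader.py code.

```python
import tempfile
from pathlib import Path

def check_manifest_files(manifest_dir, splits=('train', 'valid', 'test')):
    """Assert that a manifest file exists for each split before loading.

    Fails early with a clear message rather than erroring mid-load.
    (Hypothetical helper; file naming scheme is an assumption.)
    """
    manifest_dir = Path(manifest_dir)
    missing = [s for s in splits if not (manifest_dir / f'{s}.txt').exists()]
    assert not missing, f'missing manifest files for splits: {missing}'

# Demo: create dummy manifests in a temporary directory and check them
tmp = tempfile.mkdtemp()
for split in ('train', 'valid'):
    (Path(tmp) / f'{split}.txt').touch()
check_manifest_files(tmp, splits=('train', 'valid'))  # passes silently
```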

Please let me know how I can improve it.

I will be working on the other issues in the meantime.

Thank you!

pphuangyi commented 3 years ago

Hi Ray,

I updated utils/tpc_dataloader.py and removed dataset_utils.py. I used numpy's shuffle together with torch.utils.data.Subset to subsample a dataset, and torch.utils.data.random_split to split a dataset, as you suggested earlier today.
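In outline, the numpy-shuffle + Subset subsampling followed by random_split could look like this. It is a minimal sketch, assuming a generic map-style dataset; `subsample_dataset` and the sizes are illustrative, not the actual utils/tpc_dataloader.py code.

```python
import numpy as np
import torch
from torch.utils.data import Subset, TensorDataset, random_split

def subsample_dataset(dataset, size, seed=None):
    """Keep `size` randomly chosen examples (generic: nothing TPC-specific)."""
    rng = np.random.default_rng(seed)
    indices = np.arange(len(dataset))
    rng.shuffle(indices)  # in-place numpy shuffle of the index array
    return Subset(dataset, indices[:size].tolist())

# Toy dataset standing in for the TPC dataset
dataset = TensorDataset(torch.arange(50).float())
small = subsample_dataset(dataset, 30, seed=0)
# Split the subsampled dataset into train/validation parts
train, valid = random_split(small, [20, 10],
                            generator=torch.Generator().manual_seed(0))
```

Seeding both the numpy generator and the torch generator keeps the subsample and the split reproducible across runs.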

Here is one thing I specifically need your input on: the first function in utils/tpc_dataloader.py, which does the dataset subsampling, is generic (nothing in it is specific to the TPC dataset). Please let me know whether I should move it into another file or whether it is okay to keep it there.

Please let me know what you think!