Open shntnu opened 6 years ago
We'd want to optionally specify a "reference" Image.csv that has all the columns.
One way to do this would be to specify the folder name in seed
def seed(source, target, config_file, skip_image_prefix=True, reference_set=None):
And then add reference_set to directories appropriately (i.e. check whether the reference set exists, and then move it to the front):
directories = sorted(list(cytominer_database.utils.find_directories(source)))
This would reduce the complexity a bit, because the only thing we need to worry about is CSVs with fewer columns than the image CSV in reference_set.
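The reordering step described above could look roughly like the following. This is a minimal sketch, not code from cytominer_database: order_directories is a hypothetical helper, it assumes reference_set is a folder name (not a full path), and raising an error when the folder is missing is one possible choice among several.

```python
import os


def order_directories(directories, reference_set=None):
    """Sort directories and, if given, move the reference folder to the front.

    Hypothetical helper for illustration; not part of cytominer_database.
    """
    ordered = sorted(directories)
    if reference_set is None:
        return ordered

    # Match on the folder name, since reference_set is assumed to be a name.
    matches = [
        d for d in ordered
        if os.path.basename(os.path.normpath(d)) == reference_set
    ]
    if not matches:
        # Assumption: a missing reference folder is treated as an error.
        raise ValueError("reference set not found: {}".format(reference_set))

    reference = matches[0]
    return [reference] + [d for d in ordered if d != reference]
```

Inside seed, the result of find_directories(source) would be passed through such a helper before the ingestion loop, so the reference Image.csv is the first one read.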
@bethac07 proposed an alternative: randomly sample n image CSVs and pick the one with the most columns as the reference.
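A rough sketch of that sampling alternative, using only the standard library. The names count_columns and pick_reference_csv are hypothetical, and the sketch assumes the first row of each image CSV is the header.

```python
import csv
import random


def count_columns(path):
    """Return the number of columns in the header row of a CSV file."""
    with open(path, newline="") as f:
        return len(next(csv.reader(f), []))


def pick_reference_csv(image_csv_paths, n, seed=None):
    """Sample up to n image CSVs and return the one with the most columns.

    Hypothetical helper for illustration; not part of cytominer_database.
    """
    if not image_csv_paths:
        raise ValueError("no image CSVs to sample from")
    rng = random.Random(seed)
    sample = rng.sample(image_csv_paths, min(n, len(image_csv_paths)))
    return max(sample, key=count_columns)
```

Note that sampling gives no guarantee: if the CSV with the full set of columns is not in the sample, the chosen reference will still be missing columns.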
This was addressed in the latest merged PR, "Parquet_integration #122".
Choose the --parquet option; the added functionality (determining a reference schema for the table columns, opening a writer and converting all subsequent files to that reference schema) solves the issue.
The --sqlite option does not use the additional functionality (yet).
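To illustrate the idea of converting each file to a reference schema, here is a minimal standard-library sketch. It is not the implementation from the PR (which writes Parquet); align_rows is a hypothetical helper that only shows how rows from a CSV with missing columns can be reshaped to the reference column list.

```python
import csv


def align_rows(path, reference_columns):
    """Yield each CSV row as a list ordered like reference_columns.

    Columns absent from the file are filled with None; columns that are
    not in reference_columns are dropped. Hypothetical helper, not the
    code used by the --parquet option.
    """
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield [row.get(column) for column in reference_columns]
```

The None values produced here are exactly what a NOT NULL constraint rejects, so a target table would need nullable columns for this approach to work.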
Some Image.csv files may have a few missing columns. Because odo enforces a NOT NULL constraint on all columns, ingesting them throws an error.