cytomining / cytominer-database

[DEPRECATED] A package for storing morphological profiling data.

Ingest fails when columns don't align across sets #100

Open shntnu opened 6 years ago

shntnu commented 6 years ago

Some image.csv files may be missing a few columns. Because odo enforces a NOT NULL constraint on all columns, ingesting such a file throws an error.
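To illustrate the failure mode, here is a minimal sketch using Python's built-in sqlite3 module rather than odo itself. The table and column names (image, ImageNumber, Metadata_Well) are made up for the example; the point is only that a row lacking a value for a NOT NULL column is rejected.

```python
import sqlite3


def missing_column_is_rejected():
    """Return True if inserting a row without a NOT NULL column fails."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: every column is declared NOT NULL.
        conn.execute(
            "CREATE TABLE image ("
            "ImageNumber INTEGER NOT NULL, "
            "Metadata_Well TEXT NOT NULL)"
        )
        # A complete row is accepted.
        conn.execute(
            "INSERT INTO image (ImageNumber, Metadata_Well) VALUES (1, 'A01')"
        )
        try:
            # A row from a CSV that lacks Metadata_Well violates the constraint.
            conn.execute("INSERT INTO image (ImageNumber) VALUES (2)")
        except sqlite3.IntegrityError:
            return True
        return False
    finally:
        conn.close()
```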

shntnu commented 6 years ago

We'd want to optionally specify a "reference" Image.csv that has all the columns.

One way to do this would be to specify the folder name in seed:

def seed(source, target, config_file, skip_image_prefix=True, reference_set=None):

And then place reference_set at the front of directories (i.e. check whether the reference set exists, and if so, move it to the front):

directories = sorted(list(cytominer_database.utils.find_directories(source)))

This would reduce the complexity a bit, because the only thing we need to worry about is CSVs with fewer columns than the image CSV in reference_set.
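A rough sketch of the reordering step, assuming reference_set is the name of a folder under source. The helper order_directories is hypothetical and not part of cytominer_database; it takes the directory list as a plain argument so it runs on its own.

```python
import os


def order_directories(directories, reference_set=None):
    """Sort directories, moving the reference set (if given) to the front.

    `reference_set` is assumed to be a folder name, compared against the
    last path component of each directory.
    """
    ordered = sorted(directories)
    if reference_set is None:
        return ordered

    matches = [
        d for d in ordered
        if os.path.basename(os.path.normpath(d)) == reference_set
    ]
    if not matches:
        raise ValueError(f"reference set not found: {reference_set!r}")

    reference = matches[0]
    # Keep the sorted order for everything except the reference directory.
    return [reference] + [d for d in ordered if d != reference]
```

Inside seed, the result of find_directories(source) could be passed through such a helper before the ingest loop, so the table is created from the widest Image.csv first.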

shntnu commented 6 years ago

@bethac07 proposed an alternative: randomly sample n image CSVs and pick the one with the most columns as the reference.
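The sampling idea could look roughly like this. The function names, the seed parameter, and the choice to count columns from the header row only are assumptions made for the sketch, not anything that exists in the package.

```python
import csv
import random


def count_columns(csv_path):
    """Count the columns in a CSV by reading only its header row."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return len(header)


def pick_reference_csv(csv_paths, n, seed=None):
    """Sample up to n CSVs at random and return the one with the most columns."""
    paths = list(csv_paths)
    if not paths:
        raise ValueError("no CSV files to sample from")
    rng = random.Random(seed)
    sample = rng.sample(paths, min(n, len(paths)))
    return max(sample, key=count_columns)
```

One caveat with this approach: if n is smaller than the number of CSVs, the sample can miss the file with the full set of columns, so the chosen reference is only the widest among those sampled.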

diskontinuum commented 4 years ago

This was addressed in the latest merged PR, "Parquet_integration #122".
Choose the --parquet option; the added functionality (determining a reference schema for the table columns, opening a writer, and converting all subsequent files to that reference schema) solves the issue. The --sqlite option does not use the additional functionality (yet).
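As a rough illustration of what "converting files to a reference schema" means, here is a standard-library sketch that aligns CSV rows to a reference column list. It is not the code from the PR and does not use Parquet; align_to_reference and fill_value are hypothetical names.

```python
import csv
import io


def align_to_reference(csv_text, reference_columns, fill_value=None):
    """Return rows as dicts keyed by reference_columns, in that order.

    Columns missing from the input are filled with fill_value; columns
    not present in the reference are dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {column: row.get(column, fill_value) for column in reference_columns}
        for row in reader
    ]
```

For example, a file with only columns a and b, aligned to the reference ["a", "b", "c"], yields rows where c holds the fill value, so every file ends up with the same set of columns.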