comses / miracle

Repeatable data analysis workflows for computational models

Split Metadata Extraction into Two Pieces #28

Closed cpritcha closed 8 years ago

cpritcha commented 8 years ago

• Corrected the git submodule to point to the correct dependency
• Updated docker-compose with the new paths
• Removed the old test project structure skeleton
• Added basic metadata upload support

Still to do:

alee commented 8 years ago

I'm still not sure I understand the need for the ProjectPath model. Internally, after the user has confirmed the appropriate partitioning of datasets into DataTableGroups, we could store all the data files associated with a given DataTableGroup in a single folder and expose them as needed to DeployR / Radiant.

cpritcha commented 8 years ago

The Case for a ProjectPath Table

The Relationship between Files and DataTables Is Many to Many

A shapefile consists of a .dbf, a .shp, a .prj and potentially some other files (such as a .shx). This means that one DataTable has many paths. Some raster formats also consist of many files.

A SQLite database consists of a single file but contains many tables, so in the metadata database that single path would correspond to multiple DataTables and DataTableGroups.

Since we are only dealing with CSV files at the moment, the many-to-many case could be handled later.
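To make the many-to-many point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (project_path, data_table, data_table_path) and the sample file names are assumptions made for illustration, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE project_path (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE data_table (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE data_table_path (
    data_table_id INTEGER NOT NULL REFERENCES data_table(id),
    project_path_id INTEGER NOT NULL REFERENCES project_path(id),
    PRIMARY KEY (data_table_id, project_path_id)
);
"""


def build_example():
    """Create an in-memory database holding a shapefile and a SQLite file."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO project_path (id, path) VALUES (?, ?)",
        [(1, "roads.shp"), (2, "roads.dbf"), (3, "roads.prj"), (4, "survey.sqlite")],
    )
    conn.executemany(
        "INSERT INTO data_table (id, name) VALUES (?, ?)",
        [(1, "roads"), (2, "households"), (3, "parcels")],
    )
    # One DataTable with three paths; one path shared by two DataTables.
    conn.executemany(
        "INSERT INTO data_table_path (data_table_id, project_path_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (1, 3), (2, 4), (3, 4)],
    )
    return conn


def paths_for_table(conn, name):
    """Return every file path that makes up the named DataTable."""
    rows = conn.execute(
        """SELECT p.path FROM project_path p
           JOIN data_table_path l ON l.project_path_id = p.id
           JOIN data_table t ON t.id = l.data_table_id
           WHERE t.name = ? ORDER BY p.path""",
        (name,),
    )
    return [row[0] for row in rows]


def tables_for_path(conn, path):
    """Return every DataTable stored in the given file path."""
    rows = conn.execute(
        """SELECT t.name FROM data_table t
           JOIN data_table_path l ON l.data_table_id = t.id
           JOIN project_path p ON p.id = l.project_path_id
           WHERE p.path = ? ORDER BY t.name""",
        (path,),
    )
    return [row[0] for row in rows]
```

With a link table like this, the CSV-only case is simply the situation where every DataTable has exactly one row in data_table_path.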

Makes Queries for Metadata Entry easier

When a user enters metadata, they need to know which metadata they still have to enter. With a paths table (one that lists all the files in the project) this is relatively easy: we can select all the paths in the paths table that are not referenced by a DataTable or Analysis. It is also easy to mark paths as ignored, so the user is not pestered to complete metadata for a file path that should not have any.
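A rough sketch of that query, again with sqlite3. The ignored flag and the data_table_path / analysis_path link tables are hypothetical names chosen for the example; the real models may look different.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory project with hypothetical tables and paths."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE project_path (
            id INTEGER PRIMARY KEY,
            path TEXT UNIQUE NOT NULL,
            ignored INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE data_table_path (project_path_id INTEGER NOT NULL);
        CREATE TABLE analysis_path (project_path_id INTEGER NOT NULL);
        """
    )
    conn.executemany(
        "INSERT INTO project_path (id, path, ignored) VALUES (?, ?, ?)",
        [
            (1, "data/a.csv", 0),   # already attached to a DataTable
            (2, "src/run.R", 0),    # already attached to an Analysis
            (3, "data/b.csv", 0),   # no metadata yet
            (4, "README.md", 1),    # explicitly ignored
            (5, "notes.txt", 0),    # no metadata yet
        ],
    )
    conn.execute("INSERT INTO data_table_path (project_path_id) VALUES (1)")
    conn.execute("INSERT INTO analysis_path (project_path_id) VALUES (2)")
    return conn


def pending_paths(conn):
    """Paths that are neither ignored nor referenced by a DataTable or Analysis."""
    rows = conn.execute(
        """
        SELECT p.path FROM project_path p
        WHERE p.ignored = 0
          AND p.id NOT IN (SELECT project_path_id FROM data_table_path)
          AND p.id NOT IN (SELECT project_path_id FROM analysis_path)
        ORDER BY p.path
        """
    )
    return [row[0] for row in rows]
```

Calling pending_paths(demo_connection()) returns only the two files that still need attention, data/b.csv and notes.txt.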

If a user decides to split a DataTableGroup into pieces, no special functionality is needed. Just create a DataTableGroup without any DataTables, move some DataTables into it, and either enter the metadata manually or extract it from one of the DataTables. Splitting a DataTable needs no special functionality either: just create a new DataTable and move any relevant ProjectPaths to it.
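As an illustration only, the two split operations could look like this with plain in-memory objects. The DataTable and DataTableGroup classes below are simplified stand-ins, not the actual models, and split_group / split_table are hypothetical helper names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataTable:
    """Simplified stand-in: a named table backed by one or more project paths."""
    name: str
    paths: List[str] = field(default_factory=list)


@dataclass
class DataTableGroup:
    """Simplified stand-in: a named group of DataTables sharing metadata."""
    name: str
    tables: List[DataTable] = field(default_factory=list)


def split_group(group, new_name, table_names):
    """Create an empty group, then move the named DataTables into it."""
    new_group = DataTableGroup(new_name)
    new_group.tables = [t for t in group.tables if t.name in table_names]
    group.tables = [t for t in group.tables if t.name not in table_names]
    return new_group


def split_table(table, new_name, paths):
    """Create an empty DataTable, then move the given paths to it."""
    new_table = DataTable(new_name)
    new_table.paths = [p for p in table.paths if p in paths]
    table.paths = [p for p in table.paths if p not in paths]
    return new_table
```

Both helpers are just "create, then move", which is the point: nothing beyond ordinary create and update operations is required.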

Without a paths table I see two possibilities.

  1. We could create a metadata skeleton for all paths and prevent the deletion of records in the project.
    • To ensure that users have approved of the job the metadata extractor has done we can set all the metadata initially to the pending status.
    • If someone wanted to split up a DataTableGroup (because the metadata extractor did not group the files correctly), this would involve creating a new DataTableGroup and changing the DataTableGroup foreign key on some of the DataTables to point to the new DataTableGroup. Splitting DataTables is a non-issue because it is not possible for a DataTable to have many paths (which will be limiting if we want to support shapefiles or any format where multiple files form a unit).
  2. We could allow people to delete their metadata
    • Queries for which files the user still needs to complete metadata for could be computed on demand (return the set of project filesystem paths that do not appear among the paths in the DataTable and Analysis tables). This still does not solve the issue of whether or not a particular path should be ignored, but an ignored file list could be kept in a separate table.
    • Splitting a DataTableGroup or a DataTable is identical to the approach taken when using a ProjectPath table (although the diff between the filesystem paths and the database paths is done on demand).
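
For comparison, the on-demand diff in option 2 could be sketched as a set difference. The function names and the idea of passing the ignored list in as an argument are assumptions made for this example.

```python
import os


def list_project_files(root):
    """Walk the project directory and return paths relative to its root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # Normalise separators so paths compare equal across platforms.
            found.add(os.path.relpath(full_path, root).replace(os.sep, "/"))
    return found


def paths_needing_metadata(filesystem_paths, documented_paths, ignored_paths=()):
    """Files on disk that have no metadata record and are not ignored."""
    remaining = set(filesystem_paths) - set(documented_paths) - set(ignored_paths)
    return sorted(remaining)
```

Here documented_paths would be the union of the paths stored on the DataTable and Analysis records, and ignored_paths would come from the separate ignore table mentioned above.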

warmdev commented 8 years ago

If ProjectPath is used to store all paths in the project, I think it is needed. We need a way to identify the DataTableGroup for each individual file (and thus identify the proper metadata), and it may not be feasible to group the files together into the same folder (because of file name collisions, for example, if they are not pre-grouped by users).

alee commented 8 years ago

Ok, you both make a good case for keeping track of the actual files belonging to a DataTableGroup. Let's discuss some potential refactoring of this over the call this morning.