comses / miracle

Repeatable data analysis workflows for computational models

uploaded analyses structure #14

Closed: alee closed this issue 8 years ago

alee commented 9 years ago

Design a filesystem organization scheme for uploaded analyses. Here's a first cut:

All uploaded resources must belong to a Project and are assumed to be placed in settings.FILESTORE_LOCATION/projects/<project_id>/ as the base root (referred to as PROJECT_ROOT from here on out).
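
A minimal sketch (in Python, assuming the Django settings above; the helper name is made up for illustration) of how PROJECT_ROOT could be derived:

# Hypothetical helper for building PROJECT_ROOT from the Django settings
# described above; only FILESTORE_LOCATION and the projects/<project_id>
# layout come from this proposal.
import os

from django.conf import settings


def project_root(project_id):
    """Return the base directory for all resources uploaded to a Project."""
    return os.path.join(settings.FILESTORE_LOCATION, 'projects', str(project_id))

# e.g. project_root(42) -> '<FILESTORE_LOCATION>/projects/42'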

Any feedback / modifications welcome. Once this is finalized we should put it up on the wiki.

alee commented 9 years ago

@warmdev please feel free to chime in here too. Thanks!

cpritcha commented 9 years ago

I was thinking that projects would not be part of the file system. Instead, projects would exist only in the database as a way of grouping analyses, and only analyses would be stored in the filesystem (maybe what you are calling a project is what I think of as an analysis?).

I think we should change the first point about datatables being one-to-one with files. In the LUXE example, many of the output CSV files should share the same metadata. To support that, the relationship to files needs to be one-to-many (we could do this with another table of datatable ids and filenames; see the sketch below). Datasets in the model would then refer to groups of datatables (that do not share metadata).
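
A hypothetical Django sketch of that grouping table (all model and field names here are illustrative, not the actual miracle models):

# Sketch of the proposed one-to-many relationship: one DataTable's metadata
# applies to many files, tracked in a separate table of (datatable id,
# filename) rows.
from django.db import models


class DataTable(models.Model):
    name = models.CharField(max_length=255)
    # shared column-level metadata for every file listed below would live here


class DataTableFile(models.Model):
    datatable = models.ForeignKey(DataTable, on_delete=models.CASCADE, related_name='files')
    filename = models.CharField(max_length=512)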

The file structure looks good to me.

alee commented 9 years ago

Is an Analysis a collection of Datasets and Scripts?

I was conceiving of the metadata standard being applied to Dataset instead of DataTable. Each DataTable is an instance of a Dataset, where the schema-level metadata actually lives on the Dataset. I'm not sure if that's sensible given our requirements, though; we might need to redesign the back-end models.

cpritcha commented 9 years ago

An Analysis is a collection of Datasets and Scripts.

To compare what we have currently (with DataTable, Dataset) I think of a Dataset as a limited form of a container in the metadata standard (it contains a collection of DataTables).

I also think of a Dataset as a way of tagging a DataTable (this table is part of this spreadsheet, or this table's metadata is the same for all these files because they contain the same columns). A Dataset should be optional, and a DataTable should be connectable directly to an Analysis instead of through a Dataset.

alee commented 9 years ago

Just thinking out loud, but wouldn't it be better for consistency (and to avoid special-casing logic where we have to check every DataTable to see whether it has a Dataset or is standalone) to tie Datasets directly to an Analysis instead?

The Dataset might be a single DataTable, or it might be multiple DataTables, and Datasets carry the schema information, e.g., all the columns, variable names, etc. DataTables are then primarily a pointer to some location in the filesystem, and the schema + metadata of a DataTable would always be looked up on its Dataset. We then maintain the invariant that all DataTables within a Dataset share the same schema / structure.
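
A rough Django sketch of that arrangement (field names and types are assumptions, not the actual miracle models): the Dataset owns the schema, the DataTable is just a pointer into the filesystem.

from django.db import models


class Analysis(models.Model):
    name = models.CharField(max_length=255)


class Dataset(models.Model):
    analysis = models.ForeignKey(Analysis, on_delete=models.CASCADE, related_name='datasets')
    # schema-level metadata lives here: column names, variable types, units, ...
    schema = models.TextField(help_text='JSON description of the shared schema')


class DataTable(models.Model):
    dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE, related_name='tables')
    # pointer to a file under the project root; schema is always looked up on the Dataset
    path = models.CharField(max_length=512)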

This may be a little unintuitive: an Excel file with N sheets that all have different structures would be interpreted by our system as N Datasets with 1 DataTable each. But I think it will simplify our backend logic considerably to be able to support both a large single file with a fixed schema and a multitude of files that share the same schema. Both of those would be Datasets, but the former is a singleton set and the latter is a composite set.

This does raise the issue that if we treat Excel files this way, for instance, we would need to explode their sheets out into individual CSV files. But this may be a valuable process anyway for archival reasons, since plaintext CSV is a more durable format than .xls.
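
A short sketch of that explosion step (assuming pandas is available; function name is illustrative), writing one CSV per sheet so each sheet can become a DataTable in its own single-table Dataset:

import os

import pandas as pd


def explode_workbook(xls_path, output_dir):
    """Write each sheet of the workbook to <output_dir>/<sheet>.csv."""
    os.makedirs(output_dir, exist_ok=True)
    sheets = pd.read_excel(xls_path, sheet_name=None)  # dict of sheet name -> DataFrame
    for sheet_name, frame in sheets.items():
        frame.to_csv(os.path.join(output_dir, '%s.csv' % sheet_name), index=False)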

cpritcha commented 9 years ago

That sounds like a great way to arrange the metadata. Let's do it.

To approach the problem in the last paragraph, we could have format-specific JSON for each Dataset. If the file were an Excel file, the Dataset-level metadata JSON would record which sheet the metadata is about. This would mean that all DataTables that are instances of a Dataset have to share the same format (they are all sheets of particular Excel files, or all the LUXE output files that belong to a particular group are in the same file format).
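
A hypothetical example of what that per-format JSON might look like (the keys and values are made up for illustration only):

# Format-specific Dataset properties; for an Excel-backed Dataset the JSON
# records which sheet the metadata describes, so every DataTable in the
# Dataset must share that format.
excel_dataset_properties = {
    "format": "xlsx",
    "sheet": "land_use_summary",   # made-up sheet name
}

csv_dataset_properties = {
    "format": "csv",
    "delimiter": ",",
}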

warmdev commented 9 years ago

I agree with @alee. In fact, the dataset idea is exactly the same as container and containerType in my LUXE database example on Trello (and my understanding of Gary's original schema). The word dataset might be confusing in my opinion. Maybe dataTableGroup.

warmdev commented 9 years ago

Local file structure

Before uploading a project, the project files should be organized in the following manner and tested locally.

Assume that your project name (on the miracle system) is luxe. All paths below are relative to the root folder of your project, which is the root of your project working directory in R or RStudio.

Script inputs should be declared with deployrInput so that they run locally with defaults and can also receive values through DeployR, e.g.:

deployrInput('{ "name": "age", "label": "Age", "render": "integer", "default": 6 } ')

PROJECT_ROOT
│   script1.R
│   script2.R
│
├───data
│   └───luxe
│           data.csv
│
├───docs
│       readme.txt
│
├───luxe
│   ├───luxe_demo_app
│   │       server.R
│   │       ui.R
│   │
│   └───rmarkdown
│           paper.rmd
│
└───output
    └───luxe
            luxe_output.txt

Server file structure

Assume that all data will be stored at /home/miracle/data, all scripts/apps/rmarkdown files will be stored at /home/miracle/apps, and a writable workspace is at /home/miracle/output (this folder can also be containerized).

warmdev commented 9 years ago

A potential solution to the problem above: instead of using the project name in the path, use a hash value. The project admin will know the hash value but nobody else will. This does not prevent a resourceful user from finding out the folder names, though.
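
One possible way to generate such a hashed directory name (a minimal sketch; the salting scheme and truncation are assumptions, and as noted above this only obscures the path rather than controlling access):

import hashlib


def hashed_project_dirname(project_slug, secret_salt):
    # derive a stable, hard-to-guess folder name from the slug and a secret salt
    return hashlib.sha256((secret_salt + project_slug).encode('utf-8')).hexdigest()[:16]

# e.g. /home/miracle/data/<hashed_project_dirname('luxe', salt)>/data.csv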

cpritcha commented 9 years ago

Great. I'll set up the paths for a development machine.

The local file structure for the Analysis (what you call a Project) looks good. The luxe subfolders seem redundant when we know we are in the luxe example. I would rather have all the scripts in a src folder. How about:

luxe
├───apps
│   ├───demo_app
│   │       server.R
│   │       ui.R
│   │
│   └───rmarkdown
│           paper.rmd
├───data
│       data.csv
├───docs
│       readme.txt
├───output
│       luxe_output.txt
└───src
       script1.R
       script2.R

Will hierarchy create a problem? Consider the example below. It seems to me that hierarchies would not be allowed, because flattening them would require renaming files, which would break the deployr paths.

luxe
├───apps
│   ├───demo_app
│   │       server.R
│   │       ui.R
│   │
│   └───rmarkdown
│           paper.rmd
├───data
│   ├───census
│   │        data.csv
│   └───landuse
│            data.sqlite
├───docs
│       readme.txt
├───output
│       luxe_output.txt
└───src
       script1.R
       script2.R

warmdev commented 9 years ago

deployrExternal can contain path/to/data. Putting R scripts in src should be fine, since deployrExternal is resolved relative to the working directory of the R session on local machines.

cpritcha commented 9 years ago

Great. @warmdev could you post your LUXE DeployR example on Trello in the example datasets card?

warmdev commented 9 years ago

@cpritcha About the luxe subfolders: they are necessary and should not be removed. They are there so that we can mount data and RMarkdown/Shiny apps separately while still having the correct relative paths. In my example, the relative path to be used in RMarkdown would be ../../data/luxe/data.csv, but in your example it would be ../../data/data.csv, which is incorrect: the project name is missing and the code will not work on the server.

warmdev commented 9 years ago

@cpritcha The DeployR-compliant luxe example is now posted on Trello.

cpritcha commented 9 years ago

Great. I'll take a look.

cpritcha commented 9 years ago

Analysis Structure Wiki

Analysis Structure: Client Side

An analysis can have the following special folders:

apps

The apps folder contains interactive applications for the Analysis. These applications are commonly Shiny apps or RMarkdown documents.

data

The data folder contains all the input data pertaining to the Analysis.

docs

The docs folder contains all documentation for the Analysis.

output

The output folder contains all outputs from running a script or application. This is the only folder that should contain files that change.

src

General code for the Analysis goes here, including libraries. Applications and source for the Analysis must be self-contained entities (an application cannot use code from the src folder for now).

Analysis Structure: Server Side

When an analysis is uploaded, it is broken up into its constituent folders.
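
A rough sketch of that split (helper name, slug handling, and the /home/miracle target paths are assumptions based on the server layout discussed earlier in this thread): each special folder of the uploaded analysis is relocated under the matching top-level server directory, keyed by the project slug.

import os
import shutil

SPECIAL_FOLDERS = ('apps', 'data', 'docs', 'output', 'src')


def split_analysis(extracted_root, slug, server_root='/home/miracle'):
    """Move <extracted_root>/<folder>/<slug> to <server_root>/<folder>/<slug>."""
    for folder in SPECIAL_FOLDERS:
        source = os.path.join(extracted_root, folder, slug)
        if not os.path.isdir(source):
            continue
        shutil.move(source, os.path.join(server_root, folder, slug))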

Example: The LUXE Analysis

Suppose the LUXE analysis has the following structure:

luxe
├───data
│   └───luxe
│           data.csv
│
├───docs
│   └───luxe
│           readme.txt
│
├───apps
│   └───luxe
│       ├───demo_app
│       │       server.R
│       │       ui.R
│       │
│       └───rmarkdown
│               paper.rmd
│
├───output
│   └───luxe
│           output.txt
│
└───src
    └───luxe
            script1.R
            script2.R

Then on the server the analysis is broken into its constituent folders, matching the structure below:

.
└── miracle
    ├── apps
    │   └── luxe
    │       ├── demo_app
    │       │   ├── server.R
    │       │   └── ui.R
    │       └── rmarkdown
    │           └── paper.rmd
    ├── data
    │   └── luxe
    │       └── data.csv
    ├── docs
    │   └── luxe
    │       └── readme.txt
    ├── output
    │   └── luxe
    │       └── output.txt
    └── src
        └── luxe
            ├── script1.R
            └── script2.R

warmdev commented 9 years ago

Clarification of DeployR file and graphics output behaviour: graphics generated by an executed script come back in the API response as downloadable artifact URLs, for example:

"url": "http://localhost:7400/deployr/r/project/execute/result/download/PROJECT-5c5a8d8b-38b0-4dca-b46e-5da5dfbb0b04/EXEC-c6201b9b-cc29-4758-949d-cd682e6cb073/unnamedplot001.png"

In conclusion, for output graphics to be returned via the API response, the code should either output to a display device or save the graphic to a file with no path.

alee commented 8 years ago

I think we need to discuss the proposed filesystem structure in more depth, @cpritcha and @warmdev. I've cross-posted Calvin's formatted comment over in https://github.com/comses/miracle/wiki/Filesystem-Structure and continued refinement should happen there.

Given the miracle data store defined at settings.MIRACLE_DATA_DIRECTORY (let's say /home/miracle, to stick with the example), why are you favoring splitting the different pieces of the project across multiple named project directories? This becomes a greater burden to manage programmatically. It seems like it'd be much simpler to just have a project root directory

MIRACLE_DATA_DIRECTORY/<project-slug>/

to host the various output, src, data, apps, and docs subdirectories. Eventually we may want to distinguish between the uploaded original content and any derivative content that we create from it (to help us distinguish between the "submission information package" and the "archival information package" in OAIS terms). That could just be archival and submission subdirectories below the project root.

If the primary goal of this directory layout is to allow the tools we're integrating with to access them properly, I need some clear direction & documentation as to why that is necessary.

warmdev commented 8 years ago

The main reason is that the data folder will need to be mounted to the Docker containers. If we have data from all projects in one central data folder, then we can mount this central folder to the containers. If we have data folders inside project_slug folders, then we will need to mount the entire MIRACLE_DATA_DIRECTORY folder to the containers (note that Docker does not support dynamic folder mounting once the container is created).

This is not really a big problem by itself. But for DeployR to work properly, data input in scripts will need to be specified by deployrExternal("path/to/data") (reason at the end). When the script is run locally, deployrExternal looks in the current working directory to find the relative path to the data; when the script is run on the server, deployrExternal looks for the data in the public external directory (which is /home/deployr/deployr/7.4.1/deployr/external/data/public in the container; the path may seem excessive, but these are just default values: the first deployr is the user name, the second is the installation directory, and the third is the DeployR app name).

We need to ensure that scripts using deployrExternal work both on the user's local machine and on the server. Assuming the structure project_slug/data/data.csv and project_slug/src/script.R, with the working directory being the project root folder (the default behaviour in RStudio), the relative path to the data is data/data.csv. On the server, however, the path we need is project_slug/data/data.csv; otherwise we couldn't differentiate data from different projects.

Reason for deployrExternal: currently it is not possible to mount scripts and data directly into the DeployR container and have it run the scripts using relative paths to the data (instead of uploading the script through the API as we are doing now). It is possible to upload both scripts and data files to a DeployR repository, but they are not stored as individual files, and uploading large data files is not practical.

alee commented 8 years ago

Ah ok. I think what I would suggest in this situation is to have so-called "derivative" directories. We should manage our own internal representation of our collected datasets & scripts for our own ease of use. We can generate a "derivative" representation of the collected datasets & scripts for use by DeployR, Radiant, etc., in a mirrored filesystem once they are ready to go.

In that sense we shouldn't let the way the deployr and radiant docker images need to access their inputs dictate our internal architecture; that should be designed to facilitate easy management and dissemination. If our internal architecture and filesystem structure are easy to use, reason about, and modify, we can then generate the appropriate filesystem structures for whatever new technologies or frameworks we decide to adopt or integrate.
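
A sketch of one way the "derivative" idea could work (directory names, the symlink approach, and the /home/miracle target layout are assumptions based on the earlier example): keep a single canonical project root and generate the tool-facing layout from it.

import os

SPECIAL_FOLDERS = ('apps', 'data', 'docs', 'output', 'src')


def build_derivative_tree(canonical_root, slug, derivative_root='/home/miracle'):
    """Mirror MIRACLE_DATA_DIRECTORY/<slug>/<folder> into <derivative_root>/<folder>/<slug>."""
    for folder in SPECIAL_FOLDERS:
        source = os.path.join(canonical_root, slug, folder)
        if not os.path.isdir(source):
            continue
        target = os.path.join(derivative_root, folder, slug)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        if not os.path.islink(target):
            os.symlink(source, target)

Symlinks only help if the canonical store is also visible inside the containers; otherwise this step would copy the files (or bind-mount each target) instead.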

cpritcha commented 8 years ago

I think a temporary folder for uploaded project archives would be a good idea. When a project archive gets uploaded we have to:

  1. Extract the archive to a folder
  2. Extract any metadata from the folder
  3. Show the end user the metadata and allow them to regroup DataTables into Datasets
  4. Save the users edited metadata in the database
  5. Move the special project folders to the MIRACLE_DATA_DIRECTORY
  6. Delete anything remaining in the extracted project archive folder
  7. Move the archive to the MIRACLE_DATA_DIRECTORY

Preferably the whole process would be atomic so we don't end up with a moved project but no metadata, or vice versa. The temporary folder would be an easy way to keep projects that haven't been completely added away from the ones that have, and to tell whether a project's extraction process is leaving behind remnants. A rough sketch of the workflow is below.
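
A rough sketch of those steps (the ingest function and the MIRACLE_DATA_DIRECTORY/<slug>/ layout follow the discussion above; the filesystem moves themselves are not transactional, which is exactly why the staging folder matters for detecting half-finished uploads):

import os
import shutil
import tempfile
import zipfile

from django.conf import settings
from django.db import transaction

SPECIAL_FOLDERS = ('apps', 'data', 'docs', 'output', 'src')


def ingest_project_archive(archive_path, slug):
    staging = tempfile.mkdtemp(prefix='miracle-upload-')   # temporary folder
    try:
        # 1. extract the archive to a folder
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(staging)
        # 2-4. metadata extraction, user review, and saving the edited
        #      metadata would happen here, inside the same transaction
        #      as the database writes below
        with transaction.atomic():
            project_root = os.path.join(settings.MIRACLE_DATA_DIRECTORY, slug)
            os.makedirs(project_root, exist_ok=True)
            # 5. move the special project folders into the data directory
            for folder in SPECIAL_FOLDERS:
                source = os.path.join(staging, folder)
                if os.path.isdir(source):
                    shutil.move(source, os.path.join(project_root, folder))
            # 7. keep the original archive alongside the extracted content
            shutil.copy2(archive_path, project_root)
    finally:
        # 6. delete anything remaining in the extracted archive folder
        shutil.rmtree(staging, ignore_errors=True)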

alee commented 8 years ago

Closing, see https://github.com/comses/miracle/wiki/Project-Archive-Preparation-Guidelines and https://github.com/comses/miracle/wiki/Filesystem-Structure for details.