Closed alee closed 8 years ago
@warmdev please feel free to chime in here too. Thanks!
I was thinking that projects would not be a part of the file system. Instead, projects exist only in the database as a way of grouping analyses, and only analyses are stored in the filesystem (maybe what you are calling a project is what I think of as an analysis?).
The first point, about datatables being one to one with files, I think we should change. In the LUXE example, many of the output csv files should have the same metadata. In order to do that we need the relationship to be one-to-many with the files (we could do this via another table with datatable ids and filenames). Then datasets in the model would refer to groups of datatables (that do not share metadata).
The file structure looks good to me.
Is an Analysis a collection of Datasets and Scripts?
I was conceiving of the metadata standard being applied to Dataset instead of DataTable. Each DataTable is an instance of a Dataset, where the schema-level metadata actually lives on the Dataset. I'm not sure if that's sensible given our requirements though; we might need to redesign the back-end models.
An Analysis is a collection of Datasets and Scripts.
To compare with what we have currently (DataTable, Dataset): I think of a Dataset as a limited form of a container in the metadata standard (it contains a collection of DataTables). I also think of Dataset as a way of tagging a DataTable (this table is part of this spreadsheet, or this table's metadata is the same for all these files because they contain the same columns), and Dataset should be optional (and DataTable should be connected directly to an Analysis instead of through a Dataset).
Just thinking out loud but wouldn't it be better for consistency (and to avoid optional special-casing logic where we have to check every DataTable to see if it has a Dataset or if it's singular) that we tie Datasets directly to an Analysis instead?
The Dataset might be a single DataTable, or it might be multiple DataTables, and Datasets carry the schema information, e.g., all the columns and variable names etc. DataTables then are primarily a pointer to some location in the filesystem, and the schema + metadata of the DataTable would be always looked up in the Dataset. We then maintain this invariant that all DataTables within a Dataset share the same schema / structure.
This may be a little unintuitive: an Excel file with N sheets that all have different structures would be interpreted by our system as N Datasets with 1 DataTable each. But I think it will simplify our backend logic considerably to be able to support large single files that have a fixed schema, and multitudes of files that also share the same schema. Both of those would be Datasets, but the former is a singleton set and the latter is a composite set.
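The invariant above can be sketched with plain Python dataclasses standing in for whatever back-end models we end up with (all class and field names here are illustrative, not the current models):

```python
from dataclasses import dataclass, field

@dataclass
class DataTable:
    # Primarily a pointer to a location in the filesystem; schema and
    # metadata are always looked up on the owning Dataset.
    path: str

@dataclass
class Dataset:
    # Carries the schema (columns, variable names, etc.) shared by every
    # DataTable it contains; may be a singleton or a composite set.
    name: str
    columns: list
    tables: list = field(default_factory=list)

    def add_table(self, path):
        # New tables join under the Dataset's single schema, so the
        # shared-structure invariant holds by construction.
        table = DataTable(path=path)
        self.tables.append(table)
        return table

@dataclass
class Analysis:
    # A collection of Datasets (and, elsewhere, Scripts), tied directly
    # to the Analysis rather than special-casing singular DataTables.
    name: str
    datasets: list = field(default_factory=list)

# A LUXE-style composite Dataset: many output csv files, one shared schema.
luxe = Analysis(name="luxe")
outputs = Dataset(name="luxe-outputs", columns=["year", "land_use", "value"])
outputs.add_table("output/luxe/run1.csv")
outputs.add_table("output/luxe/run2.csv")
luxe.datasets.append(outputs)
```

A singleton Dataset would just be the same structure with one table; either way, schema lookups go through the Dataset.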
This does raise the issue that if we treat Excel files this way, we would need to explode their sheets out into individual CSV files. But that may be a valuable process anyway for archival reasons, since plaintext CSV is a more durable format than .xls
That sounds like a great way to arrange the metadata. Let's do it.
To approach the problem in the last paragraph, we could have JSON for each specific Dataset format. If the file were an Excel file, then the Dataset-level metadata JSON would record which sheet the metadata is about. This would mean that all DataTables that are instances of the Dataset would have to have the same format (they are all sheets in particular Excel files, or all the LUXE output files that belong to a particular group share the same file format).
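For example, the Dataset-level metadata for an Excel-backed Dataset might carry a format-specific JSON fragment along these lines (a sketch only; all field names are hypothetical):

```python
import json

# Hypothetical Dataset-level metadata: the "format" block is specific to the
# file format, and for an Excel-backed Dataset it records which sheet the
# schema describes. Field names here are illustrative only.
dataset_metadata = {
    "name": "luxe-parameters",
    "columns": ["name", "label", "value"],
    "format": {
        "type": "xls",
        "sheet": "Parameters",  # every DataTable in this Dataset is this sheet
    },
}

serialized = json.dumps(dataset_metadata, indent=2)
```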
I agree with @alee. In fact, the dataset idea is exactly the same as container and containerType in my LUXE database example on Trello (and my understanding of Gary's original schema). The word dataset might be confusing in my opinion. Maybe dataTableGroup?
Before uploading a project, the project files should be organized in the following manner and tested locally. Assuming that your project name (on the miracle system) is luxe, all paths below are relative to the root folder of your project (which is the root of your project working directory in R or RStudio):

- Data files go in the data/luxe folder and are referenced in R scripts as deployrExternal('data/luxe/file_name').
- Packages are loaded with deployrPackage("package_name").
- Script inputs are declared with deployrInput. For example: deployrInput('{ "name": "age", "label": "Age", "render": "integer", "default": 6 }').
- RMarkdown files go in the luxe/rmarkdown folder. Reference your data in RMarkdown using relative paths.
- Shiny apps go in the luxe/app_name folder. Reference your data in Shiny apps using relative paths.
- Output files go in the output/luxe folder, are referenced in R scripts as deployrExternal('output/luxe/output_file'), and are referenced in RMarkdown/Shiny apps using relative paths.
- Documentation goes in the docs folder.

PROJECT_ROOT
│ script1.R
│ script2.R
│
├───data
│ └───luxe
│ data.csv
│
├───docs
│ readme.txt
│
├───luxe
│ ├───luxe_demo_app
│ │ server.R
│ │ ui.R
│ │
│ └───rmarkdown
│ paper.rmd
│
└───output
└───luxe
luxe_output.txt
Assuming that all data will be stored at /home/miracle/data, all scripts/apps/rmarkdown files will be stored at /home/miracle/apps, and a writable workspace is at /home/miracle/output (this folder can also be containerized):

- All data files go to /home/miracle/data (so that the path to a data file will be /home/miracle/data/luxe/file_name). The /home/miracle/data folder will be mounted to the containers: to the DeployR container as /home/deployr/deployr/7.4.1/deployr/external/data/public/data, and to the Shiny and Radiant containers as /srv/shiny-server/data.
- Scripts will be uploaded to the DeployR container via the API.
- Apps and RMarkdown files go to /home/miracle/apps (so that the path to an RMarkdown file would be /home/miracle/apps/luxe/rmarkdown/paper.rmd, and the path to a Shiny app would be /home/miracle/apps/luxe/app_name/ui.R). The /home/miracle/apps folder will be mounted to the Shiny container as /srv/shiny-server.
- /home/miracle/output will be mounted to the DeployR server as /home/deployr/7.4.1/deployr/external/data/public/output, and to the Shiny container at /srv/shiny-server/output. (NOTE this is not foolproof, as a script/app could potentially write to other projects' output folders. Please suggest ideas. Docker does not support dynamically adding data volumes at the moment, so we cannot add project-specific output folders on demand.)

A potential solution to the problem above: instead of using the project name in the path, use a hash value instead. The project admin will know the hash value but nobody else does. This does not prevent any resourceful user from finding out the folder names, though.
Great. I'll set up the paths for a development machine.
The local file structure for the Analysis (what you call a Project) looks good. The luxe subfolders seem redundant when we know we are in the luxe example. I would rather have all the scripts in a src folder. How about:
luxe
├───apps
│ ├───demo_app
│ │ server.R
│ │ ui.R
│ │
│ └───rmarkdown
│ paper.rmd
├───data
│ data.csv
├───docs
│ readme.txt
├───output
│ luxe_output.txt
└───src
script1.R
script2.R
Will hierarchies create a problem? Consider the example below. It seems to me that hierarchies would not be allowed, because flattening would necessitate renaming files, which would break the deployr paths.
luxe
├───apps
│ ├───demo_app
│ │ server.R
│ │ ui.R
│ │
│ └───rmarkdown
│ paper.rmd
├───data
│ ├───census
│ │ data.csv
│ └───landuse
│ data.sqlite
├───docs
│ readme.txt
├───output
│ luxe_output.txt
└───src
script1.R
script2.R
deployrExternal can contain path/to/data. Putting R scripts in src should be fine, since deployrExternal is relative to the working directory of R sessions on local machines.
Great. @warmdev could you post your LUXE DeployR example on Trello in example datasets card?
@cpritcha About the luxe subfolders: they are necessary and should not be removed. They are there so that we can mount data and RMarkdown/Shiny apps separately while still having the correct relative paths. In my example, the relative path to be used in RMarkdown would be ../../data/luxe/data.csv, but in your example it would be ../../data/data.csv, which is incorrect: the project name is missing and the code will not work on the server.
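To make the relative-path argument concrete, here is a quick sketch using the mount layout described earlier (posixpath stands in for how the server would resolve the reference):

```python
import posixpath

# On the Shiny container, apps are mounted at /srv/shiny-server and data at
# /srv/shiny-server/data, so paper.rmd ends up at
# /srv/shiny-server/luxe/rmarkdown/paper.rmd. Resolve its relative data
# reference from the file's directory:
rmd_dir = "/srv/shiny-server/luxe/rmarkdown"
with_slug = posixpath.normpath(posixpath.join(rmd_dir, "../../data/luxe/data.csv"))
# -> /srv/shiny-server/data/luxe/data.csv (inside the mounted data folder)

# Without the luxe subfolder inside data/, the project name disappears from
# the resolved path and would collide with other projects' data:
without_slug = posixpath.normpath(posixpath.join(rmd_dir, "../../data/data.csv"))
# -> /srv/shiny-server/data/data.csv (no project name)
```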
@cpritcha The DeployR-compliant luxe example is now posted on Trello.
Great. I'll take a look.
An analysis can have the following special folders:

apps
The apps folder contains interactive applications for the Analysis. These applications are commonly Shiny or RMarkdown.

data
The data folder contains all the input data pertaining to the Analysis.

docs
The docs folder contains all documentation for the Analysis.

output
The output folder contains all outputs from running a script or application. This is the only folder that may contain files that change.

src
General code for the Analysis goes here, including libraries. Applications and source for the Analysis must be self-contained entities (an application cannot use code from the src folder for now).
When an analysis is uploaded, it is broken up into its constituent folders.
Suppose the LUXE analysis has the following structure:
luxe
.
├───data
│ └───luxe
│ data.csv
│
├───docs
│ └───luxe
│ readme.txt
│
├───apps
│ └───luxe
│ ├───demo_app
│ │ server.R
│ │ ui.R
│ │
│ └───rmarkdown
│ paper.rmd
│
├───output
│ └───luxe
│ output.txt
└── src
└── luxe
├── script1.R
└── script2.R
Then on the server, the analysis is broken into its constituent folders, matching the structure below:
.
└── miracle
├── apps
│ └── luxe
│ ├── demo_app
│ │ ├── server.R
│ │ └── ui.R
│ └── rmarkdown
│ └── paper.rmd
├── data
│ └── luxe
│ └── data.csv
├── docs
│ └── luxe
│ └── readme.txt
├── output
│ └── luxe
│ └── output.txt
└── src
└── luxe
├── script1.R
└── script2.R
Clarification of DeployR file and graphics output behaviour:

- When plots are output to a display device in DeployR, they are automatically saved as png files, and the API response contains "url": "http://localhost:7400/deployr/r/project/execute/result/download/PROJECT-5c5a8d8b-38b0-4dca-b46e-5da5dfbb0b04/EXEC-c6201b9b-cc29-4758-949d-cd682e6cb073/unnamedplot001.png".
- When a plot is saved to a file with no path, the file is saved in the working directory (e.g. /home/deployr/deployr/7.4.1/rserve/workdir/Rserv7.4/conn2/figure1.png), and the API response also contains a URL to the file.
- When a file is saved using deployrExternal, the file is saved in the public folder using the specified path, but no URL is returned.

In conclusion, for output graphics to be returned via the API response, the code should either output to a display device or save to a file with no path.
I think we need to discuss the proposed filesystem structure in more depth @cpritcha and @warmdev I've cross-posted Calvin's formatted comment over in https://github.com/comses/miracle/wiki/Filesystem-Structure and continued refinement should happen there.
Given the miracle data store defined at settings.MIRACLE_DATA_DIRECTORY (let's say = /home/miracle to stick with the example) - why are you favoring splitting up the different pieces of the project into multiple named project directories? This becomes a greater burden to manage programmatically. It seems like it'd be much simpler to just have a project root directory MIRACLE_DATA_DIRECTORY/<project-slug>/ to host the various output, src, data, apps, and docs subdirectories. Eventually we may want to distinguish between the uploaded original content and any derivative content that we create from that uploaded original content (to help us distinguish between the "submission information package" and the "archival information package" in OAIS terms). That could just be archival and submission subdirectories below the project root.
If the primary goal of this directory layout is to allow the tools we're integrating with to access them properly, I need some clear direction & documentation as to why that is necessary.
The main reason is that the data folder will need to be mounted to the Docker containers. If we have data from all projects in one central data folder, then we can mount this central folder to the containers. If we have data folders inside project_slug folders, then we will need to mount the entire MIRACLE_DATA_DIRECTORY folder to the containers (note that Docker does not support dynamic folder mounting once the container is created).
This is not really a big problem by itself. But for DeployR to work properly, data input in scripts will need to be specified via deployrExternal("path/to/data") (reason at the end). When a script is run locally, deployrExternal looks in the current working directory to resolve the relative path to the data; when the script is run on the server, deployrExternal looks for the data in the public external directory (which is /home/deployr/deployr/7.4.1/deployr/external/data/public in the container - again the path may seem excessive, but these are just default values: the first deployr is the user name, the second is the installation directory, and the third is the deployr app name).
We need to ensure that scripts using deployrExternal work both on a user's local machine and on the server. Assuming the structure project_slug/data/data.csv and project_slug/src/script.R, with the working directory being the project root folder (the default behaviour in RStudio), the relative path to the data is data/data.csv. However, on the server, the path we need is project_slug/data/data.csv; otherwise we couldn't differentiate data from different projects.
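A rough Python stand-in for the two lookup rules just described (deployrExternal itself is an R function; this only illustrates why the project slug must appear in the path passed to it):

```python
import posixpath

# Default public external directory inside the DeployR container.
PUBLIC_EXTERNAL = "/home/deployr/deployr/7.4.1/deployr/external/data/public"

def resolve_external(path, on_server):
    # Mimic where deployrExternal('path') looks for a file: relative to the
    # R session's working directory locally, relative to the public external
    # directory on the server.
    if on_server:
        return posixpath.join(PUBLIC_EXTERNAL, path)
    return path  # resolved against the local working directory

# With the project slug baked into the data path, the same argument string
# finds the right file in both environments:
local = resolve_external("data/luxe/data.csv", on_server=False)
server = resolve_external("data/luxe/data.csv", on_server=True)
```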
Reason for deployrExternal: currently it is not possible to mount scripts and data directly to the DeployR container and have it run the scripts using relative paths to the data (instead of uploading the script through the API as we are doing now). It is possible to upload both scripts and data files to a DeployR repository, but they are not stored as individual files; plus, uploading big data is not practical.
Ah ok. I think what I would suggest in this situation is to have so-called "derivative" directories. We should manage our own internal representation of our collected datasets & scripts for our own ease of use. We can generate a "derivative" representation of the collected datasets & scripts for use by DeployR, Radiant, etc., in a mirrored filesystem once they are ready to go.
In that sense we shouldn't let the way the deployr and radiant docker images need to be able to access their inputs dictate our own internal architecture - that should be designed in such a way to facilitate easy management and dissemination. If we have an easy-to-use, reason about, and modify internal architecture and filesystem structure we can then generate the appropriate filesystem structures for whatever new technologies or frameworks we decide to adopt or integrate.
I think a temporary folder for uploaded project archives would be a good idea. When a project archive gets uploaded we have to:

- group its DataTables into Datasets
- move its files into MIRACLE_DATA_DIRECTORY

Preferably the whole process would be atomic so we don't end up with a moved project but no metadata, or vice versa. The temporary folder would be an easy way to keep projects that haven't been completely added yet away from the ones that have, and to know if a project's extraction process is leaving behind remnants.
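A minimal sketch of that flow, assuming the archive is extracted into a temporary staging directory on the same filesystem as MIRACLE_DATA_DIRECTORY so the final move can be a single atomic rename (all names are illustrative):

```python
import os
import shutil
import tempfile

# Stand-in for the real MIRACLE_DATA_DIRECTORY setting.
MIRACLE_DATA_DIRECTORY = tempfile.mkdtemp()

def ingest_project(staging_dir, slug):
    # Publish a fully-extracted project from the temporary staging area into
    # the data directory. os.rename is atomic on POSIX when source and
    # destination are on the same filesystem, so other readers never observe
    # a half-moved project.
    final_path = os.path.join(MIRACLE_DATA_DIRECTORY, slug)
    if os.path.exists(final_path):
        raise FileExistsError(final_path)
    try:
        # ... extract metadata, group DataTables into Datasets here ...
        os.rename(staging_dir, final_path)  # the atomic publish step
    except OSError:
        # Remove the remnants so a failed extraction doesn't linger.
        shutil.rmtree(staging_dir, ignore_errors=True)
        raise
    return final_path
```

Writing the database metadata and the rename would still need to be coordinated (e.g. in one transaction that only commits after the rename succeeds) for the whole process to be effectively atomic.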
Design a filesystem organization scheme for uploaded analyses. Here's a first cut:

- All uploaded resources must belong to a Project and are assumed to be placed in settings.FILESTORE_LOCATION/projects/<project_id>/ as the base root (referred to as PROJECT_ROOT from here on out).
- Datasets are placed in PROJECT_ROOT/datasets/<dataset_id>/ and can consist of a single DataTable or a collection of DataTables. There is a 1:1 correspondence between DataTables and files on the filesystem.
- Scripts are placed in PROJECT_ROOT/scripts/<script_id>/ and are singular.
- By mounting PROJECT_ROOT to the Docker instance running DeployR we'll be able to sandbox access to the scripts and just deal with things in a relative fashion, using the unique IDs that our system generated for the datasets / scripts.

Any feedback / modifications welcome. Once this is finalized we should put it up on the wiki.
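The scheme above reduces to a few trivial path helpers, sketched here with FILESTORE_LOCATION standing in for the Django setting (helper names are illustrative):

```python
import posixpath

# Stand-in for settings.FILESTORE_LOCATION.
FILESTORE_LOCATION = "/home/miracle"

def project_root(project_id):
    # Base root for everything an uploaded Project owns.
    return posixpath.join(FILESTORE_LOCATION, "projects", str(project_id))

def dataset_path(project_id, dataset_id):
    # A Dataset directory holding one or more DataTable files.
    return posixpath.join(project_root(project_id), "datasets", str(dataset_id))

def script_path(project_id, script_id):
    # Scripts are singular, one per directory.
    return posixpath.join(project_root(project_id), "scripts", str(script_id))
```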