LivingNorway / TheDataPackage

Template for data archive structure and suggestive workflow
Creative Commons Zero v1.0 Universal

Build metadata file(s) #3

Closed DrMattG closed 4 years ago

DrMattG commented 4 years ago

Build metadata file(s): maybe we should consider generating one plain-text metadata file (e.g. .md, .txt, or .rtf; I would personally prefer .md) and one meta.xml file in EML? Rationale: the .xml file is machine readable and is what goes into a DwC-A, but it is impossible for humans to read without translation. It would be cool to build the metadata directly from information in files (possibly from templates coming from DataEntryForms). That would save some time, but it could come at a later stage.
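A minimal sketch of the two-file idea, assuming the metadata starts out as a named R list. The field names, the `write_metadata_md()` helper, and the output file names are hypothetical and only for illustration. The Markdown part uses base R only; the EML part relies on the rOpenSci `EML` package and is skipped when that package is not installed.

```r
# Hypothetical metadata record; the field names are illustrative, not a fixed schema.
meta <- list(
  title    = "Example dataset",
  creator  = "Jane Doe",
  abstract = "Short description of the data."
)

# Write a human-readable Markdown summary: one heading per metadata field.
write_metadata_md <- function(meta, path) {
  lines <- unlist(lapply(names(meta), function(field) {
    c(paste("##", field), "", meta[[field]], "")
  }))
  writeLines(lines, path)
  invisible(path)
}

md_path <- write_metadata_md(meta, file.path(tempdir(), "metadata.md"))

# Optional machine-readable counterpart, written only if the EML package is available.
if (requireNamespace("EML", quietly = TRUE)) {
  person  <- list(individualName = list(givenName = "Jane", surName = "Doe"))
  eml_doc <- list(dataset = list(title = meta$title, creator = person, contact = person))
  EML::write_eml(eml_doc, file.path(tempdir(), "metadata.xml"))
}
```

Keeping one list as the single source means the human-readable and machine-readable files cannot drift apart, which is one possible way to approach the idea above.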

andersfi commented 4 years ago

As usual, there is no such thing as a new great idea: see the EMLassemblyline project (https://github.com/EDIorg/EMLassemblyline) for functionality similar to what is suggested here...

DrMattG commented 4 years ago

Great, we can make use of that. It seems well maintained too, which is useful. We can lean heavily on this and the other EML packages. rOpenSci has a couple of packages that do aspects of what we need too, but we need to weave them all together into a single(ish) workflow. Their emldown package is particularly cool (https://ropensci.org/blog/2017/08/01/emldown/).

andersfi commented 4 years ago

Suggest that, as a first step, the "build_folder_structure" function creates a basic folder structure and empty metadata files (which can be edited manually or through a call to a Shiny app for graphical help). We may streamline this later, but it would get the project up and running more quickly?
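A rough sketch of what such a first step could look like in base R. Only the function name comes from the suggestion above; the arguments, folder names, and file names are assumptions made for illustration, not the package's actual layout.

```r
# Sketch of build_folder_structure(); the default folder and file names are hypothetical.
build_folder_structure <- function(project_path,
                                   folders = c("raw_data", "mapped_data", "scripts"),
                                   metadata_files = c("metadata.md", "metadata.xml")) {
  # Create the project root and each sub-folder; existing folders are left alone.
  dir.create(project_path, recursive = TRUE, showWarnings = FALSE)
  for (folder in file.path(project_path, folders)) {
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }

  # Create empty metadata placeholders, without overwriting files that already exist.
  placeholders <- file.path(project_path, metadata_files)
  file.create(placeholders[!file.exists(placeholders)])

  invisible(project_path)
}

project <- build_folder_structure(file.path(tempdir(), "my_data_project"))
list.files(project, recursive = TRUE, include.dirs = TRUE)
```

Because existing files are never overwritten, the function can be rerun safely after the placeholders have been edited.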

ErlendNilsen commented 4 years ago

It would be useful to draw out a typical workflow as well, perhaps in an Rmd document? When starting a new (data) project, first set up the folder structure, then the metadata, the DMP, etc.

andersfi commented 4 years ago

Yes, currently this is indicated loosely in the metadata of the repo.

ErlendNilsen commented 4 years ago

Would be nice to start drawing out the workflow explicitly.

DrMattG commented 4 years ago

I can add that as a template in the rmarkdown templates folder (it only has a minimal metadata template at the moment), perhaps so that it adds a README/Instructions file to the topmost level of the project folder.

ErlendNilsen commented 4 years ago

I was thinking of a working doc for us when we start building the functions - to see better how the functions / workflow we build fits together. But could be included as a template as well for the users to fill out - that could be a good idea actually.

softloud commented 4 years ago

Would a graph of the workflow be useful? I've been teaching myself graphing techniques to describe workflows in a few packages I'm developing. The graphs are rough, could be prettier but can be easily updated as they are created with a couple of tables that 1) describe the functions 2) describe from and to connections. Update the tables, and the graph automatically updates, no fiddling around with placing things in flowcharts. For example:

[Image: example workflow graph]

ErlendNilsen commented 4 years ago

This looks indeed very useful!

DrMattG commented 4 years ago

@softloud what do you use to make these networks - is it igraph?

softloud commented 4 years ago

Ooh good question. I experimented with a few things.

This is a combination of the tidygraph and ggraph packages; ggraph extends ggplot2. I liked the setup for this. Create two data frames: nodes and edges. The nodes table holds any additional descriptors (in this case I wanted to differentiate between object types), and the edges table describes the from and to for the nodes. The edges need to match the node names or it'll bork out.

The nice thing is that the graph layout is automated: more nodes and edges can be added, and the graph will update. The ggplot2 syntax allows for different colouring, shapes, etc.

I recall I tried igraph:: but didn't find it as intuitive to set up.

```r
library(tidygraph)
#> 
#> Attaching package: 'tidygraph'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(ggraph)
#> Loading required package: ggplot2
library(tidyverse)

# Node table: one row per object in the workflow, plus its type.
nodes <- 
  tribble(
    ~object_name, ~object_type,
    "raw_claim_data", "dataframe",
    "preprocess_judgements", "function",
    "aggregate_cs", "function",
    "output", "dataframe",
    "preprocess_QuizWAgg", "function",
    "quizscores", "dataframe",
    "preprocess_ReasonWAgg", "function",
    "reasoning", "dataframe",
    "raw_reasoning", "dataframe",
    "priors", "dataframe",
    "qualtrics_path", "filepath",
    "get_quiz_scores", "function",
    "quiz_scores", "dataframe",
    "quiz_rubric", "dataframe"
  ) %>% 
  mutate(id = row_number())

# Look up the integer id (row position) of a node by its name.
node_key <- function(object_name) {
  nodes %>% 
    dplyr::filter(object_name == !!object_name) %>% 
    pluck("id")
}

# Edge table: from/to pairs given by node name, then converted to node ids.
edges <- 
  tribble(
    ~from, ~to,
    "raw_claim_data", "preprocess_judgements",
    "preprocess_judgements", "aggregate_cs",
    "aggregate_cs", "output",
    "preprocess_QuizWAgg", "quizscores",
    "quizscores", "aggregate_cs",
    "raw_reasoning", "preprocess_ReasonWAgg",
    "preprocess_ReasonWAgg", "reasoning",
    "reasoning", "aggregate_cs",
    "priors", "aggregate_cs",
    "qualtrics_path", "get_quiz_scores",
    "get_quiz_scores", "quiz_scores",
    "quiz_scores", "preprocess_QuizWAgg",
    "quiz_rubric", "preprocess_QuizWAgg"
  ) %>%
  mutate(from = map_int(from, node_key),
         to = map_int(to, node_key))

# Build the graph and plot it; node labels are coloured by object type.
tbl_graph(nodes %>% select(-id), edges) %>% 
  ggraph() +
  geom_edge_link(arrow = arrow(), colour = "lightgrey") + 
  geom_node_text(
    size = 2.5,
    aes(label = object_name, colour = object_type)) +
  theme_graph() +
  hrbrthemes::scale_color_ipsum() +
  theme(legend.position = "bottom")
#> Using `sugiyama` as default layout
```

Created on 2020-05-04 by the reprex package (v0.3.0)
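
Since the edges have to match the node names, a quick check before building the graph can catch typos early. The following is a minimal base R sketch, not part of the original example; the miniature tables and the `unmatched_endpoints()` helper are hypothetical.

```r
# Hypothetical miniature tables in the same shape as the example above.
nodes <- data.frame(
  object_name = c("raw_data", "clean", "tidy_data"),
  object_type = c("dataframe", "function", "dataframe")
)
edges <- data.frame(
  from = c("raw_data", "clean"),
  to   = c("clean", "tidy_dat")  # deliberate typo: no node is called "tidy_dat"
)

# Return every edge endpoint that has no matching node name.
unmatched_endpoints <- function(nodes, edges) {
  setdiff(unique(c(edges$from, edges$to)), nodes$object_name)
}

unmatched_endpoints(nodes, edges)
#> [1] "tidy_dat"
```

An empty result means every edge refers to a known node, so the name-to-id conversion should not fail on a missing name.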