Welcome to "Research Data Management in the Energy Sector"! This course teaches all skills necessary to understand the principles and motivation behind Research Data Management (RDM) and enables you to implement RDM in your work and research group.
The course focuses on applicability in the energy sector.
There are three possible ways to work with this course:
A quick word on the course format. The course is written in Markdown and implemented in LiaScript. In the upper right corner, you can switch between textbook mode, presentation mode and slides. You can have the course read aloud by clicking on the symbol at the bottom of the page in presentation mode. The best way for a stepwise navigation through the course is via the arrows at the bottom, because there might be subpages for several topics that are not displayed on the left sidebar. If you want to adapt the course for your use, you may do so by opening a branch on GitHub or downloading individual files.
The following two examples show where good research data management can support your research and prevent problems.
When a young researcher started his PhD, he was told to work on unpublished data that had been collected three years earlier. He received various folders full of data. After going through them, he found several datasheets with duplicate names but different contents, scripts whose purpose and function nobody knew, and column names that were unclear and ambiguous. Moreover, the exact equipment and settings used for the experiments were unknown. Since several years had passed, not even extensive talks with the equipment manufacturers or the data authors could make the data usable. In the end, the data could not be re-used.
Source: FDM Thüringen: Scarytales, Licensed under CC-BY 4.0
The Mars Climate Orbiter (MCO) was part of a NASA program to gather information on Mars. The MCO was launched on December 11, 1998. The aim was for the MCO to circle Mars in a circular orbit and perform measurements of the planet's atmosphere and climate. However, a problem occurred during the orbit insertion maneuver: the space probe came too close to Mars and was lost.
The navigation problems resulted from the involved institutions using different units in their calculations. While the navigation team used the metric system, Lockheed Martin Astronautics, the American company that had built the probe, used imperial (US customary) units of measurement. The conversion between units (e.g. newton-seconds vs. pound-seconds) was not always taken into account, which led to errors in the course corrections.
In fact, NASA had made it clear in its "software interface specification" that the metric system should be used. The course correction program SM_FORCES by Lockheed Martin Astronautics was not written in accordance with the official specification and caused the loss of the spacecraft.
Source: FDM Thüringen: Scarytales, Licensed under CC-BY 4.0
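The unit mismatch described above can be illustrated with a minimal Python sketch. It is not the actual SM_FORCES code; the function name `impulse_to_si` and the unit labels are hypothetical and chosen only for illustration. The idea is that every value carries an explicit unit label and is converted to SI at the interface, so that an undocumented or unexpected unit raises an error instead of silently entering the calculation.

```python
# One pound-force second expressed in newton-seconds
# (follows from the definition 1 lbf = 4.4482216152605 N).
LBF_S_TO_N_S = 4.4482216152605

# Hypothetical unit labels accepted at the interface.
CONVERSION_FACTORS = {
    "N*s": 1.0,
    "lbf*s": LBF_S_TO_N_S,
}


def impulse_to_si(value: float, unit: str) -> float:
    """Convert an impulse to newton-seconds and reject unknown units."""
    if unit not in CONVERSION_FACTORS:
        raise ValueError(f"unknown impulse unit: {unit!r}")
    return value * CONVERSION_FACTORS[unit]


# A value reported in pound-seconds is converted before it is used.
impulse_si = impulse_to_si(10.0, "lbf*s")
print(f"{impulse_si:.3f} N*s")
```

Storing the unit together with the value, for example as a column in the metadata, is what makes such a check possible in the first place.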
Although the consequences may be less critical for society at large, scientists face similar problems in their research: data are insufficiently labeled or have been overwritten, commercial computer programs have been discontinued, standards were not used, or process details were not recorded.
Research Data Management (RDM) aims to break this dynamic by ensuring a sustainable and coherent strategy for all data types throughout the research process. This enables researchers to store, access, and re-use their own work effectively and safely, and to open their findings worldwide to improve cross-disciplinary collaboration, monitoring, and replication.
RDM includes all activities associated with processing, storing, preserving, and publishing research data.
{{0-1}}
{{0-1}}
The "Research Data Lifecycle" describes a cyclic process of data generation and management. First, data is created, then processed. The next step is analyzing the data, followed by preserving the results. Now, it is time to give access to the data. Ideally, the cycle closes at this point: Data is being re-used and new, well-documented data is being created.
{{0-1}}
Exercise: Look at each step of the cycle with your data in mind:
- Which steps do you actively apply in your research process?
- Which tools or software do you use at each step?
- How well is the data protected against data loss? Where are possible flaws where data or information could get lost?
- Does the cycle "close"? Which repositories or platforms do you use to store your results?
{{0-1}}
{{1-2}}
Benefits of RDM in your projects
{{1-2}}
Are you frustrated with the time you invest in searching for data? Do you want to apply structured RDM cost-effectively, but at the moment lack the expertise, resources, or incentives to share your data with your group and in your field? Perhaps you worry whether your data is transferable at all because some data have ethical or epistemological restrictions, or because your project includes many stakeholders with competing interests? In this chapter, we will show you that RDM is not as tricky as you think.
{{1-2}}
{{1-2}}
If done right, RDM will...
{{1-2}}
...save time, resources, effort and money:
...improve scientific impact:
...help to prevent errors and enhance data security:
...show compliance with research obligations of your institution and funder:
{{2-3}}
{{3}}
It's Quiz Time
{{3}}
RDM helps save time that is otherwise lost while:
{{3}}
[[ ]] Collecting data (M)
[[ ]] Analyzing data (A)
[[x]] Searching for, recovering, and deciphering data (W)
[[ ]] Archiving data (G)
{{3}}
One of the benefits of RDM is making data:
{{3}}
[[ ]] Irreversible (F)
[[ ]] Obsolete (R)
[[x]] Reusable (O)
[[ ]] Inaccessible (S)
{{3}}
RDM enables researchers to benefit from high-quality datasets from:
{{3}}
[[ ]] Social media platforms (C)
[[x]] Other researchers (R)
[[ ]] Government agencies (L)
[[ ]] Non-profit organizations (E)
{{3}}
RDM can influence research developments:
{{3}}
[[ ]] Immediately after the original research is completed (T)
[[ ]] Only within the same discipline (U)
[[x]] Continually, long after the original research is completed (L)
[[ ]] Only across disciplines (D)
{{3}}
One way RDM helps prevent errors is by:
{{3}}
[[ ]] Encrypting data (S)
[[ ]] Backing up data (Q)
[[x]] Synchronizing records (D)
[[ ]] Archiving data (Y)
{{0-1}}
Open Science strives to make scientific research and its dissemination accessible to all levels of society. It is based on the four principles of transparency, reproducibility, reusability and open communication. Open Science can encompass publishing open research, campaigning for open access, encouraging scientists to practice open-notebook science (such as openly sharing data and code), broader dissemination and engagement in science and generally making it easier to publish, access and communicate scientific knowledge (https://ag-openscience.de/open-science/).
{{0-1}}
Exercise:
Find out if your institution has an Open Science or Research Data Policy.
What is its scope? What is regulated and how?
If not: Would you like to have a Research Data Policy? What content should it have?
The main idea behind the FAIR principles is: As open as possible, as closed as necessary! The acronym stands for
FAIR Guiding Principles by A. Ahrens. Licensed under [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/deed.en), Source: https://www.go-fair.org/fair-principles/

> **It's Quiz Time**:
>
> What can you do to make your data more FAIR?
>
> [[X]] use Creative-Commons or GNU licenses (S)
> [[ ]] keep processing details undisclosed (I)
> [[ ]] ensure access security (A)
> [[X]] create detailed Metadata (B)
> [[X]] ensure long-term accessibility in repositories (E)

## Data Management Plan

{{0-1}}
The **Data Management Plan (DMP)** contains all information needed to describe and document the collection, processing, storage, archiving and publication of research data within a research project.

{{0-1}}
> **Exercise**:
>
> Watch the following video and answer the questions that pop up.

{{0-1}}

{{1-2}}
Many public funding organizations require a DMP before granting funds for research projects, which makes DMPs an integral part of the scientific process, especially in data-intensive research fields such as the energy sector.

{{1-2}}
| Funder | Plan demanded? | Content | Updates? |
|:-------|:---------------|:--------|:---------|
| Horizon Europe | Data Management Plan | see [Horizon Europe Online Manual](https://op.europa.eu/en/web/eu-law-and-publications/publication-detail/-/publication/9570017e-cd82-11eb-ac72-01aa75ed71a1) | Yes |
| [German Research Foundation (DFG)](https://www.dfg.de) | Information on the handling of research data | [DFG Guidelines on the Handling of Research Data](https://www.dfg.de/en/research_funding/principles_dfg_funding/research_data/), Checklist | No |
| [German Ministry for Economic Affairs and Climate Action (BMWK)](https://www.bmwk.de/) | Research Data Exploitation Plan | scientific and economic potential, connectivity and transferability | once a year |
| [German Ministry of Education and Research (BMBF)](https://www.bmbf.de/bmbf/de/forschung/digitale-wirtschaft-und-gesellschaft/aktionsplan-forschungsdaten/aktionsplan-forschungsdaten_node.html) | Plan sometimes required | Content depends on the respective program | |
| [German Ministry for Digital and Transport (BMDV)](https://bmdv.bund.de) | Research Data Exploitation Plan sometimes required | Content depends on the respective program | |

{{0-1}}
A completed DMP has been created for the project "Self-healing flexibility aggregation". The project is about the application of distributed artificial intelligence in energy system research. Multiple distributed energy resources are combined into a self-organized, controllable flexibility pool that offers aggregated flexibility to the grid operator. The plan was created using [RDMO](https://rdmo.aip.de), which supports creating DMPs from multiple templates. Exemplary DMPs in various templates, a description of the project and the templates themselves can be found [here](https://cloud.uol.de/s/dXsWFGc48qjenPg).

## Infrastructure

To better handle research data, specific tools and services are needed. On the one side, they can support you in the handling of research data and software.
On the other side, they can provide places to store data and/or metadata. Within this chapter, you can learn more about different existing tools and resources.

In Germany, the German National Research Data Infrastructure (NFDI) is a large initiative supported by the states and the federal government that aims to improve and further develop the infrastructure for research data. If you want to learn more about the NFDI in general, check out [this video](https://www.youtube.com/watch?v=uJ01g9m8uE4). For the energy domain, NFDI4Energy is the responsible consortium. You can check out their latest developments on [their website](https://nfdi4energy.uol.de/).

### Tools

{{0-1}}
Tools can help you to...

{{0-1}}
- comply with data management requirements by providing guidance and templates for data management planning. They facilitate data sharing and preservation.
- organize, analyze and visualize your data in order to make it easier to draw insights and conclusions from your research. Many research projects generate large and complex datasets that can be difficult to manage without the right tools.
- facilitate collaboration by allowing multiple researchers to access and analyze data.
- ensure accuracy and reliability of your data by providing features such as data validation, version control and audit trails.
- preserve your data long-term with backups, archives, and structured metadata.

--{{1}}--
The specific features and capabilities of research data management tools can vary widely depending on the tool and the intended use case.

{{1-2}}
Tools for RDM can be divided into three categories. Internal DM tools like [Coscine](https://www.coscine.de/) and [IndiScale](https://www.indiscale.com/) enable RDM within your group or organization, while tools like [Zenodo](https://zenodo.org/) support you in sharing research data with other people and organizations. DMP tools support you in setting up and maintaining your data management plan.

{{1-2}}
| Tools for internal DM: | Tools to share and register your research data: | DMP Tools |
| :--------- | :--------- | :--------- |
| [Coscine](https://www.coscine.de/) | [Zenodo](https://zenodo.org/) | [RDMO](https://rdmorganiser.github.io/en/) |
| [IndiScale](https://www.indiscale.com/) | [Open Energy Platform](https://openenergy-platform.org/) | [DMPOnline](https://dmponline.dcc.ac.uk/) |
| | [GitHub](https://github.com/)/[GitLab](https://about.gitlab.com/) | [DMPTool](https://dmptool.org/) |

{{1-2}}
You can also create a DMP in a simple spreadsheet document (e.g. Excel). The tools above provide additional help by letting you choose a set of questions that you want to follow in your DMP. If you want to take a look at an example DMP, you can look [here](img/DMP_GridCapacity.pdf). We recommend using RDMO or DMPOnline because of their broad spectrum of available questionnaires.

{{2-3}}
> **DMP Task**:
>
> Now it is time to start your DMP! Choose a **tool** and a set of **questions** that you want to work with and enter your basic project parameters:
>
> 1. Title and Research Question
> 2. Project Partners and Institutions
>
> Depending on your settings, you can add other details required by the respective questionnaire. For a quick start, we recommend [DMPOnline](https://dmponline.dcc.ac.uk/) here.

{{2-3}}
Congratulations! You have started your first DMP!

{{3}}
**It's Quiz Time!**

{{3}}

### Data Types

{{0-1}}
Depending on your field of expertise, there will be different data types relevant to your work. In your DMP, you can specify individual datasets and name their identifiers, file size, origin and so forth.

{{0-1}}
The data types you choose to publish have consequences for your Research Data Management with regard to the "Accessible", "Interoperable" and "Reusable" criteria of the FAIR principles.

{{0}}
**1. Accessible** --> Make sure that (meta)data are long-term accessible.
{{0-1}}
The amount of storage space needed for long-term, safe storage may vary greatly between data types. For example, if your research data consist of high-resolution images of potential wind turbine sites, the storage space needed for backups and repositories will be much larger than for a handful of CSV files with projected power loads.

{{0}}
**2. Interoperable** --> Use tools and software that are FAIR themselves

{{0-1}}
Strive to choose, where possible, programs and data types that are open and FAIR. For example, do not describe your processing details in a .pdf file, but rather use a .txt file that can be read and edited with various programs.

{{0}}
**3. Reusable** --> Know and meet community standards

{{0-1}}
Which data types are most likely to be found, easily reused and shared in your scientific community? If these are "closed", you could provide two files: one that adheres to the common standards and one more open alternative that promotes the idea of open science.

{{1-2}}
> **Exercise:**
>
> Make a list of the data types commonly used in your working group and their "FAIR value". Use the criteria in the table below.

{{1-2}}
| FAIR value | Machine Readability | Human Readability | Long-term Stability | Metadata | Example |
| :--------- | :--------- | :--------- | :--------- | :--------- | :--------- |
| very good | with common open software | yes and without special software | standardized format | completely preserved | .csv |
| good | with common and well-documented software | compressed with standard procedures, but practically yes | long-term or widely established | technical details are provided | |
| moderate | proprietary standard format | with open software (reliably?) convertible to higher class | relatively new format | some important ones (e.g. units) are included | MATLAB file |
| bad | self-developed reading software | no | newly developed | no information | .xlsx |

{{1-2}}
Source: translated and adapted from [forschungsdaten.info](https://forschungsdaten.info/themen/veroeffentlichen-und-archivieren/formate-erhalten/), [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en)

{{2}}
> **DMP Task**:
>
> If you already know datasets that you will be (re-)using or producing, enter them in your DMP as thoroughly as possible at this point.
>
> If you do not know your datasets yet, enter one or two "dummy" datasets and follow the respective questionnaire to get an idea of the amount of detail required in the DMP.

### Software and Simulation

{{0-1}}
Simulations and software programs are a special case. Here, not only the resulting datasets are of scientific value but also the programs, algorithms and settings leading up to the data. Research programs and simulation software are essential in energy research. Some researchers are very reluctant to share this part of their research work, sometimes because of economic considerations.

{{0-1}}
The German Research Foundation (DFG) addresses software development in a research context in its [Guidelines for Safeguarding Good Research Practice](https://www.dfg.de/en/research_funding/principles_dfg_funding/good_scientific_practice/index.html), stating that "software programmed by researchers is made publicly available along with the source code". The source code must be persistent, citable and documented. Certain individual exceptions are possible, though.

{{0-1}}
So, whether you plan to publish some, all or none of your self-developed software, you have to document it well, plan ahead and ensure a good access strategy in your development team.

{{0-1}}
- Versioning: Various cloud programming platforms such as [GitLab](https://about.gitlab.com/de-de/) ensure a consistent preservation of all of your software versions, make contributions transparent and simplify collaboration.
- Documentation: Programs should be documented and commented, e.g. by stating origin, purpose and scope of a program, limitations of variables and ideally a short manual.
- Publication: When an article is published in a journal, the corresponding software should be cited with its version number and a persistent identifier such as a Digital Object Identifier (DOI). Choose a software archive with version control. Sometimes, timely open access is not possible. In this case, at least the algorithm used must be outlined completely. If the source code is not published, state the reason and, if possible, the time period until release.

{{1}}
__Simulation__

--{{1}}--
With regard to simulation, we are presented with some additional challenges.

{{1-2}}
The longevity of simulation outputs is harder to assess than that of observational data. In their paper on Open Science in Earth System Modelling, [Mullendore et al. (2021)](https://doi.org/10.3389/fclim.2021.763420) identify the following problems:

{{1-2}}
- Simulations can generate massive output.
- Interdependencies between hardware and software can limit the portability of models and make the long-term accessibility of their output problematic. Models in many cases involve interconnections between community models, open source software components, and custom code written to investigate particular scientific questions.
- A lack of standardization and documentation for models and their output makes it difficult to achieve the goals of open and FAIR data initiatives.

{{1-2}}
The following strategies may be applied when working with simulations:

{{1-2}}
- Analyze your project: does it focus on **knowledge** or **data production**? "Most scientific research projects are undertaken with the main goal of knowledge production (e.g., running an experiment with the goal of publishing research findings). Other projects are designed and undertaken with the specific goal of data production, that is, they produce data with the intention that those data will be used by others to support knowledge production research." [(Mullendore et al. (2021))](https://doi.org/10.3389/fclim.2021.763420)
- Publish all elements relevant to the simulation, not just source code.
- Publish enough output data to evaluate and reuse your findings, but not all simulation runs.
- Some software may be released openly while other software remains restricted due to security or proprietary concerns. In this case, the documentation should provide enough information to reproduce the process.

{{1-2}}
![software](img/Software.svg "Own graphic based on [Mullendore et al. (2021):](https://doi.org/10.3389/fclim.2021.763420) Data and software elements to be preserved and shared by all projects.")

{{2-3}}
__Checklist for RDM in simulation and software development:__

{{2-3}}
- Train yourself and your team in software development quality
- State the purpose of each program in a few words
- Keep your software up to date with quality management
- Keep the intention of every function clear (by naming or comments)
- Choose an appropriate license
- Establish quality management in your simulation process

{{3}}
> **DMP Task**:
>
> If you have self-developed programs in your project, enter them as a separate dataset in your DMP, name all contributors and the purpose of each program.

{{4}}
**It's Quiz Time!**

{{4}}
1. What is the principle behind "as open as possible, as closed as necessary"?

{{4}}
[[ ]] Openly sharing all research data and tools (G)
[[ ]] Using closed and proprietary software for data analysis (U)
[[x]] Striving to choose open and FAIR tools and data types when possible (A)
[[ ]] Restricting access to research data and tools (W)

{{4}}
2. What should researchers do if timely open access to the software is not possible?

{{4}}
[[ ]] Provide a detailed algorithm without outlining the source code. (I)
[[ ]] Delay the publication of the article until the software is open access. (P)
[[x]] State the reason for not publishing the source code and the expected release timeframe. (T)
[[ ]] Exclude any mention of the software in the article. (S)

{{4}}
3. What are the challenges associated with the longevity of simulation outputs?

{{4}}
[[ ]] Simulation outputs are easily assessable and portable. (M)
[[ ]] Simulation outputs are often small in size, making long-term accessibility simple. (N)
[[x]] Simulations can generate massive output, and interdependencies between hardware and software can limit their portability. (A)
[[ ]] Standardization and documentation for simulation outputs are well-established. (X)

{{4}}
4. What should be done if some software used in simulations cannot be openly released?

{{4}}
[[ ]] Provide no documentation or information about the restricted software. (T)
[[ ]] Exclude any mention of the restricted software in the documentation. (O)
[[x]] Ensure the documentation provides enough information to reproduce the process. (M)
[[ ]] Delay the publication until all software used in simulations can be openly released. (F)

{{4}}
5. What is one of the considerations for ensuring long-term accessibility of research data?

{{4}}
[[ ]] Using tools and software that are FAIR themselves. (V)
[[x]] Choosing data types that require minimal memory space. (A)
[[ ]] Storing data in repositories with high backup capacity. (J)
[[ ]] Describing processing details in a .pdf file. (Z)

{{4}}
6. What are the recommended characteristics of software documentation in research?

{{4}}
[[ ]] Avoid documenting the origin, purpose, and scope of the program. (R)
[[ ]] Provide a detailed manual with step-by-step instructions. (K)
[[x]] Comment the program with limitations of variables. (N)
[[ ]] Keep the documentation minimal and concise. (C)

### Backup

{{0-1}}
A reliable backup starts with well-structured and clearly named data.
When naming files and folders, you should adhere to the following standards:

{{0-1}}
- Give meaningful names
- Use uniform schemes
- Develop a logical structure
- Include the date in the following form: YYYYMMDD
- Avoid spaces and special characters

{{0-1}}
| DO | DON'T |
| :--------- | :--------- |
| 20230312_h2oSample1 | BDD extract_edited colored.jpg |
| 20230315\_location5\_processed | original.jpg |
| 2018_weatherlogfiles | table1_john(copy) |

{{0-1}}
Source: inspired by [Biernacka et al. (2020)](https://doi.org/10.5281/zenodo.4071471): p. 31.

--{{0}}--
You should develop a backup policy early on in your project, since it is very hard to restructure established processes later on.

{{1-2}}
A handy rule is the 3-2-1 rule:

{{1-2}}
![backup](img/Backupregel_CCBY4.jpg "I. Lang/Bearbeitung E. Böker: 3-2-1 Backupregel. Edited by A. Ahrens: Translation. Licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en). Source: https://forschungsdaten.info/")

--{{1}}--
The 3 2 1 rule states that you should create at least three copies of your data, store them on two different kinds of storage media and keep at least one copy off-site. This way, your data is protected against natural disasters, which will most likely only hit one location at a time, against accidental deletion and against deterioration of one kind of storage medium. For example, CDs have quite a long lifespan, but CD drives as reading devices have become increasingly scarce.

{{1-2}}
> __Exercise__:
>
> Which backup solutions does your institution offer? If they do not meet your requirements, calculate additional costs in order to include them in your project proposal.

> __Exercise__:
>
> Look closely at the graph to identify which measures especially apply to your area of work and the data types you are using.
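
The file naming standards from the Backup section can also be checked automatically before data are copied to a backup. The following Python sketch is only one possible interpretation of those standards: the pattern `NAME_PATTERN` and the function `follows_convention` are hypothetical names, and the check is deliberately strict, requiring a full YYYYMMDD date at the start of the name. A year-only name such as `2018_weatherlogfiles` would therefore be rejected, so adapt the pattern to the scheme your group agrees on.

```python
import re
from datetime import datetime

# Assumed convention: an eight-digit date, an underscore, then letters,
# digits, underscores or hyphens, and an optional file extension.
NAME_PATTERN = re.compile(r"^(\d{8})_[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?$")


def follows_convention(name: str) -> bool:
    """Return True if the name starts with a valid YYYYMMDD date and
    contains no spaces or special characters."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return False
    try:
        # Reject impossible dates such as month 13.
        datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False
    return True


for candidate in [
    "20230312_h2oSample1",
    "BDD extract_edited colored.jpg",
    "table1_john(copy)",
]:
    status = "ok" if follows_convention(candidate) else "rename"
    print(f"{status}: {candidate}")
```

Running such a check regularly, for example before each backup, helps to catch inconsistent names while they are still easy to fix.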