Hi Pim (@phvree), I am Paula and I am the instructor from the library supporting Esther with this run of the course. I normally teach in the RDM101 course provided to all faculties. Thank you very much for your assignment! I had a look and I have some questions/suggestions.
About the listing of data types and code. It is good to think separately about data, design data/information and code. But, for the purpose of this assignment, and also thinking ahead to assignment 2, it would be good to split things a bit more or create subcategories of data and code. For example, within the experimental data you will also be collecting different types of data. The instruments used for collecting images will be different from those used for collecting sensor data, right? That means the file formats, the software used for collecting them and the documentation you will need for them will also be different. The same goes for the experimental data from TNO. If you do not yet have detailed information about the data you will have access to, and in which format, it is fine to group that data for now. But it will be useful to create a more detailed listing and descriptions, especially when thinking about data organization.
It got my attention that you describe the experimental data as .mat files. Is that because you will process the raw images and the raw sensor data in MatLab? If that is the case, then you should probably distinguish between the raw data (with its own description, file formats and sizes) and the analysed data, which might be stored in .mat format. I could be saying something totally wrong, so it would be great if you could comment on that, so I also learn from your workflows :-) Maybe you could have a look at @SDev5 Sercan's assignments to compare? He seems to have similar data types and code to yours; maybe it can help to look at your data and code from another point of view?
A couple of your peers have mentioned this NAS server, which is fantastically synchronized with the instrument computers and laptops. If the computers of the instruments are synchronized with the NAS, why are you also using an external drive? Is that your personal external drive, or is it an external drive which is always connected to the PCs? I hope that TNO is willing to use one of the secure alternatives for sharing data that TU Delft recommends: either the Project Data (U:) drive or SURFdrive, maybe (if the data is not highly confidential)?
@EstherPlomp, is this NAS server something provided by ICT for Applied Sciences? Or is this NAS server administered by the faculty?
Here is my template for the readme ReadmePhD_Pim_vree.txt
@MpaulaL thanks for the extensive feedback.
In our measurement setup almost all sensors and systems are controlled and read out through Matlab, so that we have a central system that coordinates and controls everything during a measurement. Typically, the raw data is collected through the drivers, also in Matlab, where it is stored in a .mat file together with relevant experimental parameters (which are most of the time also controlled through Matlab), such as the temperature of the system, laser power, position of the magnet, etc. These parameters are set directly through Matlab but are important to store for further analysis. Typically, we thus have .mat files that store the data and .m files with Matlab scripts and functions to analyse and capture the data. You are right that I could further divide those files into subcategories.
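To illustrate (with made-up variable names and values, not our actual script), a single measurement step roughly looks like this:

```matlab
% Rough sketch with hypothetical names/values, not the actual lab script.
% Experimental parameters set through Matlab before the measurement:
params.temperature_K  = 4.2;    % temperature of the system
params.laser_power_mW = 1.5;    % laser power
params.magnet_pos_mm  = 12.0;   % position of the magnet
params.timestamp      = datestr(now, 'yyyy-mm-dd_HH-MM-SS');

% Raw data read out through the instrument drivers (placeholder here):
raw_data = rand(1000, 1);

% Store the raw data together with the parameters in one .mat file:
save('measurement_001.mat', 'raw_data', 'params');
```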
The external drive is connected to the lab PC and is mainly used to store the data during the measurements (with the main benefit that it also works without an internet connection and is a lot faster). For TNO, the exact data and work method are not yet properly defined, but I will try to be a bit more specific; I guess it will be mainly a carbon copy of my main project datasets.
RDM101_Assignment2_Week2_DataFlowMap_Template.pptx
Here is the link to the ppt containing my updated datasets based on your feedback and the information pertaining to the FAIR principles.
I did have a hard time answering some of these questions, because I really did not know what is standard in the field and how to do all of these things in an efficient manner, while keeping the current consistent structure of the project. For the coming weeks I will try to see if I can come up with an effective solution to this.
But I feel that where the formats described in the examples fall short is in projects spanning multiple people and many years, with a wide range of subprojects. There, I think it is better to store all the data and all the code in their own large top-level folders, rather than giving each subproject its own data and code folders.
so instead of having this:
├ Overall project
├── subproject 1
│   ├── data
│   └── code
├── subproject 2
│   ├── data
│   └── code
I think it is better to have this:
├ Overall project
├── data
│   ├── subproject 1
│   └── subproject 2
├── code
│   ├── subproject 1
│   └── subproject 2
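One practical benefit of this layout (a rough sketch, with hypothetical folder and variable names) is that an analysis script in the code folder can find its matching data folder with a simple relative path:

```matlab
% Hypothetical example: a script living in code/subproject 1 loading
% the measurements from the parallel data/subproject 1 folder.
project_root = fullfile('..', '..');                 % up to "Overall project"
data_dir     = fullfile(project_root, 'data', 'subproject 1');
files        = dir(fullfile(data_dir, '*.mat'));     % all measurement files
for k = 1:numel(files)
    S = load(fullfile(data_dir, files(k).name));     % contains raw_data, params
    % ... analysis on S.raw_data using S.params ...
end
```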
As already mentioned, it is not yet clear how confidential the TNO data is (or whether it is confidential at all), so I could not really draw any different conclusions for those datasets.
Thanks for the explanation @phvree! Although I am not a fan of MatLab for being a proprietary format, I think it is a great automated process that you have there at your lab! :-D
Hi @phvree, thank you very much for your assignment! And thanks a lot for considering my feedback and extending the data types and code list! I think it gives a much better overview.
There are several of you using the group server for collecting the data from instruments. This is very nice for making sure that all the data is securely saved, but it can be a challenge to make the data easily findable and linked to other data. Maybe in the next class you could discuss this a bit with your peers.
I think you have done a very nice reflection on how to better organise the data and code. You will need to get started and then see if the plan fits your workflow; if not, you can make some adjustments. I've also seen your folder structure assignment and it looks a bit different from the one in the PowerPoint. Do you have an inclination as to which one to use? I think it will be very important for you to discuss this with your supervisor and maybe your team members. If the improvements in the folder structure are only used by you on the common server, it might be challenging for others to find the data. If there is no wish from your supervisor and team members to improve the folder structure, maybe you could keep the main folder structure:
./NAS/Measurement_data/Setup/
└── type of sample
    └── sample
        ├── analysis         <- scripts for intermediate and final analysis
        └── Measurement data <- raw and intermediate data
and make little changes, like adding some ReadMe files with documentation on how to navigate the folder structure, describing the file naming conventions and adding the paths to linked data or code. You could also maybe create a good subfolder structure within the analysis folder, separating the scripts for intermediate analysis from those for the final analysis. And in the measurement data, separate the raw from the intermediate data?
It is good, great actually, that you now know better ways to organise the data. If at any point you can decide yourself how to do it, you already have a pretty good idea :-) Maybe there is an opportunity to organise the TNO data in the way you envisioned?
About documentation. Great that you thought through what type of information you should collect. About the question whether to use a ReadMe file or to include it in the .m file: I think there are pros and cons to both. A .txt file can be opened with any text editor; open file format, yay! But if there is a way to automate the collection of documentation, that is a nice thing to do when thinking about how to document the data well and efficiently. Since you are also using OneNote as a kind of lab book, make sure that documentation you have in other places is linked to your experiments. It could be just indicating the path to where the right documentation is. If possible, try to be consistent in how you organise your experiments in OneNote. Maybe you can create some kind of templates? Here are some of the basic tips that we provide for paper notebooks; they can also help when using tools like OneNote as a lab book:
• Write in a commonly agreed language, e.g. English
• Write down the date for each record
• Make sure that it is possible to separate the different experiments, e.g. by using meaningful names
• Note where the raw experimental data will be stored and the name of the corresponding data file
About your idea of writing a ReadMe file for some of the whole datasets: I think it is great. Sometimes you don't need a ReadMe file per file; you can create a ReadMe for the folder that tells a bit about what type of data is compiled in that folder, how it was collected (which instruments, which model), the file naming conventions, the file formats and how to open the files, etc. That type of information would apply to all datasets.
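If you do decide to automate part of it, a small sketch of how a measurement script could append an entry to such a folder-level ReadMe (the file name, folder and fields here are just hypothetical examples):

```matlab
% Hypothetical sketch: append one line per measurement to a folder-level
% ReadMe.txt, so the data folder documents itself as you measure.
data_dir      = 'Measurement data';          % example folder name
sample_name   = 'sampleA';                   % example sample
temperature_K = 4.2;                         % example parameter to log
data_file     = 'measurement_001.mat';       % file this entry refers to

fid = fopen(fullfile(data_dir, 'ReadMe.txt'), 'a');   % append; create if absent
fprintf(fid, '%s | sample: %s | T = %.1f K | file: %s\n', ...
        datestr(now, 'yyyy-mm-dd HH:MM'), sample_name, temperature_K, data_file);
fclose(fid);
```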
About metadata. I totally agree with your decision. If metadata is collected automatically during the measurement and is attached to the data, that is great. A good file naming convention that provides meaningful extra metadata is a great choice!
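Such a naming convention can even be generated in code, so every file name is consistent; a minimal sketch with hypothetical fields:

```matlab
% Hypothetical helper: build a consistent, metadata-rich file name,
% e.g. '2024-01-15_setup1_sampleA_magnet12.0mm.mat'.
% Usage: fname = make_filename('setup1', 'sampleA', 12.0)
function fname = make_filename(setup, sample, magnet_pos_mm)
    fname = sprintf('%s_%s_%s_magnet%.1fmm.mat', ...
                    datestr(now, 'yyyy-mm-dd'), setup, sample, magnet_pos_mm);
end
```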
File formats. Just remember to add information in the ReadMe file about what a .m file is and which software to use to open it. The same for AutoCAD and KiCad. If working with MatLab is the standard in your group, it would require an enormous effort to transform the code to .py. So just make sure you provide the relevant information about file formats in the documentation.
About access. I see that there are uncertainties about making data openly available. I guess that is why there is also no section for data publication? It is important that you discuss this with your supervisor, or ask your team mates if they have a repository of choice. 4TU.ResearchData is always a good choice: it provides a DOI per dataset and/or code, you can provide a license, you can add more standardized metadata and, as a TU Delft researcher, you can deposit up to 1 TB per year. It would be great if, besides the raw data, you also publish the analysis scripts to demonstrate how you generated the figures and graphs in your publications. The same for the scripts of the model.
About the license, CC-BY-SA: it is an intermediately open license. If possible, I would go for an even more open option, CC-BY. Sometimes we have good intentions when restricting licenses, to ensure that the data is also shared openly. But if somebody builds upon the data of your project, perhaps adding more data to the dataset, and would like to publish it as openly as possible, they could not use a CC-BY or a CC0 license if they reuse the data of your project.
I hope you found the exercise useful, Pim. Do not get discouraged if you can't implement all best practices immediately or at once. If needed, do it in incremental steps!
Thanks for sharing your ReadMe @phvree, well done!
Just a couple of pointers:
Hope that helps!
Introduction
Please briefly introduce yourself here, for example: Hi all, my name is Pim Vree and I'm a PhD student in the van der Sar lab. I also have two cats!
Reflections on the importance of RDM videos
Reflect on what you heard in the video and briefly write your thoughts and your horror stories in less than 5 sentences.
Nice and insightful explanation of why data management is important. My horror stories mostly involve experiments where important measurement settings were either not stored properly in the data files or not stored at all.
What would you like to learn during this course?
Are there any things in particular that you would like to get out of this course? Do you have any goals that you would like to work on?
I would like to learn a good way to structure and store research data, and a structured way to keep an overview of my data analysis.
Checklist assignments