EstherPlomp / TNW-RDM-101

Self-paced materials of the RDM101 course
https://estherplomp.github.io/TNW-RDM-101/
Creative Commons Attribution 4.0 International

Assignment 1 Maurits Houmes #39

Closed - mausi122 closed this issue 1 year ago

mausi122 commented 1 year ago

Introduction

Hi all, I'm Maurits Houmes, a 3rd-year PhD student at QN. I have a dog and more hobbies/interests than I have time for.

Describe your research in 2-3 sentences to someone that is not from your field (please avoid abbreviations)

My research revolves around investigating the material properties of two-dimensional materials through nanomechanical means. We do this by creating nanodrums of the materials of interest and then looking at the resonance frequencies of these drums. Since the resonance frequencies of these systems are sensitive to a lot of different things, we are able to investigate the material properties; the big challenge is to disentangle the different effects that affect the resonance frequency.

My research entails the following aspects:

| Research Aspect | Answer |
| --- | --- |
| Use/collect personal data (health data, interviews, surveys) | No |
| Use/collect experimental data (lab experiments, measurements with instruments) | Yes |
| Collaborate with industry | Maybe |
| Write/develop software as the main output of the project | No |
| Use code (as in programming) for data analysis | Yes |
| Work with large data (images, simulation models) | Yes |
| Other | N/A |

Reflections on the importance of RDM videos

The points made to support RDM seem like open doors, as most of this was already taught to me in my BSc, but since starting my PhD it has become clear to me that although most people agree it is a good idea, in reality it is not always implemented. A timely example is a question I got last week from a colleague who wanted to use some data measured in 2016 (so long before I got here) on a setup I'm currently working on. After some searching we figured out that the only place the data was stored is some old disused PC that was left in a cupboard in the lab, which I only knew existed because I had once come across it while looking for something else. I've tried pointing out to my PIs some issues with the way we do things now, but they keep saying I shouldn't waste my time on improving it.

What would you like to learn during this course?

(See the reflection above.) I'm very aware that the current way we store our data and handle our code in our lab is not very efficient or safe, so I definitely want to improve this, but I'm not sure where to start or what would be a good system to set up. Looking around online, I almost only find concrete examples of how to deal with very large datasets, very different types of data, personal data, or code specifically, none of which I have been able to map onto my situation. I'm hoping that after this course I will have a better idea of how to go about this.

Checklist assignments

EstherPlomp commented 1 year ago

Hi @mausi122 ! Thanks for handing in your assignment 2!

It looks very good and extensive: well done!

I have a couple of comments/suggestions to consider:

mausi122 commented 1 year ago

Hi @EstherPlomp,

Thank you for the feedback.

francescozatelli commented 1 year ago

Hi @mausi122,

The data flow map is very detailed and it looks good! It was interesting to read because I think we work in similar ways in certain respects. I also had to deal with some of the challenges you are facing now :)

ArjanMejas commented 1 year ago

Dear Maurits,

That's a very well-structured review of a rather large and complex set of data.

Best, Arjan

mausi122 commented 1 year ago

> Hi @mausi122,
>
> The data flow map is very detailed and it looks good! It was interesting to read because I think we work in similar ways in certain respects. I also had to deal with some of the challenges you are facing now :)
>
> • We also had to back up the measurement data of a dedicated measurement PC. The way we eventually implemented it is to have a .bat script that copies the whole drive of the measurement PC to the U: drive. You can use 'robocopy' for this (it's a feature of Windows, so you don't need to install anything) and you can easily find online how it works and customize it. It's basically just a one-line script. Then we use the Task Scheduler of Windows to run this script every hour. Only the changed files are copied, so it's efficient. This has been working quite well so far and it's very easy to implement.
> • For the measurement scripts, have you considered using QCoDeS? It's a data acquisition framework that could take care of a lot of these things. With it you can control the instruments you use to run your experiments and store the results in databases. The nice thing is that together with the measurement results, it automatically stores plenty of metadata. For example, you can store as metadata all the parameters of all your instruments so that they can be retrieved later on. I'm not sure if this is applicable to your case, but if the changes in the measurement scripts are really minor it could be an idea to have one general script and store its details as metadata.
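A rough sketch of what the QCoDeS approach described in that last point can look like (assuming a recent QCoDeS release; the dummy instrument and all names here are illustrative stand-ins, not the lab's actual drivers or parameters):

```python
import qcodes as qc
from qcodes.dataset import (
    Measurement,
    initialise_or_create_database_at,
    load_or_create_experiment,
)
from qcodes.instrument_drivers.mock_instruments import DummyInstrument

# One database file collects all runs; each run is tagged with an experiment.
initialise_or_create_database_at("./experiments.db")
exp = load_or_create_experiment("nanodrum_sweep", sample_name="sample_A")

# A real setup would register actual instrument drivers here instead.
drive = DummyInstrument("drive", gates=["frequency", "amplitude"])
station = qc.Station(drive)  # the station snapshot (all parameters) is stored with each run

meas = Measurement(exp=exp, station=station)
meas.register_parameter(drive.frequency)
meas.register_parameter(drive.amplitude, setpoints=(drive.frequency,))

with meas.run() as datasaver:
    for f in range(10_000_000, 10_000_100):  # a tiny illustrative frequency sweep
        drive.frequency(f)
        datasaver.add_result(
            (drive.frequency, f),
            (drive.amplitude, drive.amplitude()),
        )
```

The measurement script itself stays generic; the run-specific settings travel with the data as part of the station snapshot.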

Hi @francescozatelli,

Thanks for the feedback. I'll definitely check out 'robocopy'; it seems like a good solution. We were also already thinking about using QCoDeS, but so far we haven't implemented it, as a lot of the equipment we use has no existing drivers and writing those will be a lot of work. We have been using it occasionally on a similar setup that we use a bit more as a test bed, so maybe in the future we can replace the scripts with it.
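For reference, the one-line backup script described in the quoted comment might look something like this (the drive letters, folder names, and flags are hypothetical, not from the actual setup):

```bat
:: backup.bat -- mirror the measurement PC's data folder to the U: drive.
:: robocopy skips files that are already up to date, so hourly runs
:: (scheduled via Windows Task Scheduler) stay cheap.
robocopy "D:\MeasurementData" "U:\measurement-pc-backup" /E /R:2 /W:5 /LOG+:"U:\backup-log.txt"
```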

EstherPlomp commented 1 year ago

Thanks all for the replies and helpful input!

* The GitHub/GitLab mention is indeed because I'm not sure which of these to use. I have some basic experience with GitHub, but from the TU Delft storage solutions I understood that GitLab was the method preferred by the university. I don't really have any sensitive data/code, but most of the code isn't that useful for people outside our lab either. So I'm not sure which is better to use, and I think any difference would be small.

GitLab is not necessarily the preferred solution at TU Delft - we just have an instance of it that is more secure. But if you're not working with sensitive data and you don't have external collaborators, it indeed doesn't matter much. You could try out both and see which one fits better, or just pick one and stick with it :)

* For image editing I mainly use Adobe Illustrator, for changing aesthetic parts of figures to fit the publication (changing fonts, font sizes, colours used, etc.); any data processing I do using Python scripts before that. I found that it is useful to keep the Adobe Illustrator .ai files, as they allow for easy reuse of and small changes to the figures for posters or presentations.

Thanks for your elaboration there!

* I don't really understand the second-to-last point you give, about the OneDrive/Project Drive: could you maybe elaborate on it? I currently use the personal "OneDrive - Delft University of Technology" account mostly to act as a backup for my laptop. But since this expires when I leave, I'm not sure how useful it is for datasets.

Sure! What I meant was that you can also use OneDrive as your 'active' storage solution while you are still processing the data: once you no longer need to use it as much, you can transfer it to the project drive. This is easier when you want to work on multiple devices and/or when you don't have internet and want to work locally. You should indeed not use OneDrive as your long-term storage location, as the account will expire once you leave the TU Delft. I hope this clarifies things? Please let me know if not!

* By the dynamic flag I mean that these scripts are changed a lot, and the changes are very much on a case-by-case basis for each sample or measurement run. The challenge is that I want to be able to tell from each dataset which exact version was used, but as the scripts have a lot of very minor variations that change back and forth, full versioning would easily end up with 50+ versions per month. And it isn't the case that version 2 is an improvement over version 1; it's more of a (temporary) tweak.

That sounds complicated indeed! I'm not sure if I have alternative solutions to QCoDeS (and/or GitHub/Lab). I'll ask some colleagues for advice and see if they come up with anything else!

EstherPlomp commented 1 year ago

Just to pass on a comment I have received from a colleague so far:

If it is a script with minor changes (typically configuration of measurement devices or experiment parameters), add it to the dataset as a file, but do not commit it to git. I consider that a lack of standardisation (which is ok). Ideally you would want to add a standardised config file to the dataset, but in lack thereof a flexible Python script is the alternative, which I would add to the dataset (also when publishing the data).
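A small sketch of that "standardised config file next to the dataset" idea (all file names and parameter values here are hypothetical):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Give every measurement run its own timestamped folder.
run_dir = Path("data") / datetime.now(timezone.utc).strftime("run_%Y%m%dT%H%M%SZ")
run_dir.mkdir(parents=True)

# Record the per-run tweaks that would otherwise only live in the script,
# so each dataset documents exactly which settings produced it.
config = {
    "sample_id": "sample_A",
    "drive_power_dbm": -10,
    "frequency_range_hz": [10e6, 30e6],
    "script": "resonance_sweep.py",
    "tweak_note": "temporarily lowered drive power for this sample",
}
(run_dir / "config.json").write_text(json.dumps(config, indent=2))

# ...run the measurement and save its output into run_dir next to config.json
```

This sidesteps the 50+-versions-a-month problem: the script in git stays general, while the back-and-forth tweaks are captured per dataset.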

mausi122 commented 1 year ago

Custom README Template: README - Custom Template.txt This is the template for the most common type of dataset in my project. There are a lot of different types for which I'll have to modify this, but they'll all follow this overall structure.

EstherPlomp commented 1 year ago

Thanks for sharing your README template! I'll have a look later this week!

Another tool/software to try out could be DataLad.

EstherPlomp commented 1 year ago

Well done on assignment 3 @mausi122 ! It again looks great and comprehensive! 👍

Just some small comments from my side:

Data organisation

Documentation

Access

Publication

And I will still have a look at your README file - apologies for not having looked at it yet!

francescozatelli commented 1 year ago

Hi @mausi122, the README file looks good. I think you included all of the fundamental information needed. Just a couple of comments: adding some instructions to explain how to read and plot the data (and/or some minimal scripts to do it) could be very convenient for people interested in your data. If I understood correctly, you provide the raw datasets (.mat files) and the plots (.png), but not the scripts to go from one to the other. Another detail that could be relevant to add is an explanation of what data is or is not included (is it all the data? If not, what is not reported?)
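A minimal loading/plotting script of the kind suggested above could look like this ("measurement.mat" and the variable names are placeholders for whatever the real datasets use):

```python
import matplotlib.pyplot as plt
from scipy.io import loadmat

# Load one raw dataset and reproduce the corresponding .png plot.
data = loadmat("measurement.mat")
frequency = data["frequency"].squeeze()  # drive frequency, Hz
amplitude = data["amplitude"].squeeze()  # measured response, a.u.

plt.plot(frequency, amplitude)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude (a.u.)")
plt.tight_layout()
plt.savefig("measurement.png", dpi=300)
```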

francescozatelli commented 1 year ago

And here is the feedback for Assignment 3, sorry for the double notification. This also looks great and very detailed! Just a couple of comments:

EstherPlomp commented 1 year ago

Thanks for sharing your README file, Maurits! I think you have set things up nicely with the placeholder information - well done!

PabloVelazquezGarcia commented 1 year ago

Hey Maurits, here is my feedback for Assignment 3

Both Esther and Francesco have already talked about the most interesting suggestions. I would like to ask: how do you manage to show your students how to properly store all the data in such a complex structure? It is definitely very well organized, but I wonder if it is easy for bachelor and master students to keep up with such complexity. I try to use very intuitive structures and names for my files, but this is not easy when the amount of data you create is this big!