Closed SDev5 closed 1 year ago
Hi Sercan (@SDev5), thank you very much for this complete assignment! It is very good, so I do not have much to comment on :-) The list of datasets is very comprehensive and it is really great that you dedicated the time to be very thorough. It is also very good that you separated data from code, that you identified the different types of data, and that you separated the code according to its specific use. Well done! It seems that your group has a very well-thought-out storage and backup strategy! That's also great! I find the use of tools like JupyterHub and GitLab in synchronization with the group server great. Also very important is that the computers attached to the instruments are also synced with the server. And I think using the Project Data (U:) drive for the group is a very good practice. I am curious whether you have a predefined folder structure for that? Data organization will be part of the assignment of week 2, so probably I will see it then ;-) Very well done! Did you ever reflect before on the types of data you work with and on listing them?
I am Paula by the way! I am the instructor supporting Esther in this run of the course :-D
Hi Paula, thank you very much for your feedback. Before listing all the data types, I sat down and went through our typical research workflow. From there I identified all the data types I will be working with. So in that sense I did reflect on the data types, but not prior to this course.
Maybe to clear something up, in our group we do not use the U: drive for data storage. The only data that is stored on the U: drive is the cleanroom inspection data which is collected on the computers of the cleanroom, the Van Leeuwenhoek Laboratory. The data collected there is not stored on the U: drive of our group, but on the U: drive of the Kavli Nanolab. This data is then again synchronized to our NAS such that we can access it from outside the cleanroom on all of our computers.
I have also asked my supervisor for some clarification regarding the JupyterHub server and the backups of our NAS. The JupyterHub server is apparently hosted by the faculty, but our directories are mounted on the NAS. The NAS itself is backed up nightly to university servers, and these university servers are in turn backed up in a separate location, providing us with a very robust and easily accessible backup and storage system.
Hi Sercan, thanks for the clarification! I didn't know that the Kavli Nanolab was another lab, independent of your group :-)
It is great that you now know the backup strategy behind the infrastructure you use! And it is also great to know that the NAS server complies with the 3-2-1 rule :-)
Readme file for a dataset that I measured at the end of 2022/start of 2023: https://tud365-my.sharepoint.com/:t:/g/personal/sercandeve_tudelft_nl/ES6FuCwcAgFOkWndYkab1vkBmKIJ2TxoQRbyoeDpXlH0_g?e=tdI8eD
Hi Sercan, I have looked at your dataflow map. You have split everything out very nicely. I was wondering how you store your metadata (the parameters you set for measurements, but also in your analysis code)? (I struggle with finding a good method for this myself :) )
Hi Annick, thanks for your feedback. For our measurements, our measurement scripts contain code that automatically generates a .txt file storing the measurement parameters (the axes and set parameter ranges) as well as the names of all collected variables. This metadata .txt file is stored in the same folder as the measurement data, and the information it contains allows us to automatically plot the measured data with the correct axis names and ranges. I have added an example of such a .txt file below: https://tud365-my.sharepoint.com/:t:/g/personal/sercandeve_tudelft_nl/EYl6p1chAyxBp0CyNVmq9P4BVjbyhaNoU6mO0blqAEK3Tg?e=QClhPQ
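As a rough illustration of the idea (not our group's actual script; the function name, axis names, and file layout below are all hypothetical), generating such a metadata file could look something like this:

```python
import os

def write_measurement_metadata(folder, sweep_axes, variables):
    """Write a plain-text metadata file next to the measurement data.

    sweep_axes: dict mapping axis name -> (start, stop, n_points)
    variables:  names of the quantities recorded during the sweep
    """
    lines = ["# Measurement metadata"]
    for name, (start, stop, n_points) in sweep_axes.items():
        lines.append(f"axis: {name} = {start} .. {stop} ({n_points} points)")
    for var in variables:
        lines.append(f"variable: {var}")
    path = os.path.join(folder, "meta.txt")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path

# Hypothetical example: a gate-voltage sweep recording two quantities
path = write_measurement_metadata(
    ".",
    sweep_axes={"V_gate (V)": (-1.0, 1.0, 201)},
    variables=["I (A)", "dI/dV (S)"],
)
```

A plotting routine can then parse the `axis:` lines back to label and scale the figure automatically, which is the main payoff of keeping this file next to the data.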
Regarding analysis code: personally, I work in Jupyter notebooks, so I can write documentation directly between code cells. I think this works well for now and allows me to easily reproduce results. Currently, I don't store separate metadata for processed or analysed data outside of the Jupyter notebooks, and I don't think we have a standard way of doing this in our group. But it might be good to develop one, and if I did, I think I would use a similar format to the measurement metadata.
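If we did develop such a standard, a minimal sketch of what it could look like, called from the last cell of a notebook, might be (every name, field, and file path here is hypothetical, just to show the shape of the idea):

```python
import datetime
import json
import os

def save_analysis_metadata(out_folder, params, source_files):
    """Store the parameters of an analysis next to its outputs,
    mirroring the per-folder style of our measurement metadata."""
    meta = {
        "created": datetime.datetime.now().isoformat(timespec="seconds"),
        "source_files": source_files,   # which raw data went in
        "parameters": params,           # settings used in the analysis
    }
    path = os.path.join(out_folder, "analysis_meta.txt")
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return path

# Hypothetical example: record what went into a fit
meta_path = save_analysis_metadata(
    ".",
    params={"fit_model": "lorentzian", "background_order": 1},
    source_files=["run42/data.dat"],
)
```

Writing the file as JSON keeps it human-readable in any text editor while still being trivially machine-readable later.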
Hi @SDev5 Sercan! Very well done again! Thank you very much for taking the time to reflect on each theme for each data type and code! :-) I actually have very few comments. I do write a lot, but they are only a few comments :-P
I think it is excellent that you could adopt Jupyter notebooks for documenting your data, code, and data analysis. Jupyter notebooks accompanied by complete information, in the form of Markdown comments, docstrings for the functions, and maybe some extra metadata, are a great way for others to reproduce your work. I know some researchers who always publish their Jupyter notebooks with the data analysis/workflow used to create the figures presented in their articles. Maybe that is something you could think about ;-)

I especially like that you thought of possible solutions for implementing your folder structure, and also about ways to improve code versioning for your group to adopt! I really hope they adopt your idea of transferring the data-collection scripts to GitLab. I think it should be possible to add shortcuts to the folders where data is automatically stored. If not, you can keep the folder in your structure and create a ReadMe file (.txt) to provide a bit of context on the measurement and to indicate the file path for finding the right data for that project.

About file formats: if licensed software needs to be called when creating your simulations, it will be important to add that information to the documentation of the code, i.e. which software it is, maybe the version, and the distributor. For the measurement data, you say that it is collected in .dat, which is open, but that it could be converted to .csv. Do you think this conversion provides an advantage for a potential re-user? Because if .csv makes no difference compared to .dat, then maybe you can save yourself that step?

A general question about access to the design-related data: is it normally not publicly available because you or your group expect to exploit it commercially or patent it? Within open science, some researchers have engaged in what is called Open Hardware, where design documentation and code are made available to anybody who would like to fabricate the devices.
I am not saying your group should go that way; I guess there are reasons to keep that data/code under restricted access. But if you want to know more about Open Hardware, there is a very nice community at TU Delft :-) https://www.tudelft.nl/open-hardware

About data publication: it is great news that your group uses Zenodo, it is a very good choice for data publication. About the license, I would like to make three remarks:
Very well done! I hope the exercise was useful!
Thanks for sharing your README file @SDev5! It looks great, well done!
A couple of things:
I hope this helps!
Introduction
Hi all, I'm Sercan Deve and I'm doing my PhD in the department of Quantum Nanoscience on superconducting quantum circuits.
Reflections on the importance of RDM videos
I think that the facilities in our lab with regard to data management limit the risk of losing large amounts of data, since all of our data is stored not only locally on the lab computers but also on a NAS (network-attached storage) device, which is itself backed up. Additionally, we are encouraged to push all relevant files and code related to our projects to GitLab. Personally, I also sync all my files to OneDrive. Nonetheless, these videos reminded me again of the importance of good data management and proper documentation, something that I should sometimes pay more attention to. The five reasons for reproducible data by Markowetz were really informative, and he definitely convinced me that working reproducibly is the way to go. My data horror story is related to a library in our lab that we use to write our measurement code. This library is on git, but every measurement PC has its own version with local changes. Another story is from a recent measurement run I did, in which I made a mess of all the different measurement file names...
What would you like to learn during this course?
I am interested in learning more about the FAIR data principles and how to apply them to my research data. This is a term that I have heard several times, but I never found the time to look more into it. Additionally, I would like to learn the best practices for data management and see if these practices are implemented in our lab. If not, I would like to implement them.
Checklist assignments