Assignment 1 Francesco Zatelli

francescozatelli commented 1 year ago

Introduction

Hi, I'm Francesco and I'm a PhD student at QuTech in Applied Physics.

Describe your research in 2-3 sentences to someone that is not from your field (please avoid abbreviations)

I work on semiconductor-superconductor hybrid nanowires with the goal of creating and manipulating particles that could be useful for building a quantum computer. We fabricate nanoscale devices in the cleanroom and perform electrical measurements to understand their behavior at cryogenic temperatures.

My research entails the following aspects:

Research Aspect	Answer
Use/collect personal data (health data, interviews, surveys)	No
Use/collect experimental data (lab experiments, measurements with instruments)	Yes
Collaborate with industry	Yes
Write/develop software as the main output of the project	No
Use code (as in programming) for data analysis	Yes
Work with large data (images, simulation models)	Yes
Other:	N/A

Reflections on the importance of RDM videos

I found most of the arguments presented in the video very relatable and relevant. Although it was only briefly mentioned, I think that discussing what is/should be shared is very relevant (if shaking/stirring a sample is significant, how many more things could be just as important that I have no idea of?). I found the discussion about creating a collaborative environment for reproducibility, as opposed to the idea of a "reproducibility police", particularly interesting. For me, it would be a horror story to make a mistake in a paper and have people doubt my integrity.

What would you like to learn during this course?

I would like to learn how to decide what should be shared in an online repository when publishing a paper, i.e., how can I know that the information I shared is enough to make the work reproducible?

Checklist assignments

[x] Assignment 1: creating a GitHub issue (before Class 1)
[x] Respond to the GitHub issue on 'data challenge'
[x] Assignment 2: Data Flow Map 1 (share a link in this issue before 17 May 13:00). Link: https://surfdrive.surf.nl/files/index.php/s/DUmnuKue9PzvEYa
[x] Provide feedback to at least one Assignment 2 from another participant
[x] Respond to the GitHub discussion on 'licenses'
[x] Respond to the GitHub discussion on 'folder structure'
[x] Respond to this GitHub issue with your readme file
[x] Provide feedback to at least one readme file from another participant
[x] Assignment 3: Data Flow Map 2 (share a link in this issue before 31 May 13:00). Link: https://surfdrive.surf.nl/files/index.php/s/U3dBzdTbCBJNMyh
[x] Provide feedback to at least one Assignment 3 from another participant
[x] Assignment 4: Data Management Plan (before Class 2)
[x] Respond to the GitHub discussion on 'Data Management Plans' if you have any questions (optional)
[x] Assignment 5: Data Flow Map 3: submit your slide (before Class 2)

EstherPlomp commented 1 year ago

Hi @francescozatelli ! Thanks for handing in assignment 2! It looks fantastic and I don't really have feedback for you, so I will focus below on your questions:

How to make sure that someone in the team will still have access to the project data drive? You can add anyone with a netID to the project drive by adding them from the start, asking the service desk to add a member to the drive, or by using UMRA to add members (see the second bullet point on this page for more information about UMRA: https://estherplomp.github.io/TNW-OS-support/posts/storage-solutions/#project-drive)

How can I retrieve data in the U: drive after I leave if needed? As long as you have a netID you can still access the U: drive. When you don't have a (guest)contract anymore I suppose you could ask others to share the data, but an easier way is to share the data that you're working on publicly in a data repository so that you yourself also have access to the data/code after you leave an institute. Due to the way that our contracts are set up the University holds the rights to data - by sharing things publicly you also grant yourself reuse rights!

As for your flag on the need to reliably store data for the long term: data repositories also help with that, since they generally store data for at least 10 years. You'll learn more about that in the upcoming self-paced materials - and you can always ask more questions about this in the next assignment or here on GitHub!

bauerjana commented 1 year ago

Hi @francescozatelli, here is my feedback on your data flow map. The classification of your data sets in the map is clear and easy to follow. You already have several levels of data storage (local computers and drive storage) and you use GitHub for the code - all these routines are exemplary. Therefore I don't have any critique. Based on my own experience with cleanroom fabrication, I also know that CAD drawings for the devices can be very sensitive and important for the fabrication workflow, especially if the layouts are still under development and therefore change iteratively. So maybe it makes sense in your case to include these files in your map as well. Overall, it's a great data flow map!

francescozatelli commented 1 year ago

Thank you both very much for your feedback! Using data repositories makes a lot of sense. The self-paced material had a couple of slides on what data to share, so that was interesting.

Indeed, CAD drawings should be part of the map. So far we have been storing them on OneDrive (like the SEM images). I will add them in the next data flow maps.

On a different note, here is my readme file.

EstherPlomp commented 1 year ago

Thanks for sharing your README and Assignment 3 @francescozatelli!

Your README looks very clear and comprehensive: well done! Some minor comments:

Don't forget to list the ORCIDs! You can set up your own ORCID, and Tom's ORCID provides a better overview of what this looks like once you have some more research outputs.
Do check if NWO wants you to cite a certain research project number in the funding acknowledgement
You can also copy some of the methodological information from the paper in the readme of the data/code

Your Assignment 3 looks mostly clear (see the comment under access/publication), and very comprehensive! Well done!

Documentation

I love how you highlight that the Python modules are 'supposed to be' documented, as well as the red flags here! Very good flags/points to pay attention to indeed!
Publishing Jupyter notebooks will take care of most if not all of the documentation - great practice!

Metadata

I would agree with you that QCoDeS is indeed as close as one can get to a more 'official' metadata standard!

File formats

.dxf and .gds are indeed proprietary. Alternatives might be .svg or .pdf. That might involve too much information/interoperability loss - so it may not be worth it.

Data Access/Publication

From your comments I'm not entirely sure which part of the data/code you'll publicly share. TU Delft expects you to share the data/code directly underlying the publications/thesis chapters. That does not include all the data indeed - so contact information in the README file would be important for that. The more data/code you can publicly share, the more effort you will save future you, because the data repository will do the preservation and management for you :)

francescozatelli commented 1 year ago

Thanks a lot for the feedback @EstherPlomp! Regarding data access/publication, I was indeed not very explicit in Assignment 3, but I have (hopefully) explained it clearly in the readme file.

I understand your point regarding data preservation and management for 'all the data' and it makes a lot of sense to me. However, I'm having a bit of a hard time understanding what exactly 'all the data' would mean from a practical point of view. I'll try to explain my problems with a practical example.

Usually, the data that we collect for a project is associated with a sample (or multiple samples). If samples are entirely dedicated to one specific project from the beginning, I think it's clear that 'all the data' would mean exactly all the datasets collected on those sample(s).

However, sometimes a single sample is used to collect datasets for multiple projects of different people. In this case we would have tens of databases, each containing hundreds of datasets. Even though the projects are somewhat independent, there is always some overlap. One project could start from some measurements taken for another project. It could also be that one project starts but goes nowhere (although it may be attempted again in the future). Some projects might be paused and continued later on. It could also be that some datasets are measured in a very "exploratory" way, so that they are not really related to any project. In general, however, all the measurements contribute to a sort of cumulative "knowledge" of the sample.

Eventually, if a project is successful it becomes a paper. In such a case, what does it mean to share 'all the data'? I suppose I could arbitrarily select a subset of datasets that are the most relevant to my project, but I can't really think of systematic criteria to make this selection (and of course this would just be 'more data', not 'all the data').

I hope I managed to explain my very specific situation in a somewhat understandable way :)

EstherPlomp commented 1 year ago

Absolutely explained in an understandable way!

I'm afraid that it will indeed come down to some arbitrary selection of the data that is most relevant to the main conclusions of the work. In principle you're encouraged to publish all of these datasets - but sometimes people are hesitant about this, because what if they want to use the data for a possible future paper? I would argue that their name is on the dataset already and they have the advantage of knowing the data well / having the equipment to do further experiments - but it is in principle possible that others then start working with the data. It might be worthwhile to have a chat with Arthur, or other people from your group, once you have one of your articles ready to publish. If you and another person agree on what makes the most sense to share, at least you have some agreement on your arbitrary data selection!

I'm sorry that this isn't a simple answer - but I'm afraid the answer to your question will differ per project and will also depend on the preferences of your supervisors/collaborators!

francescozatelli commented 1 year ago

I see, thanks a lot! I understand that it's a very specific problem, so I will discuss it with my collaborators and/or Arthur.

EstherPlomp / TNW-RDM-101