Assignment 1 Giacomo Lastrucci

giacomolastrucci commented 1 year ago

Introduction

Hi all, my name is Giacomo, I am Italian and recently started a PhD at TU Delft in Chemical Engineering. I will be working on applying AI and machine learning in the process industry. I am passionate about travelling, hiking and cooking!

Describe your research in 2-3 sentences to someone that is not from your field (please avoid abbreviations)

During my research, I will be working on an alternative modeling technique for chemical processes, placed between traditional mechanistic modeling and novel machine learning methods. Recent deep learning models are able to extract features and learn from massive amounts of data without knowing any physical law. I will try to give these models the ability to learn the physics behind the problems to increase their prediction accuracy.

My research entails the following aspects:

Research Aspect	Answer
Use/collect personal data (health data, interviews, surveys)	No
Use/collect experimental data (lab experiments, measurements with instruments)	No
Collaborate with industry	Yes
Write/develop software as the main output of the project	Yes
Use code (as in programming) for data analysis	Yes
Work with large data (images, simulation models)	Yes
Other:	N/A

Reflections on the importance of RDM videos

I completely agree with the importance of proper data management. Despite can be a little of a pain at the beginning to find the proper organizational setup, it is something that pays in the long term. Apart from "dramatic" horror stories, on a daily basis, having a proper data management setup can save you a considerable amount of time. Having an easy retrieval of data, active backups and proper storing is of primary importance in the research field.

What would you like to learn during this course?

Get the best practices for collecting experimental outcomes from my research (e.g., training runs, employed hyperparameters and network architecture), appropriately store them and retrieve them at a late stage. Also, I would like to get more into how to properly store huge amounts of large data as images.

Checklist assignments

[x] Assignment 1: creating a GitHub issue (before 12 September, 10:00)
[x] Respond to the GitHub issue #65
[x] Assignment 2: Data Flow Map 1 (share a link in this issue before 20 September 13:00). Link: Data Flow Map 1
[x] Provide feedback to at least one Assignment 2 from another participant
[x] Respond to the GitHub discussion #66
[x] Respond to the GitHub discussion #67
[x] Assignment 3: Data Flow Map 2 (share a link in this issue before 4 October 13:00). Link: Data Flow Map 2
[x] Provide feedback to at least one Assignment 3 from another participant
[x] Assignment 4: Data Management Plan (before 11 October, 13:00)
[ ] Respond to the GitHub discussion #68 if you have any questions (optional)
[x] Assignment 5: Data Flow Map 3: submit your slide (before 17 October, 17:00)

EstherPlomp commented 1 year ago

Copied from #97:

Hello everyone, please, find here my data flowmap. Feel free to leave any comments or feedback. Thank you!

Best, Giacomo

With feedback from @PJL-vandenberg:

Hey Giacomo,

Looks good! I like the division between the confidential data and the open data. Including training sets that can be public seems like a great idea for transparency. The only question I have is: "What are your plans for code documentation, to prevent bits of code that make sense to you when documented in your way, but might not to your successor?" Because it has happened to me before that somebody handed me 50 lines of code commented with "This makes the dataset from the model data", because the steps made sense to him.

giacomolastrucci commented 1 year ago

Hi @PJL-vandenberg,

Thanks for the feedback. Of course code documentation is crucial for successors, but also for yourself once havening to go through the code after a while. I plan to try to follow some basic good practices in code developing and I guess will be important also to insert comment "line-by-line" whenever needed to understand complicate steps. Thanks again!

EstherPlomp commented 1 year ago

Thanks @giacomolastrucci for sharing assignment 2 - it looks great, well done!

Thanks also @PJL-vandenberg for sharing the struggle you had to go through! Hopefully this will prevent a repeat for any successors the both of you have! The CodeRefinery materials have some pointers for documentation which may help.

as I see confidential data red flags, I just want to point out that TU Delft's GitLab may be a more secure solution for code management of that particular part of your project, since it is managed by TU Delft instead of GitHub/Microsoft.
I don't see specific file formats - file formats are an important theme for assignment 3!
I don't see any file sizes, but I assume that this can run into the TBs of data. I think the Project/U Drive is indeed the best solution for you!

eugenhu commented 1 year ago

Hi Giacomo,

It was interesting to read through your datasets in your assignment 3, thanks for sharing. Everything looks great, but here is some feedback you can consider:

You describe your training data as GB sized .csv files, would a binary format like numpy's .npy or NetCDF be easier to work with?
For documentation/metadata, maybe you've heard of Papers with Code before, but there's a lot of papers shared on there with code that you can get inspiration from for laying out documentation, sharing environments, etc.
Will you use actual Excel .xlsx files for your data dictionary? A lot of spreadsheet apps can open them but it's not as human readable as, for example, a .csv file.

EstherPlomp commented 1 year ago

Thanks for sharing assignment 3 @giacomolastrucci, and for updating the information on the file formats as well! It looks super clear and well structured so I have very little feedback, well done!

Fair enough that you'll need to discuss the data publication part with your supervisor. If you plan to share scripts/code via GitHub, do also share a snapshot of that via a data repository. GitHub is a great place to share code and to collaborate on it with others, but it doesn't have a long term preservation policy and it doesn't assign DOIs. Data Repositories like Zenodo or 4TU.ResearchData do have this, and Zenodo has a nice integration with GitHub. See here for more info: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

And I didn't see a software license preference in your assignment, but we discussed that in #66 https://github.com/EstherPlomp/TNW-RDM-101/discussions/66#discussioncomment-7158917

Thanks for your feedback and also for sharing Papers with Code @eugenhu! I'll check this out as well!

giacomolastrucci commented 1 year ago

Hi @eugenhu , thank you a lot for your comments. Especially, thanks for suggesting going through Papers with Code, I did already know it but I definitely need to check it out in more detail. Also, I will consider using binary files like .npy instead of .csv for storing datasets, even though the latter is even "more" open (for instance for people not using python). For the data dictionary, I don't have a strong opinion yet. I think both can be suitable.

giacomolastrucci commented 1 year ago

Hi @EstherPlomp, thanks a lot for your feedback.

Thanks for suggesting Zenodo, I think it can be really suitable for my purposes, especially thanks to the integration with GitHub.

EstherPlomp / TNW-RDM-101