Open di-hardt opened 2 years ago
Dear @di-hardt,
I am happy to inform you that your proposal has been selected for the DevMeeting2023! Participants will decide which hackathon to join after the pitch on Monday. Would be nice if Dominik and David could leave a short comment so they become part of this issue.
Best, Tobi
Very great news. Thanks @tobiasko
I'm looking forward to it 😃
Agreed, very good news, thanks @tobiasko !
Looking forward to it.
This sounds very interesting!
@david-bouyssie please let me know if you'd like me to yank my existing crates that might introduce confusion for your naming scheme.
https://github.com/mobiusklein/mzdata https://github.com/mobiusklein/mzpeaks https://github.com/mobiusklein/mzsignal https://github.com/mobiusklein/mzdeisotope
I do hope to get back to them some day but my current project is becoming an arcology.
mzdata may have some useful ideas for mzML reading and writing.
Hi @mobiusklein
Thank you for your message. It will indeed be very useful to discuss this point with you.
A quick update: today I presented the project at the bioinformatics hub at HUPO in Cancun. A few weeks ago, Michael Lazear (@lazear), who lives in San Diego, joined the group of project contributors. Would you like to participate in future meetings?
If so, I think the easiest way to organize things would be for you to join our EuBIC Slack workspace and then join our channel.
I can send you an invitation link if you want.
I can't promise to be a very productive participant right now, but if you can send me an invitation, I'll gladly listen in when I can.
Hello everyone,
I just created a Slack workspace for the DevMeeting and a channel named rusteomics for this hack. You should receive an invite to join by email.
Best, Tobi
Summary of the Rusteomics hackathon
The aim of Rusteomics is to build, in the Rust programming language, a collaborative, community-driven toolbox for processing mass spectrometry data efficiently. This hackathon was the ideal opportunity to kick off the project and to work on a few small tasks that could demonstrate the benefits of the technology. During the hackathon week, we developed an application that generates a spectral library (NIST MSP format) from a peak list file (MGF format) and a list of peptide-spectrum matches (PSMs, psm_utils TSV format). In effect, it reannotates each MS/MS spectrum according to the previously matched peptide sequence, as would be obtained from a proteomics search engine. For the sake of simplicity, we did not consider modified peptides, but we plan to remove this limitation in the near future.
We also benchmarked some IO components (FASTA and MGF readers) that were finalized during this hackathon. According to our preliminary results, the Rust implementations can be ~2-10 times faster than equivalent Python code. We also tried to bind the C# ThermoRawFileReader, but ran out of time before obtaining stable results.
Finally, we discussed and worked on the organizational aspects of the project to set up a framework that ensures the best conditions for collaborative development. More specifically, we set up a skeleton of software modules in dedicated repositories, configured a dedicated GitHub organization, and created project-related files (software license, code of conduct, readme files…). The foundations of the Rusteomics EuBIC-MS project have now been established and, thanks to its new group of maintainers, new features can be added soon.
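To illustrate the reannotation step described in this summary, here is a minimal sketch of how peaks might be labelled from a matched, unmodified peptide sequence. It is not the code written during the hackathon: the names `Fragment`, `fragments` and `annotate`, the absolute m/z tolerance, and the restriction to singly charged b and y ions are all assumptions made for illustration.

```rust
/// One fragment ion predicted from a peptide sequence (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Fragment {
    pub label: String, // e.g. "b2" or "y1"
    pub mz: f64,       // m/z of the singly charged ion
}

const PROTON: f64 = 1.007276; // mass of a proton, in Da
const WATER: f64 = 18.010565; // monoisotopic mass of H2O, in Da

/// Monoisotopic residue mass of an unmodified amino acid, if known.
fn residue_mass(aa: char) -> Option<f64> {
    Some(match aa {
        'G' => 57.02146, 'A' => 71.03711, 'S' => 87.03203, 'P' => 97.05276,
        'V' => 99.06841, 'T' => 101.04768, 'C' => 103.00919, 'L' => 113.08406,
        'I' => 113.08406, 'N' => 114.04293, 'D' => 115.02694, 'Q' => 128.05858,
        'K' => 128.09496, 'E' => 129.04259, 'M' => 131.04049, 'H' => 137.05891,
        'F' => 147.06841, 'R' => 156.10111, 'Y' => 163.06333, 'W' => 186.07931,
        _ => return None,
    })
}

/// Singly charged b and y ions of an unmodified peptide.
/// Returns `None` if the sequence contains an unknown residue.
pub fn fragments(sequence: &str) -> Option<Vec<Fragment>> {
    let masses: Vec<f64> = sequence.chars().map(residue_mass).collect::<Option<_>>()?;
    let n = masses.len();
    let total: f64 = masses.iter().sum();
    let mut prefix = 0.0;
    let mut out = Vec::new();
    for i in 1..n {
        prefix += masses[i - 1];
        // b_i: N-terminal prefix of i residues plus one proton.
        out.push(Fragment { label: format!("b{}", i), mz: prefix + PROTON });
        // y_(n-i): complementary C-terminal suffix plus water and one proton.
        out.push(Fragment {
            label: format!("y{}", n - i),
            mz: total - prefix + WATER + PROTON,
        });
    }
    Some(out)
}

/// Label each observed (m/z, intensity) peak with the closest predicted
/// fragment within `tol` (absolute tolerance in m/z units), if any.
pub fn annotate(peaks: &[(f64, f64)], sequence: &str, tol: f64) -> Option<Vec<Option<String>>> {
    let frags = fragments(sequence)?;
    Some(
        peaks
            .iter()
            .map(|&(mz, _intensity)| {
                frags
                    .iter()
                    .filter(|f| (f.mz - mz).abs() <= tol)
                    .min_by(|a, b| (a.mz - mz).abs().total_cmp(&(b.mz - mz).abs()))
                    .map(|f| f.label.clone())
            })
            .collect(),
    )
}
```

Calling `annotate(&peaks, "PEPTIDE", 0.02)` would return one optional label per input peak. A spectral library writer could then emit these labels next to each peak.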
Title
Rusteomics
Abstract
The proteomics community has created some exceptional toolboxes over the past years, such as OpenMS, Pyteomics and mzR. Most of these toolboxes implement general computational tasks, like reading and preprocessing data. However, most of them do not share a common code base or internal data representation, so interoperability is only possible through (PSI) standard formats. Reading and writing these formats without a common implementation may introduce another layer of errors. The aim of Rusteomics is to build a collaborative, community-driven toolbox that provides read and write access to the most common file formats, as well as low-level, well-established algorithms (deisotoping, deconvolution, MS/MS spectra annotation, etc.). While similar solutions exist in various programming languages, this project is an opportunity to tailor these new components to be highly compatible with scientific (scripting) languages like Python and R. Moreover, the reimplementation in Rust should bring some major benefits: the modern compiler and build system make Rust-based projects easier to maintain than C++-based projects, while providing the same performance. During this hackathon, we will refine the goals and organization of the project, and start developing a tool that can be used to generate spectral libraries (MSP format writer).
Project Plan
Technical Details
The base implementation of Rusteomics will be written in Rust, while the language bindings may rely on the target language.
- mzio crate
  - reader and writer modules: Contain reader and writer classes for proteomics-specific files. The readers module should contain a sub-module called vendor, which provides read support for vendor formats like the Mascot .dat format. It may be necessary to add write capabilities for some vendor formats to exchange data with the related tools.
  - entities or models module: Contains the internal representation of different data types, e.g. a spectrum or an amino acid sequence. Defining this representation is a crucial task to be able to handle mandatory and optional data for each supported format. These data structures are created by io.reader.* and should be used to create files with io.writer.*.
- mzcore crate
  - chemistry module: Implements constants and functionality for dealing with molecules, e.g. amino acid representation (name, mass, one-letter code, chemical representation, etc.), losses, and maybe atom representation.
  - function or algorithms module: Will contain all processing and analytic functions used in proteomics.
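As a rough sketch of the idea that readers produce a shared internal model which writers then consume, the following shows a spectrum type with optional fields together with a tiny MGF reader and writer. The names (`Spectrum`, `read_mgf`, `write_mgf`) and the field layout are hypothetical and do not reflect the actual mzio API. Only the TITLE, PEPMASS and CHARGE keys are handled, which is an assumption made to keep the example short.

```rust
/// Internal spectrum model shared by readers and writers (hypothetical).
/// Fields that a format may omit are modelled as `Option`.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Spectrum {
    pub title: Option<String>,
    pub precursor_mz: Option<f64>,
    pub charge: Option<i8>,
    pub peaks: Vec<(f64, f64)>, // (m/z, intensity)
}

/// Parse MGF charge notation such as "2+", "3-" or "2".
fn parse_charge(value: &str) -> Option<i8> {
    let v = value.trim();
    if let Some(n) = v.strip_suffix('+') {
        n.parse().ok()
    } else if let Some(n) = v.strip_suffix('-') {
        n.parse::<i8>().ok().map(|c| -c)
    } else {
        v.parse().ok()
    }
}

/// Read every BEGIN IONS / END IONS block of an MGF text into a `Spectrum`.
pub fn read_mgf(text: &str) -> Result<Vec<Spectrum>, String> {
    let mut spectra = Vec::new();
    let mut current: Option<Spectrum> = None;
    for (idx, raw) in text.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        if line == "BEGIN IONS" {
            current = Some(Spectrum::default());
            continue;
        }
        if line == "END IONS" {
            if let Some(s) = current.take() {
                spectra.push(s);
            }
            continue;
        }
        // Lines outside a block (global parameters) are ignored in this sketch.
        let spec = match current.as_mut() {
            Some(s) => s,
            None => continue,
        };
        if let Some((key, value)) = line.split_once('=') {
            match key {
                "TITLE" => spec.title = Some(value.to_string()),
                // PEPMASS may carry an intensity after the m/z; keep the first token.
                "PEPMASS" => {
                    spec.precursor_mz =
                        value.split_whitespace().next().and_then(|v| v.parse().ok())
                }
                "CHARGE" => spec.charge = parse_charge(value),
                _ => {} // other keys are skipped here
            }
        } else {
            let mut cols = line.split_whitespace();
            let mz = cols.next().and_then(|v| v.parse::<f64>().ok());
            let intensity = cols.next().and_then(|v| v.parse::<f64>().ok());
            match (mz, intensity) {
                (Some(mz), Some(i)) => spec.peaks.push((mz, i)),
                _ => return Err(format!("line {}: malformed peak '{}'", idx + 1, line)),
            }
        }
    }
    Ok(spectra)
}

/// Serialize spectra back to MGF text, omitting fields that are `None`.
pub fn write_mgf(spectra: &[Spectrum]) -> String {
    let mut out = String::new();
    for s in spectra {
        out.push_str("BEGIN IONS\n");
        if let Some(t) = &s.title {
            out.push_str(&format!("TITLE={}\n", t));
        }
        if let Some(mz) = s.precursor_mz {
            out.push_str(&format!("PEPMASS={}\n", mz));
        }
        if let Some(c) = s.charge {
            let sign = if c < 0 { '-' } else { '+' };
            out.push_str(&format!("CHARGE={}{}\n", c.unsigned_abs(), sign));
        }
        for (mz, i) in &s.peaks {
            out.push_str(&format!("{} {}\n", mz, i));
        }
        out.push_str("END IONS\n");
    }
    out
}
```

Because both functions go through the same `Spectrum` type, a writer for another format could reuse whatever the reader produced, and would only need to decide how to handle fields that are `None`.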
Additional crates
In addition to the proteomics-related crates, researchers from other omics fields (transcriptomics, metabolomics, ...) are welcome to contribute crates of their own to make Rusteomics truly usable for 'multi-omics' studies.
Language bindings
Each crate repository will contain several subfolders, each containing a specific language binding, e.g. for mzcore:
R-bindings
One of the most popular languages for statistical analyses. Based on rextendr.
Python-bindings
Offers support for many well-known toolkits for deep learning (Keras, PyTorch), machine learning (scikit-learn), data handling (Pandas, NumPy) and web development (Django, Flask). Based on pyo3.
CPP-bindings
Still one of the fastest languages, used in various proteomics software, e.g. OpenMS. Based on rust-bindgen or rust-diplomat (https://github.com/rust-diplomat/diplomat/).
Java-bindings
Java interoperability would make Rusteomics usable from several programming languages running on the JVM (Groovy, Clojure, Jython, Kotlin, Scala). Based on JNI, JNR or the new Foreign-Memory Access API.
C#-bindings
One of the main languages for Microsoft-based systems, used in many projects, including vendor software. Rusteomics may benefit from C# bindings as well. Bindings can be created with DNNE or netcorehost.
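For the C++ case, a binding ultimately needs a function with a C-compatible signature on the Rust side. The following is a minimal sketch of what such a function might look like. The function name `rusteomics_total_ion_current` and its behaviour are invented for illustration and are not part of any Rusteomics crate, and none of the binding generators listed above is used here.

```rust
use std::slice;

/// Sum of intensities over a peak list, exposed with the C calling convention.
/// (Hypothetical example function.)
///
/// To keep this exact symbol name in the compiled library, the function would
/// additionally carry Rust's `no_mangle` attribute (spelled
/// `#[unsafe(no_mangle)]` in the 2024 edition); it is omitted here so the
/// sketch compiles unchanged across editions.
///
/// # Safety
/// `intensities` must point to `len` readable `f64` values,
/// or may be null when `len` is 0.
pub unsafe extern "C" fn rusteomics_total_ion_current(
    intensities: *const f64,
    len: usize,
) -> f64 {
    if intensities.is_null() || len == 0 {
        return 0.0;
    }
    // SAFETY: the caller guarantees `len` readable values behind a non-null pointer.
    let values = unsafe { slice::from_raw_parts(intensities, len) };
    values.iter().sum()
}
```

On the C++ side, the matching declaration would be `extern "C" double rusteomics_total_ion_current(const double* intensities, size_t len);`. Passing a raw pointer plus a length keeps the interface free of Rust-specific types, which is why plain-data signatures like this are a common starting point for bindings.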
Contact Information