EuBIC / EuBIC2023

EuBIC 2023 developer's meeting
https://eubic-ms.org/events/2023-developers-meeting/

Rusteomics - Community driven toolbox for omic-research #10

Open di-hardt opened 2 years ago

di-hardt commented 2 years ago

Title

Rusteomics

Abstract

The proteomics community has created some exceptional toolboxes over the past years, such as OpenMS, Pyteomics, and mzR. Most of these toolboxes implement general computational tasks, like reading and preprocessing data. However, most of them do not share a code base or an internal data representation, so interoperability is only possible through (PSI) standard formats. Reading and writing these formats without a common implementation may introduce another layer of errors. The aim of Rusteomics is to build a collaborative, community-driven toolbox that provides read and write access to the most common file formats, as well as low-level, well-established algorithms (deisotoping, deconvolution, MS/MS spectra annotation, etc.). While similar solutions exist in various programming languages, this project is an opportunity to tailor these new components to be highly compatible with scientific (scripting) languages like Python and R. Moreover, the reimplementation in Rust should bring some major benefits: the modern compiler and build system make Rust-based projects easier to maintain than C++-based projects, while providing the same performance. During this hackathon, we will refine the goals and organization of the project, and start developing a tool that can be used to generate spectral libraries (MSP format writer).

Project Plan

  1. Define the short- and long-term goals of the project (0.25 day)
  2. Define the project organization (coding rules, licensing, ...) (0.25 day)
  3. Implement an MGF writer (1 day)
  4. Implement an MSP writer by extending the MGF writer with spectrum annotations. The annotation functionality will be provided by David Bouyssié. The MSP writer is a current community demand and will help the development of new search engines. (1 day)
  5. Investigate mzIdentML (.mzid) to MSP conversion

Technical Details

The base implementation of Rusteomics will be written in Rust, while the language bindings may rely on the targeted language.

mzio-crate

reader- && writer-module

Contains reader and writer classes for proteomics-specific files. The reader module should contain a sub-module called vendor, which provides read support for vendor formats like the Mascot .dat format. It may be necessary to add write capabilities for some vendor formats in order to exchange data with the related tools.

entities- or models-module

Contains the internal representation of different data types, e.g. a spectrum or an amino acid sequence. Defining this representation is a crucial task, because it must handle the mandatory and optional data of each supported format. These data structures are created by io.reader.* and should be used to create files with io.writer.*.
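To illustrate how an entity and a writer could fit together, here is a minimal sketch of a spectrum type and an MGF writer. The `Spectrum` struct, its field names, and the `write_mgf` function are assumptions made for this example and are not part of any agreed Rusteomics API. The charge is modelled as an `Option` to show how optional data could be handled.

```rust
use std::io::{self, Write};

/// Hypothetical in-memory spectrum; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Spectrum {
    pub title: String,
    pub precursor_mz: f64,
    /// Optional, because not every format records the charge state.
    pub precursor_charge: Option<u8>,
    /// (m/z, intensity) pairs.
    pub peaks: Vec<(f64, f64)>,
}

/// Writes one spectrum as an MGF `BEGIN IONS` / `END IONS` block.
/// Only positive charge states are handled in this sketch.
pub fn write_mgf<W: Write>(out: &mut W, spectrum: &Spectrum) -> io::Result<()> {
    writeln!(out, "BEGIN IONS")?;
    writeln!(out, "TITLE={}", spectrum.title)?;
    writeln!(out, "PEPMASS={}", spectrum.precursor_mz)?;
    // The CHARGE line is skipped when the charge is unknown.
    if let Some(z) = spectrum.precursor_charge {
        writeln!(out, "CHARGE={}+", z)?;
    }
    for (mz, intensity) in &spectrum.peaks {
        writeln!(out, "{} {}", mz, intensity)?;
    }
    writeln!(out, "END IONS")
}
```

Because `write_mgf` accepts any `Write` implementation, the same function can target a file, a compressed stream, or an in-memory `Vec<u8>` in tests.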

mzcore-crate

chemistry-module

This module implements constants and functionality for dealing with molecules, e.g. amino acid representation (name, mass, one-letter code, chemical representation, etc.), losses, and possibly atom representation.
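As a rough sketch of what such a chemistry module might expose, the example below defines an amino acid record and computes a peptide's monoisotopic mass. The names `AminoAcid`, `AMINO_ACIDS`, `lookup`, and `peptide_mono_mass` are hypothetical, and the table is deliberately partial (four residues) to keep the example short.

```rust
/// Monoisotopic mass of water in Da, added once per peptide for the termini.
pub const WATER_MONO_MASS: f64 = 18.010565;

/// Hypothetical amino acid record; the field names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AminoAcid {
    pub name: &'static str,
    pub one_letter_code: char,
    /// Monoisotopic residue mass in Da (amino acid minus one water).
    pub mono_mass: f64,
}

/// Partial table for illustration; a real module would cover all residues.
pub static AMINO_ACIDS: [AminoAcid; 4] = [
    AminoAcid { name: "Glycine", one_letter_code: 'G', mono_mass: 57.02146 },
    AminoAcid { name: "Alanine", one_letter_code: 'A', mono_mass: 71.03711 },
    AminoAcid { name: "Serine", one_letter_code: 'S', mono_mass: 87.03203 },
    AminoAcid { name: "Lysine", one_letter_code: 'K', mono_mass: 128.09496 },
];

/// Looks up an amino acid by its one-letter code.
pub fn lookup(code: char) -> Option<&'static AminoAcid> {
    AMINO_ACIDS.iter().find(|aa| aa.one_letter_code == code)
}

/// Monoisotopic mass of an unmodified peptide, or `None` if the sequence
/// contains a code that is missing from the table.
pub fn peptide_mono_mass(sequence: &str) -> Option<f64> {
    let mut mass = WATER_MONO_MASS;
    for code in sequence.chars() {
        mass += lookup(code)?.mono_mass;
    }
    Some(mass)
}
```

Returning `Option` rather than panicking lets callers decide how to treat unknown or modified residues, which this sketch does not model.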

function- or algorithms-module

This module will contain all processing and analytic functions used in proteomics.
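One building block of MS/MS spectrum annotation is matching theoretical fragment m/z values against observed peaks within a mass tolerance. The sketch below shows one possible approach using a ppm tolerance and a binary search; `PeakMatch`, `ppm_delta`, and `match_peaks` are hypothetical names, and the function assumes that the observed m/z values are sorted in ascending order.

```rust
/// Hypothetical annotation result for one theoretical fragment.
#[derive(Debug, Clone, PartialEq)]
pub struct PeakMatch {
    pub theoretical_mz: f64,
    /// Index of the matched peak in the observed m/z slice.
    pub observed_index: usize,
    /// Signed mass error in parts per million.
    pub ppm_error: f64,
}

/// Signed relative difference between observed and theoretical m/z, in ppm.
pub fn ppm_delta(observed: f64, theoretical: f64) -> f64 {
    (observed - theoretical) / theoretical * 1e6
}

/// For each theoretical m/z, keeps the nearest observed peak if it lies
/// within `tolerance_ppm`. `observed_mz` must be sorted in ascending order.
pub fn match_peaks(
    theoretical_mz: &[f64],
    observed_mz: &[f64],
    tolerance_ppm: f64,
) -> Vec<PeakMatch> {
    let mut matches = Vec::new();
    for &theo in theoretical_mz {
        // Index of the first observed peak that is not below `theo`.
        let pos = observed_mz.partition_point(|&mz| mz < theo);
        // The nearest peak is either just below (pos - 1) or at `pos`.
        let mut best: Option<(usize, f64)> = None;
        for idx in [pos.wrapping_sub(1), pos] {
            if let Some(&obs) = observed_mz.get(idx) {
                let err = ppm_delta(obs, theo);
                let closer = best.map_or(true, |(_, e)| err.abs() < e.abs());
                if err.abs() <= tolerance_ppm && closer {
                    best = Some((idx, err));
                }
            }
        }
        if let Some((observed_index, ppm_error)) = best {
            matches.push(PeakMatch { theoretical_mz: theo, observed_index, ppm_error });
        }
    }
    matches
}
```

The binary search keeps the cost at O(log n) per theoretical fragment. This sketch ignores intensity and allows several fragments to match the same peak.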

Additional crates

In addition to the proteomics-related crates, researchers from other omics fields (transcriptomics, metabolomics, ...) are welcome to contribute crates of their own to make Rusteomics truly usable for 'multi-omics' studies.

Language bindings

Each crate repository will contain several subfolders, each containing a specific language binding, e.g. for mzcore:

mzcore
|- mzcore-rs        (the rust implementation)
|- mzcore-r         (R-binding)
|- mzcore-python    (Python-binding)
|- ...

R-bindings

One of the most popular languages for statistical analysis. Based on rextendr.

Python-bindings

Offers support for many well-known toolkits for deep learning (Keras, PyTorch), machine learning (scikit-learn), data handling (pandas, NumPy) and web development (Django, Flask). Based on PyO3.

CPP-bindings

Still one of the fastest languages, used in various proteomics software, e.g. OpenMS. Based on rust-bindgen or rust-diplomat (https://github.com/rust-diplomat/diplomat/).

Java-bindings

Java interoperability will make Rusteomics usable from several programming languages running on the JVM (Groovy, Clojure, Jython, Kotlin, Scala). Based on JNI, JNR or the new Foreign-Memory Access API.

C#-bindings

One of the main languages for Microsoft-based systems, used in many projects including vendor software. Rusteomics may therefore benefit from C# bindings as well. Bindings can be created with DNNE or netcorehost.

Contact Information

tobiasko commented 2 years ago

Dear @di-hardt,

I am happy to inform you that your proposal has been selected for the DevMeeting2023! Participants will decide which hackathon to join after the pitch on Monday. Would be nice if Dominik and David could leave a short comment so they become part of this issue.

Best, Tobi

david-bouyssie commented 2 years ago

Great news. Thanks @tobiasko

I'm looking forward to it 😃

di-hardt commented 2 years ago

Agreed, very good news, thanks @tobiasko !

Looking forward to it.

lazear commented 2 years ago

This sounds very interesting!

mobiusklein commented 1 year ago

@david-bouyssie please let me know if you'd like me to yank my existing crates that might introduce confusion for your naming scheme.

https://github.com/mobiusklein/mzdata https://github.com/mobiusklein/mzpeaks https://github.com/mobiusklein/mzsignal https://github.com/mobiusklein/mzdeisotope

I do hope to get back to them some day but my current project is becoming an arcology.

mzdata may have some useful ideas for mzML reading and writing.

david-bouyssie commented 1 year ago

Hi @mobiusklein

Thank you for your message. It will indeed be very useful to discuss this point with you.

A quick update: today I presented the project at HUPO in Cancun, at the bioinformatics hub. Some weeks ago Michael Lazear (@lazear), who lives in San Diego, joined the group of project contributors. Would you like to participate in the future meetings?

If so, I think the easiest way to organize things would be for you to join our EuBIC Slack workspace and then join our channel.

I can send you an invitation link if you want.

mobiusklein commented 1 year ago

I can't promise to be a very productive participant right now, but if you can send me an invitation, I'll gladly listen in when I can.

tobiasko commented 1 year ago

Hello everyone,

I just created a Slack workspace for the DevMeeting and a channel named rusteomics for this hack. You should receive an invite to join by email.

Best, Tobi

david-bouyssie commented 1 year ago

Summary of the Rusteomics hackathon

The aim of Rusteomics is to build, in the Rust programming language, a collaborative and community-driven toolbox to efficiently process mass spectrometry data. This hackathon was the ideal opportunity to kick off the project and to work on a few small tasks that could show the benefit of the technology. During the hackathon week, we developed an application that can generate a spectral library (NIST MSP format) from a given peak list file (MGF format) and a given list of peptide-spectrum matches (PSMs, in psm_utils TSV format). It reannotates each MS/MS spectrum according to the previously matched peptide sequence (as it would be obtained from a proteomics search engine). For the sake of simplicity, we did not consider modified peptides, but we plan to remove this limitation in the near future. We also benchmarked some IO components (FASTA and MGF readers) that were finalized during this hackathon. According to our preliminary results, the Rust implementations can be ~2-10 times faster than equivalent Python code. We also tried to bind the C# ThermoRawFileReader, but ran out of time before obtaining stable results. Finally, we discussed and worked on the organizational aspects of the project to set up a framework that ensures the best conditions for collaborative development. More specifically, we set up a skeleton of software modules in dedicated repositories, configured a dedicated GitHub organization, and created project-related files (software license, code of conduct, readme files, ...). The foundations of the Rusteomics EuBIC-MS project have now been established, and thanks to its new group of maintainers, new features can be added soon.