EuBIC / EuBIC2020


Batch XIC and spectra extraction in ThermoRawFileParser #7

Open caetera opened 5 years ago

caetera commented 5 years ago

Abstract

ThermoRawFileParser is open-source, cross-platform software that converts raw files from Thermo MS instruments to open data formats. Common open MS data formats are either "heavy" (XML-based formats, such as mzML) or too simple to include all necessary metadata (text-based formats, such as MGF). There is a need for a more "light-weight" data representation that can be used in web services and applications. Moreover, it is often necessary to obtain specific information from a data file, such as a set of eXtracted Ion Chromatograms (XICs) or spectra with certain properties, rather than converting the file completely. This project aims to address both needs by developing a tool for batch retrieval of XICs and spectra in JSON format using the existing codebase of ThermoRawFileParser.

Work plan

1. Group discussion/brain-storming

The following issues have to be discussed; however, the list can be extended during the discussion. The brain-storming can begin before the meeting.

As a result, we should come to a detailed specification of input and output and key design concepts for the tool to be developed.

2. Drafting the roadmap of development

We should start by reviewing the existing codebase of ThermoRawFileParser. Then, using the specification developed in stage 1, we should draft a roadmap of the features to be implemented, ordered from the most important (easy to implement) to the least important (complicated). For example, we will first focus on batch retrieval of m/z-based XICs, with a future plan to include chemical-formula-based ones.
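As an illustration of what an m/z-based XIC amounts to, here is a minimal sketch in Python. It assumes spectra are already available in memory as `(retention_time, mz_values, intensities)` tuples and that the tolerance is given in ppm. The function name `extract_xic`, the scan layout, and the output keys are hypothetical and are not part of ThermoRawFileParser.

```python
from typing import List, Sequence, Tuple

# Assumed in-memory layout of one scan: (retention time in minutes, m/z values, intensities).
Scan = Tuple[float, Sequence[float], Sequence[float]]


def extract_xic(scans: Sequence[Scan], target_mz: float, tol_ppm: float) -> dict:
    """Sum the intensity of all peaks within +/- tol_ppm of target_mz, scan by scan."""
    delta = target_mz * tol_ppm / 1e6          # ppm tolerance converted to an absolute m/z window
    lo, hi = target_mz - delta, target_mz + delta
    times: List[float] = []
    intensities: List[float] = []
    for rt, mzs, ints in scans:
        total = sum((i for mz, i in zip(mzs, ints) if lo <= mz <= hi), 0.0)
        times.append(rt)
        intensities.append(total)              # 0.0 when no peak falls inside the window
    # Hypothetical output keys, chosen only for this example.
    return {"mz": target_mz, "tolerance_ppm": tol_ppm,
            "times": times, "intensities": intensities}


# Toy data, made up for the example.
scans = [
    (0.10, [400.2000, 500.2500, 600.3000], [10.0, 200.0, 5.0]),
    (0.20, [500.2510, 500.3000], [350.0, 40.0]),
    (0.30, [400.2000], [12.0]),
]
xic = extract_xic(scans, target_mz=500.25, tol_ppm=10.0)
```

Summing the intensities inside the window is only one possible choice; taking the most intense peak per scan is another, and which one to use is exactly the kind of decision the specification should settle.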

3. Building a working prototype

We will start with a prototype tool that implements the most important features. Depending on the number of participants, the work can be done in parallel, with small groups focusing on isolated tasks such as input parsing, output formatting, XIC creation, and spectra filtering and retrieval.
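To make the "input parsing" task more concrete, the sketch below parses and validates a batch of XIC queries given as a JSON array. The input format is purely an assumption for discussion: the field names (`mz`, `tolerance_ppm`, `rt_start`, `rt_end`), the `XicQuery` class, and `parse_queries` are hypothetical, since the actual specification is meant to come out of stage 1.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class XicQuery:
    """One requested XIC; all field names are assumptions, not an agreed format."""
    mz: float
    tolerance_ppm: float
    rt_start: Optional[float] = None   # optional retention-time window, in minutes
    rt_end: Optional[float] = None


def parse_queries(text: str) -> List[XicQuery]:
    """Parse a JSON array of query objects and reject malformed entries early."""
    raw = json.loads(text)
    if not isinstance(raw, list):
        raise ValueError("expected a JSON array of query objects")
    queries: List[XicQuery] = []
    for idx, item in enumerate(raw):
        try:
            query = XicQuery(
                mz=float(item["mz"]),
                tolerance_ppm=float(item["tolerance_ppm"]),
                rt_start=item.get("rt_start"),
                rt_end=item.get("rt_end"),
            )
        except (KeyError, TypeError, ValueError, AttributeError) as exc:
            raise ValueError(f"query {idx} is invalid: {exc!r}") from exc
        if query.mz <= 0 or query.tolerance_ppm <= 0:
            raise ValueError(f"query {idx}: mz and tolerance_ppm must be positive")
        if (query.rt_start is not None and query.rt_end is not None
                and query.rt_start > query.rt_end):
            raise ValueError(f"query {idx}: rt_start must not exceed rt_end")
        queries.append(query)
    return queries


sample = """
[
  {"mz": 500.25, "tolerance_ppm": 10},
  {"mz": 622.03, "tolerance_ppm": 5, "rt_start": 10.0, "rt_end": 20.0}
]
"""
queries = parse_queries(sample)
```

Keeping parsing and validation in one isolated function is what would let a small group work on it independently of XIC creation and output formatting.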

4. Improving the prototype

Depending on the available time/resources, we will continue adding new features to the tool according to the roadmap. Additionally, we have to agree on how to continue with the development after the end of the hackathon.

5. (Bonus) Working on JSON representation of mass spectral data

If time allows, we can discuss and draft a format for the JSON representation of a complete raw file. Part of this (the representation of mass spectra) should already be discussed during stage 1; we should build on it to formulate a draft of a JSON-based MS data format. It is unlikely that we will be able to provide a complete specification during the hackathon; however, we can present it as a draft open for public discussion.
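As a starting point for that discussion, a single spectrum could be serialized roughly as in the sketch below. Every field name here (`scan_number`, `ms_level`, `retention_time_min`, `mz`, `intensity`, `precursor_mz`) is an assumption made up for illustration; it is not a proposed standard and not the output of any existing tool.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Spectrum:
    """Hypothetical minimal spectrum record for discussing a JSON layout."""
    scan_number: int
    ms_level: int
    retention_time_min: float
    mz: List[float]
    intensity: List[float]
    precursor_mz: Optional[float] = None   # only meaningful for MS2 and higher


def spectrum_to_json(spectrum: Spectrum) -> str:
    """Serialize one spectrum to a compact JSON object."""
    if len(spectrum.mz) != len(spectrum.intensity):
        raise ValueError("mz and intensity arrays must have the same length")
    # Compact separators keep the payload small, which matters for web services.
    return json.dumps(asdict(spectrum), separators=(",", ":"))


spec = Spectrum(scan_number=1234, ms_level=2, retention_time_min=35.2,
                mz=[175.119, 500.25], intensity=[1200.0, 830.5],
                precursor_mz=622.03)
encoded = spectrum_to_json(spec)
decoded = json.loads(encoded)
```

Storing peaks as two parallel arrays rather than a list of `[mz, intensity]` pairs is one design option among several; parallel arrays tend to be more compact, while pairs keep each peak together.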

The results of the hackathon, i.e. the specifications from stage 1, the roadmap from stage 2, and the code from stages 3 and 4, will be published on GitHub, possibly as a separate branch inside the ThermoRawFileParser repository.

The draft of the JSON-based MS data format should be published as a separate repository, open for comments and suggestions.

Technical details

Contact information

Vladimir Gorshkov, University of Southern Denmark, vgor(at)bmb.sdu.dk

Niels Hulstaert, Ghent University, niels.hulstaert(at)ugent.vib.be

Yasset Perez-Riverol, EMBL-EBI, ypriverol(at)gmail.com

cpanse commented 5 years ago

We (@rolivella and I) developed a simple prototype that performs this task during a core4life micro hackathon (about 4 h). The idea was to have proof-of-concept code for feeding XICs into the http://qcloud2.crg.eu/ system.

https://github.com/coreforlife/c4lProteomics/tree/master/RawFileReader-XIC-json

caetera commented 5 years ago

Hi @cpanse, thank you for letting us know about your prototype. We will look into your code for inspiration.