Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

MS2extract #3398

Closed DanielQuiroz97 closed 2 weeks ago

DanielQuiroz97 commented 7 months ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 7 months ago

Hi @DanielQuiroz97

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: MS2extract
Type: Package
Title: Targeted MS/MS Spectra Extraction for In-house Compound Library Creation
Version: 0.99.0
Date: 2023-10-13
Authors@R: c(
    person("Cristian", "Quiroz-Moreno",
 email = "cristianquirozd1997@gmail.com",
 role = c("aut", "cre"),
 comment = c(ORCID = "0000-0002-9069-9147") ),
    person("Jessica", "Cooperstone",
 email = "cooperstone.1@osu.edu",
 role = c("aut"),
 comment = c(ORCID = "0000-0001-7920-0088") )
 )
Maintainer: Cristian Quiroz-Moreno <cristianquirozd1997@gmail.com>
Description: A mass spectrometry toolbox with the objective of importing mass
   spectrometry data, and extracts the desired MS/MS spectra for in-house MS/MS
   compound library creation. Subsequently, users can export the spectra 
   in a .msp file format, as a resulting compound database for further use
   in compound identification tasks. It is worth to note that this package
   do not annotate mass spectrometry data.
License: MIT + file LICENSE
LazyData: true
RdMacros: Rdpack
biocViews: MassSpectrometry, Metabolomics
Depends:
     R (>= 4.2.0)
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Encoding: UTF-8
Imports: 
    cli,
    Rdpack,
    crayon,
    dplyr,
    ggplot2,
    ggpubr,
    ggrepel,
    magrittr,
    MSnbase,
    OrgMassSpecR,
    ProtGenerics,
    purrr,
    Rdisop,
    rlang,
    readxl,
    readr,
    ggsci
Suggests:
    knitr,
    rmarkdown,
    prettydoc,
    testthat (>= 3.0.0),
    Rcpp (>= 1.0.9),
    BiocStyle,
    RefManageR,
    sessioninfo
VignetteBuilder: knitr
Config/testthat/edition: 3
BugReports: https://github.com/CooperstoneLab/MS2extract/issues
URL: https://cooperstonelab.github.io/MS2extract/
lgatto commented 5 months ago

Thank you for your submission. I have some general comments, before addressing the technical review of the package.

One important aspect we take into account when considering new packages is how they make use of existing data infrastructure. I see that you used MSnbase::readMsData() to parse the raw (mzML, mzXML) data using (by default) the in-memory backend.

  1. Are you aware of the R for Mass Spectrometry packages (RforMS), and in particular the Bioconductor Spectra package? Spectra, and the other RforMS packages represent are developed by the core MSnbase team, re-using what worked well in MSnbase, and replacing/improving what didn't work that well. MSnbase isn't going to be deprecated any time soon, but I would suggest to use Spectra::Spectra() rather than MSnbase::readMsData().

  2. I see that you are exporting data to mgf and msp. This functionality is available as part of RforMS, and more specifically the MsBackendMgf and MsBackendMsp backends, that integrate directly with Spectra.

  3. Will you approach scale to larger dataset? Using an in-memory approach and converting the data into a tibble won't work for modern shotgun data. The small example data you provide is only composed of 9 MS2 scans, and generates a tibble with 24249 rows. Have you considered using our dedicated classes that would (1) make it possible to handle larg data and (2) benefit from the existing input and output capabilities (see point 2).

  library(Spectra)
  ProcA2_file <-
      system.file("extdata",
                  "ProcyanidinA2_neg_20eV.mzXML",
                  package = "MS2extract")
  ## load the raw data, extract based on retention time and precursor
  ## MZ, ...
  Spectra(ProcA2_file) |>
      filterRt(c(163, 180)) |>
      filterPrecursorMzRange(c(575.119, 575.120)) ## |> ...
  1. As a side note, using the RforMS infrastructure will provide you with lazy and parallel evaluation for free.
lshep commented 3 months ago

@DanielQuiroz97 did you have any responses and updates to the comments above?

DanielQuiroz97 commented 3 months ago

Dear All,

Thank you so much for taking the time to go over MS2extract. Please find my comments for the presented topics:

Please let me know if these comments provide more clarification about MS2extract implementation decisions. I am more than happy to provide more information.

I hope I was able to provide the novelty and MS2extract and the potential positive impact in data democratization of MS/MS libraries.

Best, Daniel

jorainer commented 3 months ago

just chiming in to clarify - MsBackendMgf has full GNPS support. The spectraVariableMapping() allows to define how certain spectra variables from Spectra are translated/renamed to MGF data fields. This allows to create GNPS compliant MGF files.

But yes, MsBackendMgf does not support batch upload of data to GNPS.

jorainer commented 3 months ago

Talking about "in house library creation" - maybe it would also be worthwhile to have a look at the CompounDb Bioconductor package - that package allows creation, managing and use of (in house and/or public) small compound annotation databases. And it has full support for the Spectra infrastructure. - including access of the full data as a Spectra object and related export of MS2 data as MGF file(s).

I understand that you want to adhere to tidy concepts, but maybe a combination of re-using the (IMHO) powerful Spectra-based framework and tidy concepts might be preferable/more powerful? This could reduce the need to re-implement existing functionality.

lgatto commented 3 months ago

Thank you, @jorainer, for clarifying the GNPS point.

Regarding:

In contrast with the suggested packages (mzR, MSnbase) that are based on OOP, I am continuing tidy data principles first proposed in the TidyMass project (link). I am conscious that when dealing with MS data, using tidy data might not be the best choice, but. MS2extract is designed to select only one scan per analyte (reference MS/MS spectra). Therefore reducing redundancy and not keeping everything in memory.

The Bioconductor project emphasises the systematic re-use S4 classes, to promote interoperability and data consistency. This is an important request we have for new packages, and is paramount for Bioconductor packages. We feel that packages that prefer not to use these classes are probably a better fit for CRAN or, in your case, the tidymass project.

lgatto commented 2 weeks ago

Hi @DanielQuiroz97 - I will close this issue for now due to inactivity. Feel free to re-open it if needed to continue the discussion and/or if you think a contribution to the Bioconductor project is still relevant.