Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

methyLImp2 #3035

Closed annaplaksienko closed 8 months ago

annaplaksienko commented 1 year ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

bioc-issue-bot commented 1 year ago

Hi @annaplaksienko

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: methyLImp2
Title: Missing value estimation of DNA methylation data
Version: 0.99.0
Authors@R: c(
    person("Pietro", "Di Lena", 
    role = c("aut"),
    comment = c(ORCID = "0000-0002-1838-8918")),
    person("Anna", "Plaksienko", 
    email = "anna@plaxienko.com", 
    role = c("aut", "cre"),
    comment = c(ORCID = "0000-0001-9607-057X")))       
Description: This package allows to estimate missing values in DNA methylation data. methyLImp method is based on linear regression since methylation levels show a high degree of inter-sample correlation. Implementation is parallelised over chromosomes since probes on different chromosomes are usually independent. Mini-batch approach to reduce the runtime in case of large number of samples is available.
Imports: parallel, dplyr, stats
Depends: R (>= 2.10)
URL: https://github.com/annaplaksienko/methyLImp2 
License: GPL-2
Encoding: UTF-8
RoxygenNote: 7.2.3
biocViews: DNAMethylation, Microarray, Software, MethylationArray, Regression
Suggests: 
    knitr,
    rmarkdown
VignetteBuilder: knitr
vjcitn commented 1 year ago

From vignette:

 Be default, the only two things you have to provide are the data matrix and the type of the data: 450K or EPIC. But here are a few other things you can tune so that the method suits your needs best:

please see contributions.bioconductor.org -- when multiple samples are in use we need to have sample-level information and feature-level information and these are coordinated in a SummarizedExperiment or derivate class.

annaplaksienko commented 1 year ago

Hi! Sorry for the silly question, my first time. Do I understand it correctly that since my package is in pre-review and therefore not yet on git.bioconductor.org, I can't yet bump up the changes there, right? Otherwise I've fixed the issues you've mentioned.

lshep commented 1 year ago

Corret. Please push the changes to github and we will re-examine the package. Once its precheck passed we will process, add to git, and assign a reviewer for a more formal review.
Cheers!

annaplaksienko commented 1 year ago

Ok, thanks! Pushed the changes here https://github.com/annaplaksienko/methyLImp2 šŸ‘

annaplaksienko commented 1 year ago

Hi! Just wanted to check if you need more time to check or if I did something wrong and it's not obvious I'm ready for the check :))

vjcitn commented 1 year ago
A subset of GSE199057 dataset for vignette demonstration
Description
The GSE199057 Gene Expression Omnibus dataset contains 68 mucosa samples from non-colon-cancer patients, from which we randomly sampled 24. Methylation data were measured on EPIC arrays and after removal of sex chromosomes and SNPs loci, it contains 816 126 probes. We have subset only probes from two chromosomes (18 and 21) for the sake of demonstration.

Usage
data(beta)
Format
A data frame

Value
A data frame.

I see github comments about using SummarizedExperiment but no illustration of how it works. Why are you supplying 450k and EPIC annotation, those already exist in Bioconductor packages, e.g. https://bioconductor.org/packages/release/data/annotation/html/IlluminaHumanMethylationEPICanno.ilm10b2.hg19.html

annaplaksienko commented 1 year ago

Both decisions were made for the ease of the package and installation. A data frame/matrix is lighter and easier object for demonstration in the vignette. User can provide SummarizedExperiment will in exactly the same way as they provide a matrix. Same for annotation: Bioconductor package is quite heavy with many dependencies, while only the match between CpGs and chromosomes is needed. Personally, I ran into installation problems just now.

annaplaksienko commented 1 year ago

Now when I think about it, it's fair to ask for SummarizedExperiment in the vignette, I'll changed it on the weekend. But let me know how crucial is to use the package for annotation. As I said, I was aware of it and just preferred to go for a lighter option.

annaplaksienko commented 1 year ago

Hi! I've changed an example data in the vignette to SummarizedExperiment. Let me know what you think about it. Also, please let me know if it's possible to keep the annotation as the package data frame instead of adding a dependency!

annaplaksienko commented 1 year ago

Hi! I just wanted to check if you have any feedback :)

annaplaksienko commented 1 year ago

Hi! Are there any updates? Thanks!

lshep commented 1 year ago

I was away last week; I will process momentarily to get a reviewer assigned

bioc-issue-bot commented 1 year ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot commented 1 year ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/methyLImp2 to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

annaplaksienko commented 1 year ago

Hi! I got two errors: one is that I'm not subscribed to the mailing list, but I am šŸ¤” And I'm confused about the tarball size, could you please clarify that for me?

annaplaksienko commented 1 year ago

Hi! Just wanted to ask if it's possible to get a feedback on my errors :) Thanks!

lazappi commented 1 year ago

Sorry, I've had some unexpected things come up so I've been slower to reply than normal. For the mailing list please double check that you are subscribed with the same email as listed in the DESCRIPTION. You can see the tarball size by running R CMD build on the command line (or maybe something like devtools::build()). I'm not sure what the Bioconductor limit is off the top of my head but it should be in the developer documentation (probably a few MB). Please also try to address as many notes in the report as you can.

bioc-issue-bot commented 1 year ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 55312cbd8cb2e4a1d3a8965df1f2b47060e44336

bioc-issue-bot commented 1 year ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/methyLImp2 to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

annaplaksienko commented 1 year ago

Hi! I've found that each file has to be less than 5MB but nothing about the total tarball size šŸ¤” Please let me know if you find anything!

I also tried to reduce the number of notes. About unaddressed ones:

1) There are ::: calls to the package's namespace in its code. This was done to avoid a warning/note (I don' remember) about calling library at the core when parallel computing. Is there a better way to do this?

2) The recommended function length is 50 lines or less. There are 3 functions greater than 50 lines.

Technically, if you don't consider ignore documentation and comments, only one function is too long (methyLImp). Should I split that one?

3) Consider multiples of 4 spaces for line indents; 81 lines (8%) are not.

Could you please give me advice on how to find these lines? I've fixed a bunch but apparently not everything.

And for unit tests I'll need some more time if it's unavoidable :))

lazappi commented 1 year ago

Hi! I've found that each file has to be less than 5MB but nothing about the total tarball size šŸ¤” Please let me know if you find anything!

I think the overall package size should also be 5MB, the developer book says:

The source package resulting from running R CMD build should occupy less than 5 MB on disk.

See https://contributions.bioconductor.org/general.html?q=size#package-size

I will double check this though because I'm not sure it matches up with the 5MB limit on individual files.

  1. There are ::: calls to the package's namespace in its code. This was done to avoid a warning/note (I don' remember) about calling library at the core when parallel computing. Is there a better way to do this?

I think we want to avoid this if possible, I would have to check during the review to see if I have a suggestion. We strongly encourage developers to use the {BiocParallel} package for parallel processing so please look at switching to that if you aren't using it already.

  1. The recommended function length is 50 lines or less. There are 3 functions greater than 50 lines.

Technically, if you don't consider ignore documentation and comments, only one function is too long (methyLImp). Should I split that one?

I'll check during the review but it's probably fine, that's more of a guideline

  1. Consider multiples of 4 spaces for line indents; 81 lines (8%) are not.

Could you please give me advice on how to find these lines? I've fixed a bunch but apparently not everything.

This can often be the documentation comments or other cases where it makes sense to indent by a different number so probably fine to ignore when it's this low.

And for unit tests I'll need some more time if it's unavoidable :))

This is also very strongly encourages so please try to add tests if you can.

annaplaksienko commented 1 year ago

Got it, thank you! Ok, I'll work on the unit test while you'll checking other stuff!

lazappi commented 1 year ago

Hi @annaplaksienko

Thanks for submitting methyLImp2 :tada:! Below is my review of your package. Please reply here if anything is unclear or needs any further explanation. I have repeated some of the things we already discussed so they are in one place.

What next?

Please address the comments as best as you can. When you are ready for me to check the package again please reply to let me know with a summary of changes you have made or any other responses. You can use the "Quote reply" option in the ... menu on this comment to reply directly to my points below.

Luke

Review

Key: :rotating_light: Required :warning: Recommended :green_circle: Optional :question: Question

General package development

DESCRIPTION file

Package data

Documentation

README

Vignette

Man pages

Unit tests

Code

R

Third-party code

annaplaksienko commented 1 year ago

Hi! Thank you for such a detailed review! I'll get to work šŸ‘

annaplaksienko commented 1 year ago

Hi @lazappi! A question about BiocParallel: can I ask what's the motivation behind switching to it? It's not a problem to do, I've implemented it, but it increases my running time. As you can see inside the package, I apply parallelization to numeric matrices, not Bioconductor structures. So I'm just curious to know whether BiocParallel is necessary.

lazappi commented 1 year ago

I believe the motivation behind asking people to use {BiocParallel} is that it provides consistency between packages, i.e. they all have the same interface for controlling parallel processing. The other thing is that it allows the user to control exactly how parallelization is done which is important when packages will be run on different operating systems, servers etc. as some things don't work on all systems.

I'm not really an expert on the details but it shouldn't slow things down too much as it is just a wrapper around the normal parallel processing systems. Maybe the overhead is bigger than I thought though (if you are using the same underlying parallelization is both cases)?

annaplaksienko commented 1 year ago

Yep! I use socket in both cases so that it runs on Windows too.

lazappi commented 1 year ago

Just checking in to see how you are going addressing the comments and to let you know the dates for the next release are now available https://bioconductor.org/developers/release-schedule/.

annaplaksienko commented 1 year ago

Hi Luke, thanks for checking in! I've finished working on almost all of your comments but at the same time I got feedback for the accompanying paper so now I'm thinking whether I should change something in the package. I'll let you know as soon as I'm done!

annaplaksienko commented 1 year ago

Hi Luke! Just letting you know not to give up on me, I'm almost there :)

lazappi commented 1 year ago

The deadline for adding new packages is Wednesday so you would probably need an update today for me to have any chance of looking at it (and I'm not sure I can guarantee that). I don't want to put pressure on you but just making sure you are aware what the situation is.

annaplaksienko commented 1 year ago

If that's fine with you, I;m totally fine with being added to spring edition later. Would that work?

lazappi commented 1 year ago

Yeah, that's totally fine if you want to put it off until the next release.

lshep commented 1 year ago

@annaplaksienko should we close this issue for now until you are ready to push updates?

annaplaksienko commented 1 year ago

Almost there! What would closing the issue mean/what are possible drawbacks?

lshep commented 1 year ago

nothing really. except you would want the issue to be reopened before you pushed updates to trigger a build report else you would likely have to do an arbitrary version bump to trigger a build report for re-review.

lshep commented 1 year ago

but if you think it will be updated in the next two weeks we can leave it open for now?

annaplaksienko commented 1 year ago

Ok, got it! Yes, let's keep it, I'll push through!

annaplaksienko commented 11 months ago

Hi! I have to say I'm still struggling with the size of the package. I'm over the requirements because I have methylation annotation in my package. Not even detailed annotation: it's just a data frame with probes in one column and chromosomes in the second one. But since there are 850K probes and the names are characters, the file is too heavy.

As I understand, instead I'm supposed to make my package rely on IlluminaHumanMethylationEPICmanifest (and same for 450K). And I suppose people who are working with methylation data probably already have it. But if not, it take s AGES to install, if it does (I had multiple experience with errors). So I'd like to avoid that... But I need annotation.

Is it possible at all to be over the size limit for the package?

lazappi commented 11 months ago

The core people can confirm but I think the size limit is pretty strict for software packages. I'm not sure what kind of installation issues you have had but these core annotation packages should be reliably available. Apart from the size issue there are other advantages to reusing the annotation, you don't have to worry about maintaining it and your package should automatically benefit if something changes in the annotation.

lshep commented 11 months ago

I would be curious to know more about your installation issues as well as this should not be the case; I just installed in a fresh R so it had to install many dependencies too and it took 3.5 mins to install all dependencies and the illumina package.
Is it possible to use a smaller sample size in the package to be able to truncate/filter the data provided?
Another option is to host the data in AnnotationHub (in this case probably setting up your current package instead of hosting an entirely separate package for it) and/or on something like Zenodo and pull down/cache the data only when examples/vignettes are run?

annaplaksienko commented 10 months ago

Hi! The problem is not examples or vignettes but the data that is used in the main method. The input data is first split over chromosomes according to the annotation. Since the only thing needed is the match probes-chromosome, so basically two columns, it doesn't make a lot of sense to me to use whole Illumina manifest package. That one imports minfi, that has many dependencies, so that may be problematic. One can, of course, argue, that people who work with methylation data may already have minfi, so that won't be a problem, but not everyone uses that package. Therefore I thought that adding a data with those two column, probe-chromosome, is easier and more beneficial.

I am trying to rework around Illumina manifest package.

annaplaksienko commented 10 months ago

I looked into Illumina EPIC manifest package and I can't see any match between probes and chromosomes there (although it is present in the csv file on the website that I used). Maybe I'm doing something wrong?... Let me know if you can access it!

annaplaksienko commented 10 months ago

Maybe AnnotationHub is the best solution šŸ¤” Do I understand it correctly that storing the data on my Github is not good enough for making such package?

lshep commented 10 months ago

correct. we do not allow data to be stored on github. It must be in a trusted source like zenodo, S3, microsoft data lakes, etc.

annaplaksienko commented 10 months ago

Ok, thanks! Working on it!

bioc-issue-bot commented 9 months ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 7f7d4d276f757143ad340edc39cb400080681976

bioc-issue-bot commented 9 months ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): methyLImp2_0.99.4.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/methyLImp2 to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

annaplaksienko commented 9 months ago

Hi @lazappi! Finally, I'm done with the review (at least a hope so). Thank you again for really looking into details! I'm looking forward to discussing the package further.

General package development

  • [ ] šŸšØ We already discussed the package size but I have confirmed the limit for a built package is 5 MB. Please try to reduce the size of the package below this.

Fixed that! I now use annotation from an existing Bioconductor package instead of internal one and I reduced the size of the dataset for the vignette.

DESCRIPTION file

  • [ ] ā“ Is there already a {methyLImp} package? I'm wondering if an existing package should be updated instead or whether this might cause confusion for users.

Good point! We'd like to keep the first version as the separate package since it is linked in two previous papers + we use that code when we compare running time and performance of v1 and v2. However, you're totally right about possible confusion! So on {methyLImp} Github page we have added a README file with the recommendation to use {methyLImp2} instead. What do you think about it?

  • [ ] šŸŸ¢ It is recommended to add a BugReports field. This is usually a link to the GitHub issues page or the Bioconductor support forum.

Done!

  • [ ] šŸŸ¢ It is recommended to add a URL field. This is usually a link to the GitHub repository.

There is one already. Is there something wrong with it?

  • [ ] šŸŸ¢ It looks like you only import the {dplyr} package for one use of the distinct() function. You may want to consider if this can be done another to way to avoid the (rather large) dependency.

Thanks a lot, great point! It was actually easy to re-write in base R so I removed that dependency.

Package data

  • [ ] āš ļø Some of the included data is quite large which I think is why the built package is over the limit. Here are some things to consider to try and reduce the dataset size.

    • Check the data is compressed
    • Can the example dataset be reduced in size? Maybe by removing some of the samples/variables?
    • Is there an existing Bioconductor package that contains the array annotation data? If so I would recommend depending on that which would reduce the size of your package and give you one less thing to update and maintain.

Done! Reduced the example dataset in size and switched to ChAMPdata Bioconductor package for the annotation.

  • [ ] āš ļø Please provide a URL to the repisitory containing the pre-processing in the function documentation. It would also be good if you briefly described what was done.

Done! Both for the links and brief description.

Documentation

  • [ ] šŸŸ¢ There were a few typos in different places, the {spellcheck} can be used to check for these.

Thank you, I didn't know about that!

README

  • [ ] šŸšØ Please include Bioconductor installation instructions in the README

Done!

Vignette

  • [ ] šŸšØ Please use the {BiocStyle} package for formatting the vignette

Done! Question: in the vignette "author" field, do I specify author of the vignette, authors of the package or authors of the method?

  • [ ] šŸšØ Please include a table of contents in the vignette

Done!

  • [ ] šŸšØ Please include an Introduction section. This should include the motivation for inclusion in Bioconductor and a comparison to similar packages (if applicable)

Done!

  • [ ] šŸšØ Please include an Installation section. This should use Bioconductor installation instructions and the installation code should not be evaluated.

Done!

  • [ ] šŸŸ¢ It would be good to explain/show any pre-processing steps that were done to the data so that users can understand how their data should be prepared.

Done!

  • [ ] šŸŸ¢ Some discussion/demonstration of why you would want to impute methylation data would encourage users to try the package.

Done!

  • [ ] šŸŸ¢ I got a could not find function "distinct" error in the "run methyLImp" chunk. I'm not sure why this happened but maybe just keep in mind in case it happens to someone else.

I don't use that function anymore, so shouldn't be a problem.

Man pages

  • [ ] šŸŸ¢ Please check the documented return values are correct. I think I noticed at least one case where they didn't match what was actually returned.

You're right, thank you! Done.

  • [ ] šŸŸ¢ There's no need to wrap the examples in { and } (I also saw this a few times in the code)

Thanks! I think I saw someone do it with roxygen and just copied. I removed {} in the examples. In the code, however, I'd like to keep it even for one-liners because that's easier for me to debug. Hope it's ok!

Unit tests

  • [ ] šŸšØ As mentioned already, please add unit tests if possible.

Done!

Code

R

  • [ ] āš ļø Consider if any of the for loops can be replaced with lapply() or sapply(). If you are using for loops make sure that any memory is pre-allocated rather than growing the object in the loop.

Currently it seems to me that replacing any particular loops with lapply or sapply wouldn't change anything. Let me know if Iā€™m wrong!

  • [ ] šŸšØ As mentioned please use {BioParallel} for parallelisation

Done!

  • [ ] šŸŸ¢ Some suggestions about function arguments

    • Adding an assay variable would allow users to select which assay in a SummarizedExperiment to use. It would also be prefereable to add a new assay for the output rather than overwriting an existing one.

I totally agree it would be better! However, my worry is that methylation datasets are really large ā€“ at least 450K columns, some are 850, right ā€“ and adding another slot would make the object really heavy.

  • It would be good to have a variable controlling the number of removed values in generateMissingData

There is a lambda parameter that controls from which Poisson distribution to sample the number of NAs for each chosen probe. I agree it could be interesting to allow control of the number of chosen probes and maybe for overall number of NAs. But on the other hand ā€“ this is an auxilary function that we used in the simulation studies. User should in principle use only methyLImp.

  • Check the argument names are clear, for example col.list doesn't seem to match the description

Yes, thank you! That was internal use name that we never changed. I've checked the rest and it seems fine but please let me know if it's not.

  • [ ] šŸŸ¢ You may want to show how to construct the na_positions object in the vignette, I can see this being difficult for users to work out.

Added a paragraph about that!

  • [ ] ā“ Are the global variables required? I'm not sure if they are needed.

Good question! I've added this because otherwise I was getting warning (or a note?) from R CMD check about undeclared variables.

  • [ ] šŸŽ‰ I really liked your comments! They were super helpful for understanding what was going on.

Thank you so much!

Third-party code

  • [ ] šŸšØ Please check you are following the licenses from any code from other packages, particularly the requirements around using a GPL-3 license.

I think I'm ok but if you have any particular comments about this, I would greatly appreciate it since I'm not experienced in licensing.