Closed annaplaksienko closed 8 months ago
Hi @annaplaksienko
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: methyLImp2
Title: Missing value estimation of DNA methylation data
Version: 0.99.0
Authors@R: c(
person("Pietro", "Di Lena",
role = c("aut"),
comment = c(ORCID = "0000-0002-1838-8918")),
person("Anna", "Plaksienko",
email = "anna@plaxienko.com",
role = c("aut", "cre"),
comment = c(ORCID = "0000-0001-9607-057X")))
Description: This package allows to estimate missing values in DNA methylation data. methyLImp method is based on linear regression since methylation levels show a high degree of inter-sample correlation. Implementation is parallelised over chromosomes since probes on different chromosomes are usually independent. Mini-batch approach to reduce the runtime in case of large number of samples is available.
Imports: parallel, dplyr, stats
Depends: R (>= 2.10)
URL: https://github.com/annaplaksienko/methyLImp2
License: GPL-2
Encoding: UTF-8
RoxygenNote: 7.2.3
biocViews: DNAMethylation, Microarray, Software, MethylationArray, Regression
Suggests:
knitr,
rmarkdown
VignetteBuilder: knitr
From vignette:
Be default, the only two things you have to provide are the data matrix and the type of the data: 450K or EPIC. But here are a few other things you can tune so that the method suits your needs best:
please see contributions.bioconductor.org -- when multiple samples are in use we need to have sample-level information and feature-level information and these are coordinated in a SummarizedExperiment or derivate class.
Hi! Sorry for the silly question, my first time. Do I understand it correctly that since my package is in pre-review and therefore not yet on git.bioconductor.org, I can't yet bump up the changes there, right? Otherwise I've fixed the issues you've mentioned.
Corret. Please push the changes to github and we will re-examine the package. Once its precheck passed
we will process, add to git, and assign a reviewer for a more formal review.
Cheers!
Ok, thanks! Pushed the changes here https://github.com/annaplaksienko/methyLImp2 š
Hi! Just wanted to check if you need more time to check or if I did something wrong and it's not obvious I'm ready for the check :))
A subset of GSE199057 dataset for vignette demonstration
Description
The GSE199057 Gene Expression Omnibus dataset contains 68 mucosa samples from non-colon-cancer patients, from which we randomly sampled 24. Methylation data were measured on EPIC arrays and after removal of sex chromosomes and SNPs loci, it contains 816 126 probes. We have subset only probes from two chromosomes (18 and 21) for the sake of demonstration.
Usage
data(beta)
Format
A data frame
Value
A data frame.
I see github comments about using SummarizedExperiment but no illustration of how it works. Why are you supplying 450k and EPIC annotation, those already exist in Bioconductor packages, e.g. https://bioconductor.org/packages/release/data/annotation/html/IlluminaHumanMethylationEPICanno.ilm10b2.hg19.html
Both decisions were made for the ease of the package and installation. A data frame/matrix is lighter and easier object for demonstration in the vignette. User can provide SummarizedExperiment will in exactly the same way as they provide a matrix. Same for annotation: Bioconductor package is quite heavy with many dependencies, while only the match between CpGs and chromosomes is needed. Personally, I ran into installation problems just now.
Now when I think about it, it's fair to ask for SummarizedExperiment in the vignette, I'll changed it on the weekend. But let me know how crucial is to use the package for annotation. As I said, I was aware of it and just preferred to go for a lighter option.
Hi! I've changed an example data in the vignette to SummarizedExperiment. Let me know what you think about it. Also, please let me know if it's possible to keep the annotation as the package data frame instead of adding a dependency!
Hi! I just wanted to check if you have any feedback :)
Hi! Are there any updates? Thanks!
I was away last week; I will process momentarily to get a reviewer assigned
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.
Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/methyLImp2
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Hi! I got two errors: one is that I'm not subscribed to the mailing list, but I am š¤ And I'm confused about the tarball size, could you please clarify that for me?
Hi! Just wanted to ask if it's possible to get a feedback on my errors :) Thanks!
Sorry, I've had some unexpected things come up so I've been slower to reply than normal. For the mailing list please double check that you are subscribed with the same email as listed in the DESCRIPTION
. You can see the tarball size by running R CMD build
on the command line (or maybe something like devtools::build()
). I'm not sure what the Bioconductor limit is off the top of my head but it should be in the developer documentation (probably a few MB). Please also try to address as many notes in the report as you can.
Received a valid push on git.bioconductor.org; starting a build for commit id: 55312cbd8cb2e4a1d3a8965df1f2b47060e44336
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/methyLImp2
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Hi! I've found that each file has to be less than 5MB but nothing about the total tarball size š¤ Please let me know if you find anything!
I also tried to reduce the number of notes. About unaddressed ones:
1) There are ::: calls to the package's namespace in its code. This was done to avoid a warning/note (I don' remember) about calling library at the core when parallel computing. Is there a better way to do this?
2) The recommended function length is 50 lines or less. There are 3 functions greater than 50 lines.
Technically, if you don't consider ignore documentation and comments, only one function is too long (methyLImp). Should I split that one?
3) Consider multiples of 4 spaces for line indents; 81 lines (8%) are not.
Could you please give me advice on how to find these lines? I've fixed a bunch but apparently not everything.
And for unit tests I'll need some more time if it's unavoidable :))
Hi! I've found that each file has to be less than 5MB but nothing about the total tarball size š¤ Please let me know if you find anything!
I think the overall package size should also be 5MB, the developer book says:
The source package resulting from running R CMD build should occupy less than 5 MB on disk.
See https://contributions.bioconductor.org/general.html?q=size#package-size
I will double check this though because I'm not sure it matches up with the 5MB limit on individual files.
- There are ::: calls to the package's namespace in its code. This was done to avoid a warning/note (I don' remember) about calling library at the core when parallel computing. Is there a better way to do this?
I think we want to avoid this if possible, I would have to check during the review to see if I have a suggestion. We strongly encourage developers to use the {BiocParallel} package for parallel processing so please look at switching to that if you aren't using it already.
- The recommended function length is 50 lines or less. There are 3 functions greater than 50 lines.
Technically, if you don't consider ignore documentation and comments, only one function is too long (methyLImp). Should I split that one?
I'll check during the review but it's probably fine, that's more of a guideline
- Consider multiples of 4 spaces for line indents; 81 lines (8%) are not.
Could you please give me advice on how to find these lines? I've fixed a bunch but apparently not everything.
This can often be the documentation comments or other cases where it makes sense to indent by a different number so probably fine to ignore when it's this low.
And for unit tests I'll need some more time if it's unavoidable :))
This is also very strongly encourages so please try to add tests if you can.
Got it, thank you! Ok, I'll work on the unit test while you'll checking other stuff!
Hi @annaplaksienko
Thanks for submitting methyLImp2 :tada:! Below is my review of your package. Please reply here if anything is unclear or needs any further explanation. I have repeated some of the things we already discussed so they are in one place.
What next?
Please address the comments as best as you can. When you are ready for me to check the package again please reply to let me know with a summary of changes you have made or any other responses. You can use the "Quote reply" option in the ...
menu on this comment to reply directly to my points below.
Luke
Key: :rotating_light: Required :warning: Recommended :green_circle: Optional :question: Question
BugReports
field. This is usually a link to the GitHub issues page or the Bioconductor support forum.URL
field. This is usually a link to the GitHub repository.distinct()
function. You may want to consider if this can be done another to way to avoid the (rather large) dependency.could not find function "distinct"
error in the "run methyLImp" chunk. I'm not sure why this happened but maybe just keep in mind in case it happens to someone else.{
and }
(I also saw this a few times in the code)for
loops can be replaced with lapply()
or sapply()
. If you are using for
loops make sure that any memory is pre-allocated rather than growing the object in the loop.SummarizedExperiment
to use. It would also be prefereable to add a new assay for the output rather than overwriting an existing one.generateMissingData
col.list
doesn't seem to match the descriptionna_positions
object in the vignette, I can see this being difficult for users to work out.Hi! Thank you for such a detailed review! I'll get to work š
Hi @lazappi! A question about BiocParallel: can I ask what's the motivation behind switching to it? It's not a problem to do, I've implemented it, but it increases my running time. As you can see inside the package, I apply parallelization to numeric matrices, not Bioconductor structures. So I'm just curious to know whether BiocParallel is necessary.
I believe the motivation behind asking people to use {BiocParallel} is that it provides consistency between packages, i.e. they all have the same interface for controlling parallel processing. The other thing is that it allows the user to control exactly how parallelization is done which is important when packages will be run on different operating systems, servers etc. as some things don't work on all systems.
I'm not really an expert on the details but it shouldn't slow things down too much as it is just a wrapper around the normal parallel processing systems. Maybe the overhead is bigger than I thought though (if you are using the same underlying parallelization is both cases)?
Yep! I use socket in both cases so that it runs on Windows too.
Just checking in to see how you are going addressing the comments and to let you know the dates for the next release are now available https://bioconductor.org/developers/release-schedule/.
Hi Luke, thanks for checking in! I've finished working on almost all of your comments but at the same time I got feedback for the accompanying paper so now I'm thinking whether I should change something in the package. I'll let you know as soon as I'm done!
Hi Luke! Just letting you know not to give up on me, I'm almost there :)
The deadline for adding new packages is Wednesday so you would probably need an update today for me to have any chance of looking at it (and I'm not sure I can guarantee that). I don't want to put pressure on you but just making sure you are aware what the situation is.
If that's fine with you, I;m totally fine with being added to spring edition later. Would that work?
Yeah, that's totally fine if you want to put it off until the next release.
@annaplaksienko should we close this issue for now until you are ready to push updates?
Almost there! What would closing the issue mean/what are possible drawbacks?
nothing really. except you would want the issue to be reopened before you pushed updates to trigger a build report else you would likely have to do an arbitrary version bump to trigger a build report for re-review.
but if you think it will be updated in the next two weeks we can leave it open for now?
Ok, got it! Yes, let's keep it, I'll push through!
Hi! I have to say I'm still struggling with the size of the package. I'm over the requirements because I have methylation annotation in my package. Not even detailed annotation: it's just a data frame with probes in one column and chromosomes in the second one. But since there are 850K probes and the names are characters, the file is too heavy.
As I understand, instead I'm supposed to make my package rely on IlluminaHumanMethylationEPICmanifest (and same for 450K). And I suppose people who are working with methylation data probably already have it. But if not, it take s AGES to install, if it does (I had multiple experience with errors). So I'd like to avoid that... But I need annotation.
Is it possible at all to be over the size limit for the package?
The core people can confirm but I think the size limit is pretty strict for software packages. I'm not sure what kind of installation issues you have had but these core annotation packages should be reliably available. Apart from the size issue there are other advantages to reusing the annotation, you don't have to worry about maintaining it and your package should automatically benefit if something changes in the annotation.
I would be curious to know more about your installation issues as well as this should not be the case; I just installed in a fresh R so it had to install many dependencies too and it took 3.5 mins to install all dependencies and the illumina package.
Is it possible to use a smaller sample size in the package to be able to truncate/filter the data provided?
Another option is to host the data in AnnotationHub (in this case probably setting up your current package instead of hosting an entirely separate package for it) and/or on something like Zenodo and pull down/cache the data only when examples/vignettes are run?
Hi! The problem is not examples or vignettes but the data that is used in the main method. The input data is first split over chromosomes according to the annotation. Since the only thing needed is the match probes-chromosome, so basically two columns, it doesn't make a lot of sense to me to use whole Illumina manifest package. That one imports minfi, that has many dependencies, so that may be problematic. One can, of course, argue, that people who work with methylation data may already have minfi, so that won't be a problem, but not everyone uses that package. Therefore I thought that adding a data with those two column, probe-chromosome, is easier and more beneficial.
I am trying to rework around Illumina manifest package.
I looked into Illumina EPIC manifest package and I can't see any match between probes and chromosomes there (although it is present in the csv file on the website that I used). Maybe I'm doing something wrong?... Let me know if you can access it!
Maybe AnnotationHub is the best solution š¤ Do I understand it correctly that storing the data on my Github is not good enough for making such package?
correct. we do not allow data to be stored on github. It must be in a trusted source like zenodo, S3, microsoft data lakes, etc.
Ok, thanks! Working on it!
Received a valid push on git.bioconductor.org; starting a build for commit id: 7f7d4d276f757143ad340edc39cb400080681976
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on the Bioconductor Single Package Builder.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): methyLImp2_0.99.4.tar.gz
Links above active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/methyLImp2
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
Hi @lazappi! Finally, I'm done with the review (at least a hope so). Thank you again for really looking into details! I'm looking forward to discussing the package further.
General package development
- [ ] šØ We already discussed the package size but I have confirmed the limit for a built package is 5 MB. Please try to reduce the size of the package below this.
Fixed that! I now use annotation from an existing Bioconductor package instead of internal one and I reduced the size of the dataset for the vignette.
DESCRIPTION file
- [ ] ā Is there already a {methyLImp} package? I'm wondering if an existing package should be updated instead or whether this might cause confusion for users.
Good point! We'd like to keep the first version as the separate package since it is linked in two previous papers + we use that code when we compare running time and performance of v1 and v2. However, you're totally right about possible confusion! So on {methyLImp} Github page we have added a README file with the recommendation to use {methyLImp2} instead. What do you think about it?
- [ ] š¢ It is recommended to add a
BugReports
field. This is usually a link to the GitHub issues page or the Bioconductor support forum.
Done!
- [ ] š¢ It is recommended to add a
URL
field. This is usually a link to the GitHub repository.
There is one already. Is there something wrong with it?
- [ ] š¢ It looks like you only import the {dplyr} package for one use of the
distinct()
function. You may want to consider if this can be done another to way to avoid the (rather large) dependency.
Thanks a lot, great point! It was actually easy to re-write in base R so I removed that dependency.
Package data
[ ] ā ļø Some of the included data is quite large which I think is why the built package is over the limit. Here are some things to consider to try and reduce the dataset size.
- Check the data is compressed
- Can the example dataset be reduced in size? Maybe by removing some of the samples/variables?
- Is there an existing Bioconductor package that contains the array annotation data? If so I would recommend depending on that which would reduce the size of your package and give you one less thing to update and maintain.
Done! Reduced the example dataset in size and switched to ChAMPdata Bioconductor package for the annotation.
- [ ] ā ļø Please provide a URL to the repisitory containing the pre-processing in the function documentation. It would also be good if you briefly described what was done.
Done! Both for the links and brief description.
Documentation
- [ ] š¢ There were a few typos in different places, the {spellcheck} can be used to check for these.
Thank you, I didn't know about that!
README
- [ ] šØ Please include Bioconductor installation instructions in the README
Done!
Vignette
- [ ] šØ Please use the {BiocStyle} package for formatting the vignette
Done! Question: in the vignette "author" field, do I specify author of the vignette, authors of the package or authors of the method?
- [ ] šØ Please include a table of contents in the vignette
Done!
- [ ] šØ Please include an Introduction section. This should include the motivation for inclusion in Bioconductor and a comparison to similar packages (if applicable)
Done!
- [ ] šØ Please include an Installation section. This should use Bioconductor installation instructions and the installation code should not be evaluated.
Done!
- [ ] š¢ It would be good to explain/show any pre-processing steps that were done to the data so that users can understand how their data should be prepared.
Done!
- [ ] š¢ Some discussion/demonstration of why you would want to impute methylation data would encourage users to try the package.
Done!
- [ ] š¢ I got a
could not find function "distinct"
error in the "run methyLImp" chunk. I'm not sure why this happened but maybe just keep in mind in case it happens to someone else.
I don't use that function anymore, so shouldn't be a problem.
Man pages
- [ ] š¢ Please check the documented return values are correct. I think I noticed at least one case where they didn't match what was actually returned.
You're right, thank you! Done.
- [ ] š¢ There's no need to wrap the examples in
{
and}
(I also saw this a few times in the code)
Thanks! I think I saw someone do it with roxygen and just copied. I removed {} in the examples. In the code, however, I'd like to keep it even for one-liners because that's easier for me to debug. Hope it's ok!
Unit tests
- [ ] šØ As mentioned already, please add unit tests if possible.
Done!
Code
R
- [ ] ā ļø Consider if any of the
for
loops can be replaced withlapply()
orsapply()
. If you are usingfor
loops make sure that any memory is pre-allocated rather than growing the object in the loop.
Currently it seems to me that replacing any particular loops with lapply or sapply wouldn't change anything. Let me know if Iām wrong!
- [ ] šØ As mentioned please use {BioParallel} for parallelisation
Done!
[ ] š¢ Some suggestions about function arguments
- Adding an assay variable would allow users to select which assay in a
SummarizedExperiment
to use. It would also be prefereable to add a new assay for the output rather than overwriting an existing one.
I totally agree it would be better! However, my worry is that methylation datasets are really large ā at least 450K columns, some are 850, right ā and adding another slot would make the object really heavy.
- It would be good to have a variable controlling the number of removed values in
generateMissingData
There is a lambda parameter that controls from which Poisson distribution to sample the number of NAs for each chosen probe. I agree it could be interesting to allow control of the number of chosen probes and maybe for overall number of NAs. But on the other hand ā this is an auxilary function that we used in the simulation studies. User should in principle use only methyLImp.
- Check the argument names are clear, for example
col.list
doesn't seem to match the description
Yes, thank you! That was internal use name that we never changed. I've checked the rest and it seems fine but please let me know if it's not.
- [ ] š¢ You may want to show how to construct the
na_positions
object in the vignette, I can see this being difficult for users to work out.
Added a paragraph about that!
- [ ] ā Are the global variables required? I'm not sure if they are needed.
Good question! I've added this because otherwise I was getting warning (or a note?) from R CMD check about undeclared variables.
- [ ] š I really liked your comments! They were super helpful for understanding what was going on.
Thank you so much!
Third-party code
- [ ] šØ Please check you are following the licenses from any code from other packages, particularly the requirements around using a GPL-3 license.
I think I'm ok but if you have any particular comments about this, I would greatly appreciate it since I'm not experienced in licensing.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand Bioconductor Package Naming Policy and acknowledge Bioconductor may retain use of package name.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
[x] I am familiar with the Bioconductor code of conduct and agree to abide by it.
I am familiar with the essential aspects of Bioconductor software management, including:
For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.