Google Summer of Code 2020

cbeleites commented 4 years ago

I'd like to suggest a hyperSpec project for this year's Google Summer of Code. The R Wiki has pages with general info and the project list.

I'd like to couple this with a general discussion of how we want hyperSpec to evolve in the near future. IMHO some refactoring is overdue.

Here are some ideas:

I'd like to split off some packages in order to make hyperSpec more easily maintainable. Also, if the code is in more but smaller packages, changes in dependencies that break one such package won't affect the rest of hyperSpec. (The recent trouble of hyperSpec being archived on CRAN would then have affected at most two import packages, but chances are that the deadlines would have allowed both to be fixed in time.)

- I've already started a separate package, hyperSpec.tidyverse, to provide functions like filter() etc. and to allow working with pipes.
- File import: the file import functions (or more precisely, the 150ish MB of test data files) are the cause of hyperSpec's unusual package structure. This could be avoided by splitting off the file import.
  - The test files can also be moved into the test directory without bloating these packages for CRAN by using .Rbuildignore.
- Graphics with ggplot2: hyperSpec has very rudimentary ggplot2 support. Move this into a separate package and expand it.
- matrixStats is another candidate for such treatment. There was a hyperSpec.matrixStats package in the past, but matrixStats evolved quite rapidly, and I wasn't able to keep up with the required level of maintenance. This has gotten easier now, since CRAN also checks whether an update breaks packages that reverse depend on the updated one, and requires a statement and communication beforehand. I'd thus expect fewer surprises nowadays.
- While in the past I've fought to keep too much high-level functionality out of hyperSpec because of feature creep and maintenance burden, separate packages are a good place to have it.
- This also includes conversion functionality between different spectra-related packages.
- Possibly even Python modules, since combined working in R and Python has become very convenient (though my experience so far is only with using R + Python together interactively or in .Rmd documents).

That being said, there's also functionality on our wish list, in particular wrt. graphics and file import (I have a bunch of test files for formats that we cannot yet read).

What we need

Please fill in here if you'd be willing to mentor or to be a GSoC student:

Mentors:

Claudia (@cbeleites): I'm happy to co-mentor. I'll probably be around most of the summer, but I may be away for several days on short notice. I also have a conference.
Roman (@ximeg): I would be happy to co-mentor, and can provide expertise with git, Python and ggplot2. If a student happens to be around Memphis, I can also organize a lab tour and show real instruments for spectroscopy in action!

Students:

everyone please ask around and see if you can find suitable students

bryanhanson commented 4 years ago

I'll gladly co-mentor, so CB if you want to create the project at the Wiki you can add my name or I'll add once you put the basic info there.

Tomorrow I will write with some general comments about your bullet points for suggested directions.

ximeg commented 4 years ago

Claudia, taking part in GSoC is a great idea, and I would be happy to mentor a student or two! I really like your ideas; I especially favor the support for pipe operators, an elegant interface to ggplot2, and the interface to Python. Those would be really cool features! However, your other suggestions are important as well. The things that could improve maintainability and reduce our overhead may be even more important at this stage. The old truth is "Fixing a bug today costs less than tomorrow".

bryanhanson commented 4 years ago

Some thoughts about the refactoring talking points, and a few other things, to start the conversation:

1. The proposed changes sound good to me, but they are extensive. It looks like great attention to coordination/timing will be needed to keep the whole thing working. How to split things out needs to be vetted carefully, not hurriedly. Once existing functions are broken into sub-packages and all the parts work together but are essentially unchanged, it will be much easier to improve the pieces on their own timelines.
2. About 2 years ago I created ChemoSpec2D for 2D NMR, with as many parts as possible parallel to the existing ChemoSpec. For my sanity and DRY principles, I broke out all the common stuff into ChemoSpecUtils. This proved to be a very good thing, except when a totally new feature is introduced. Since ChemoSpec and ChemoSpec2D both depend heavily on ChemoSpecUtils, you can't deploy them until a new ChemoSpecUtils is first approved and deployed. In this case, ChemoSpecUtils can't have examples or test code that requires the new but as yet undeployed versions of ChemoSpec and ChemoSpec2D. So I wrote a function that tests whether a particular package version is available to handle this, and sometimes you just have to leave the examples as dontrun until all 3 updated packages are available. There are other tricks as well; see the entry on tinytest.
3. I don't recommend using git submodules as a way to organize the project. I have used other people's projects with submodules before, and it is never straightforward. A bit of exploration on StackOverflow on the topic will show that no one finds submodules easy to deal with. I don't know of an alternative that would let hyperSpec developers keep all the parts current when working on a single part, except that one might be able to build an elaborate makefile to check for updates across all the sub-packages and bring them in as needed.
4. I switched to tinytest instead of testthat, primarily because it is much leaner with no dependencies. It also has a nice mechanism one can use to detect if one is not on CRAN. This allows more time-consuming tests to run locally, and can include tests that only work locally because that is where all the new versions of the sub-packages reside in harmony. Re-writing a few testthat tests to tinytest could be a GSoC demo task.
5. If/when you want to update the JCAMP-DX reading functions, I'd suggest re-writing them to incorporate the API of my readJDX package. readJDX has reached maturity and stability. The only DX files it can't currently read have demonstrable errors in them. It is a bit slow, because it is faithful to the original verification steps described in the standard, but one doesn't import files every day. Another possible GSoC test for students would be to re-write the hyperSpec DX readers to use the readJDX API.
6. ggplot2 support: This is no doubt the way to go, but be prepared to monitor ggplot2 development in detail. Historically, they have used a "move fast and break things" approach and their API for using ggplot in functions kept changing. I gave up using ggplot2 in one instance because they kept breaking my stuff. I do think it has been more stable in the past year, however. And they do extensive reverse dependency checks now, but you'd need to fix on their schedule.
7. I have some experience trying to use reticulate in functions deployed to Travis, and it was sufficiently difficult to get the build environment and the Python virtual environment right that I gave up. Using Python locally, where one can control and inspect things much more readily, is feasible. Now that Python 3.x is the only supported series, things might improve gradually.
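
A minimal sketch of the kind of version test mentioned in the ChemoSpecUtils point above. This is not the actual ChemoSpecUtils function: the helper name pkg_version_available() and its arguments are made up for illustration, and "stats" only stands in for a real sub-package. It relies solely on requireNamespace() and utils::packageVersion() from base R.

```r
# Hypothetical helper (illustrative name, not taken from ChemoSpecUtils):
# returns TRUE only if `pkg` is installed AND at least `min_version`.
pkg_version_available <- function(pkg, min_version) {
  # requireNamespace() returns FALSE instead of erroring if pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= package_version(min_version)
}

# Usage: guard an example or test that needs a not-yet-released version.
# "stats" is only a stand-in for a sub-package such as ChemoSpec.
if (pkg_version_available("stats", "1.0.0")) {
  message("required version found: the guarded example would run here")
} else {
  message("required version missing: the guarded example is skipped")
}
```

Guarding examples and tests this way would let the utility package pass checks before the dependent packages are updated, at the cost of those examples silently not running until then.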

I hope the points above don't sound too negative -- I just have an aversion to repeating negative experiences!

Comments?

ximeg commented 4 years ago

@bryanhanson

The proposed changes sound good to me, but they are extensive. It looks like great attention to coordination/timing will be needed to keep the whole thing working.

Yes, we need to be careful to design work packages in such a way that they do not depend on each other. Suppose one work package is a big refactoring of the whole project and other WPs focus on new features. These cannot run in parallel, especially with people new to the project and potentially new to software development methods. We should reasonably assume that any of them could fail and that we would have to finish their part ourselves, in the worst case reverting the code to the original state. We need to be careful here.

I don't recommend using git submodules as a way to organize the project.

I have had some experience with them... Yes, one can learn how to work with them, but I agree that this is not really straightforward and many struggle with submodules. I also agree with you that submodules are not necessary for splitting up the project, because R itself does a wonderful job of checking dependencies and installing missing packages during the build process.

cbeleites commented 4 years ago

(I'm using @bryanhanson's numbering)

It looks like great attention to coordination/timing will be needed to keep the whole thing working.

First of all, I don't expect to get more than one student unless we can formulate two clearly distinct sets of tasks. So, work will have to be sequential for that reason already. Also, we'll probably have to choose which tasks we tackle.
Dependencies and order of deployment: Yes, that is certainly true.
- For the file import, I'm envisioning small separate packages (hyperSpec_read_spe, hyperSpec_read_spc, ...). These will depend on hyperSpec (and possibly on other packages) but not the other way round.
- This already protects hyperSpec from breaking because of changes in any such other dependency. (Which would have entirely avoided the recent incident of hyperSpec being kicked off CRAN.)
- Yes, this means that possibly an update in the import package will have to wait until hyperSpec is updated on CRAN. But that's fine with me because I expect that dependency to be very "light": most import functions need few things from hyperSpec:
- new("hyperSpec", ...),
- .fileio.optional()
- and for unit tests the internal storage and/or checksum generation (so we depend on that as well) to stay the same. As @bryanhanson says, we may need to have the unit tests version dependent.
- We may even think to make hyperSpec only a suggestion rather than a dependency. (I'm not sure whether that would help with deployment trouble, though.)
The result would be a bunch of import packages depending on hyperSpec but being totally separate from each other. The situation may be more difficult between hyperSpec.tidyverse and hyperSpec.ggplot2, though.

As for the timeline of moving existing functionality from hyperSpec to a new package, there wouldn't be any need to remove the function from hyperSpec before that new package is ready. We can internally mark it as "DON'T CHANGE". My experience is that I very rarely need to touch the file import functions, and they are reasonably separate from each other.
And, no, I don't see us using git submodules.
The current situation with git lfs is already sufficiently non-standard that many of the spectroscopic collaborators (who have only slight experience with git) are thrown off. I'd rather consider whether we can do without git lfs if the file import packages are sufficiently small and we keep the test files small as well.
tinytest vs. testthat:
- no dependencies is a decided advantage
- I have not yet looked into it in sufficient detail to have an opinion whether to change or not.
- If we change, we'd need some of the dependencies we have via testthat now, e.g. for checksum generation.
- However, we could start using tinytest with the new packages.
- "detect if one is not on CRAN": testthat has skip_on_cran() for that. Internally, both packages seem to do pretty much the same thing here.
I'd be happy to use your JCAMP-DX reader - I'm glad to hear that it is mature now.
ggplot2 moving fast and breaking things:
It is not the only package of interest to hyperSpec that does this. I temporarily gave up on matrixStats. And the tidyverse is another candidate for this. Nevertheless, I'd like to see hyperSpec working with them as that would be a tremendous improvement for everyday work.

IMHO the way to go here is to separate that functionality again into their own packages. If bad things happen, at least it won't hit core hyperSpec then.
reticulate/python: I do take that warning seriously. Again, IMHO if a separate package experiments in that direction, that's fine with me. For hyperSpec itself the recent trouble with system wc already showed me that I won't even consider anything like this anytime soon again in core hyperSpec.
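
As a rough illustration of the "not on CRAN" detection discussed for tinytest and testthat: one common convention is an environment variable that is set only on the developer's machine. The sketch below assumes NOT_CRAN (the variable conventionally set by devtools) and does not claim to reproduce the current internals of either package; the helper names not_on_cran() and run_slow_check() are hypothetical. In real tests one would call testthat::skip_on_cran() or tinytest::at_home() rather than a hand-rolled helper.

```r
# Hypothetical helper: TRUE only when the developer has opted in via NOT_CRAN.
# Assumption: NOT_CRAN="true" is set locally (e.g. by devtools), unset on CRAN.
not_on_cran <- function() {
  identical(tolower(Sys.getenv("NOT_CRAN")), "true")
}

# Hypothetical wrapper: run a slow or local-only check on a developer
# machine, otherwise skip it and report that it was skipped.
run_slow_check <- function(check) {
  if (!not_on_cran()) {
    return(invisible("skipped"))
  }
  check()
}
```

With this pattern, time-consuming tests or tests that need unreleased versions of sibling packages would only run where the variable is set.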

eoduniyi commented 4 years ago

Dear hyperSpec Team,

My name is Erick Oduniyi and I'm an undergraduate student studying computer engineering* at the University of Kansas (KU). This semester, though, I'm taking classes at Wichita State University (WSU). In general, I'm interested in complex systems and becoming a better R developer.

I recently found out about GSoC, so I know I'm running short on time, but I'm extremely interested in participating in the hyperSpec GSoC 2020!!! I'm currently working on the hard and very hard test questions, as well as the application template. I will try to have both done by tomorrow.

For a little more information about me, here is a link to my current CV and a writing sample from one of the physics classes I've taken. At any rate, I'm super excited to learn more about the hyperSpec/hyperSpec.tidyverse R packages!

*I suspect I'm really a psychologist or ecologist in disguise ;)

Best, E. Oduniyi

ximeg commented 4 years ago

Hi @eoduniyi, I apologize for the late reply. Thank you very much for your interest and your impressive CV! We always need helping hands and it would be very exciting to have you onboard. Good luck solving the tests and let us know if you have questions.

Roman

ximeg commented 4 years ago

@eoduniyi, please also take a look at the GSoC wiki for the R project and the table of proposed projects, and add your name to the GSoC hyperSpec wiki page. Thanks!

12VISHESH commented 4 years ago

Hello Sir,

I am Vishesh Tripathi, a 3rd-year BTech student. I am very excited to work on your project, "hyperSpec". I have gone through the project and done the easy task listed there. I have completed one task and am working on the rest. I look forward to completing this project under your mentorship. I know it is late for this, but I am working on it.

Thanks and regards VISHESH

eoduniyi commented 4 years ago

Hi Roman, no worries at all! Thank you and will do!

Best, EO

cbeleites commented 4 years ago

Dear Erick @eoduniyi,

I trust you got my review for your hard task pull request yesterday?

I'd like to encourage you to sign up for GSoC (if you have not yet done so) and start writing the proposal ASAP: the proposal deadline is Tuesday 31st 18:00 UTC (!), and this is a hard deadline. You won't be able to participate in GSoC this year if you don't have your final proposal submitted by then. Compared to this, finishing the test tasks is somewhat lower priority, seeing that you have practically finished the hard task and already started working on the very hard task.

If you put a first draft proposal online by tomorrow (Fri) noon CDT, we can have a look at it and give you suggestions on how to improve it over the weekend.

Many thanks,

Claudia

eoduniyi commented 4 years ago

Dear Claudia @cbeleites,

I apologize for the radio silence. I was focusing on the proposal draft. And yes, I got your review for the hard task pull and will work on implementing your suggestions ASAP.

I have signed up for GSoC and have linked a pdf version of the proposal draft here and shared it via the GSoC application. I'm probably going to copy and paste my proposal into LaTeX, so I have better control of the final output/version.

Claudia and @ximeg, while I'm checking over my proposal I will continue to try to finish the hard and very hard tasks correctly. Also, I have not yet put my name on the wiki because I am writing up my solutions in markdown, which will look something like this. I should have that completed by tonight, though.

My email is eeoduniyi@gmail.com if it's easier to send feedback that way; please let me know if there are other channels that the group would prefer.

Thank you and best regards, EO

eoduniyi commented 4 years ago

Dear All,

I have submitted my final proposal to the GSoC 2020 website.

Best, EO

bryanhanson commented 4 years ago

Thank you Erick!

ximeg commented 4 years ago

Erick became a GSoC student with his project 'Fortification of hyperSpec'. More details are in this blog post.

ximeg commented 4 years ago

GSoC issue

cbeleites / hyperSpec