Evizero commented 8 years ago

I was asked today how one would start contributing to JuliaML and if there is a list somewhere. This is my attempt to start such a list.

Because we are evolving as we go along, it must be quite difficult for a potential new contributor to see what is going on and where to start, especially since our packages have discussion threads that must take hours to read. Yet we have functionality that is clear and stable enough to offer tasks that are appropriate as a first contact. It seems like a good idea to me to list them in a central place. I think the following entries should make clear what I have in mind.

Please post issues that you think would fit into this list. cc: @tbreloff @joshday @ahwillia

List of available Introduction issues

Losses.jl

This package is intended to provide a fast backend for computing loss functions. As such, we would like it to be as complete as possible, even if that means including losses that are seldom used in practice.

New Losses
- [x] ~~DWDMarginLoss #54 (online reference provided)~~
- [x] ~~L2MarginLoss #44~~
- [x] ~~PinballLoss #42~~
- [x] ~~ExpLoss #50~~
- [x] ~~SigmoidLoss #51~~

Documentation
- [x] ~~Improve documentation #46~~

Penalties.jl

Penalties/regularization (e.g. Ridge and LASSO) for machine learning. This may be merged with Losses.jl at some point.

Penalties currently implemented are Ridge (L2Penalty), Lasso (L1Penalty), elastic net, and SCAD.
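
For reference, the first three of these penalties are commonly written as below for a coefficient vector β. The notation (regularization strength λ ≥ 0, elastic net mixing weight α ∈ [0, 1]) is a usual textbook convention and an assumption here, not something taken from Penalties.jl itself; some references also scale the quadratic term by 1/2.

```latex
\begin{aligned}
\text{Ridge (L2):} \quad & P(\beta) = \lambda \sum_j \beta_j^2 \\
\text{Lasso (L1):} \quad & P(\beta) = \lambda \sum_j |\beta_j| \\
\text{Elastic net:} \quad & P(\beta) = \lambda \Big( \alpha \sum_j |\beta_j| + (1 - \alpha) \sum_j \beta_j^2 \Big)
\end{aligned}
```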

New Penalties (edit from chris: these bullet points need actionable information)
- [ ] MC+
- [ ] other?

ContinuousOptimization

Tons of help needed

ChrisRackauckas commented 8 years ago

It might be a little ambitious to keep a list of the things to contribute in different packages all here, doubling up with what's likely in the package's issues themselves, and probably getting out of date fast.

Evizero commented 8 years ago

Currently it is pretty manageable. We will rethink this as soon as the packages become more mature and self-sustaining.

sarvghotra commented 8 years ago

@Evizero Do you have any suggestions for a resource to refer to for these loss and penalty functions?

Evizero commented 8 years ago

Sorry for the late reply.

Well, my main resource concerning properties has been Steinwart, Ingo, and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.

Other than that, the issues themselves should provide the main loss function that needs to be represented. The other implemented loss functions show how and what is required. I suppose some previous knowledge about loss functions and taking derivatives is needed.

naveenjafer commented 7 years ago

@Evizero Hi, I am getting started with JuliaML and looking to contribute where possible. Is this page up to date with the developments that have been carried out recently? I have decent exposure to ML and neural nets, and I am looking for a good starter task to explore a little more and also gauge my capabilities realistically. It would be great to have some suggestions from you. TIA

Evizero commented 7 years ago

Hi, welcome! No, I don't think we have updated the page for a while. Things are still moving around a lot, so it is difficult to keep information organized.

Currently a lot is in limbo as I am getting some stuff ready, so some packages like MLMetric and MLDataUtils are pretty much broken on their master branch until MLLabelUtils is registered and my new LossFunctions additions are merged. Keep in mind that a lot of our current attention goes into "basics", such as splitting datasets, encoding classification targets, loss functions, and metrics.

I can't quite think of a good intro issue to work on right now, but maybe we can find something that interests you that we are not yet working on? Any subtopic in particular that you would like to explore?

ChrisRackauckas commented 7 years ago

On that note, we should think of some GSoC projects. Proposal time will start soon.

naveenjafer commented 7 years ago

@Evizero One library that has really caught my attention apart from JuliaML over the past couple of weeks has been mlpack. There are quite a few simple APIs in mlpack that could be implemented in Julia. How about I start listing the functionality that is currently missing in JuliaML, and then start working on some of it?

naveenjafer commented 7 years ago

@Evizero Why exactly are JuliaStats and JuliaML maintained separately? There are no overlaps as far as I can see from a superficial glance. Would what I would like to implement fit better into JuliaStats? I am interested in exploring a little more of what has been done with K-means (https://github.com/JuliaStats/Clustering.jl) and k-nearest neighbours (https://github.com/KristofferC/NearestNeighbors.jl) in Julia.

tbreloff commented 7 years ago

Naveen: partly politics. JuliaStats was established and the members were not so interested in rebuilding. We wanted to start with a fresh design that incorporates more than just classic stats.

In the long term, perhaps the orgs will merge...

Evizero commented 7 years ago

I think a big reason for separate orgs is that JuliaStats is pretty mature already, and thus it is difficult to experiment with changing established things. With JuliaML we have more freedom to try different things, fail here and there, learn from it, improve, and reiterate. On the other hand, changes going into JuliaStats need to be of consistently good quality right from the get-go, and thus there is just less freedom for exploration.

If you are interested in K-means clustering then probably you are under better guidance there for the time being.

Evizero commented 7 years ago

> Naveen: partly politics. JuliaStats was established and the members were not so interested in rebuilding. We wanted to start with a fresh design that incorporates more than just classic stats.

I think that sounds a bit harsher than intended. I completely understand that one is reluctant to change things just because people (i.e. us) come up with visions of what could be (without any hard proof or demonstrated effort). So I don't think there is any bad blood between JuliaStats and JuliaML at all. We just decided to play in our own playground in order to see if we can actually build what sounds so good in our heads. And we need the freedom to break things to do so.

ChrisRackauckas commented 7 years ago

I think the bigger thing is vision. JuliaStats is looking to build tools you'd expect from other "Base libraries". I mean, it has some of the old R gurus, and it's looking to implement all of the functionality you'd expect to see from R/Python/MATLAB in terms of statistics/machine learning.

JuliaML is very much an "inspired by Julia" kind of thing. JuliaML is re-designing how to build machine learning libraries to make them miles more flexible by using the unique functionality of Julia. It's about building modular tools so you can mix and match all of the internal components and create your own algorithms.

So while JuliaStats is more developed (it's older), most of its libraries "just do something": you have data, put it in and get clusters out (the standard old way). JuliaML's libraries have far more potential to be a truly interesting research tool because they are modifiable at every step of the way, which means it will definitely be a tool for researchers and professionals, but maybe not a Base beginner's tool (why would you care about specifying weight functions and all of that jazz when you don't really know the mathematics and just want a function which spits out clusters?). Because of that, there's quite a bit of separation between what JuliaML and JuliaStats actually do, even if they are at face value in similar territory.

tbreloff commented 7 years ago

Well said Chris. I agree.

naveenjafer commented 7 years ago

@ChrisRackauckas That was really helpful in understanding both better. Thank you. I'll start off with different implementations for the KMeans and then come back here once I get a good hang of Julia.

americast commented 7 years ago

Hello everyone, I was planning to build a package which would make the creation of data-providing packages easier (reference: http://julialang.org/soc/projects/general.html#standardized-dataset-packaging).

I would like to know which problems I should solve next in order to get better acquainted with the project.

Gramercy.

tbreloff commented 7 years ago

@americast I think the first step would be to scour the existing packages to know what exists. See for example: https://github.com/oxinabox/CorpusLoaders.jl. I think @Evizero might have a package which loads some datasets as well. You could work on unifying the interface and discovery mechanisms for:

- searching for available datasets
- wrapping existing solutions in a common interface
- adding new data sources

tbreloff commented 7 years ago

Forgot to mention! https://github.com/JuliaML/MLDatasets.jl

tbreloff commented 7 years ago

FYI... I added a link to this issue from the website. In the future when we are listing ways to contribute to individual parts of the ecosystem, I think it would be a good idea to add a bullet point to the first comment in this issue with a link to an issue in the respective repo.

Evizero commented 7 years ago

What is happening with the GSoC proposals? There are two projects listed under "JuliaML" and I knew of neither of them. One even explicitly states my name as mentor, which I was completely unaware of.

Guys, I appreciate everyone wanting to move forward and promote things, but we can't just make promises like that on a whim. While I don't expect to be "in the know" about everything, there should at least be a clear understanding and agreement on who would mentor a project before throwing out ideas and hoping someone will step up.

I do think being a GSoC mentor comes with some responsibility and such a promise should not be thrown around lightly. In the future we must be more organized or it will make us all look bad.

ChrisRackauckas commented 7 years ago

That seems like it was a mistake that will get cleared up.

https://github.com/JuliaLang/julialang.github.com/commit/581d6d441570b3935f7634518c4d3e7678bd9942

Evizero commented 7 years ago

https://github.com/JuliaLang/julialang.github.com/pull/521

americast commented 7 years ago

@ChrisRackauckas The changes made in https://github.com/JuliaLang/julialang.github.com/commit/581d6d441570b3935f7634518c4d3e7678bd9942 seem to have been reversed by https://github.com/JuliaLang/julialang.github.com/commit/a4e3a12fb4fa4ca1794e342339e09c1f6c220d96. Is John Myles White the mentor for the project on standardised dataset packaging (http://julialang.org/soc/projects/general.html#standardized-dataset-packaging)?

Gramercy.

americast commented 7 years ago

@tbreloff I have started developing the package. I have created a small abstract of the search and download feature here: https://github.com/americast/DataDeps.jl/blob/master/src/DataDeps.jl. I request you to kindly have a look at it. It would be nice if you could tell me which specific features would be better in this regard.

Gramercy.

americast commented 7 years ago

@MikeInnes I am developing the package here: https://github.com/americast/DataDeps.jl (Documentation: https://github.com/americast/DataDeps.jl/wiki). Please have a look. Though the project is in its infancy now, it would be nice if you would kindly provide some feedback or create issues in the repo. It would help me in making further plans.

Also, just today I came across this package: https://github.com/JuliaML/MLDataUtils.jl/ Its purpose looks similar to that of the package I am developing. Would it be better to work on this package rather than creating a completely new one? Please guide me in this regard @tbreloff @Evizero.

Gramercy again...

Evizero commented 7 years ago

@americast I am sorry that you are receiving so little feedback, but that is how it goes sometimes. It is a bit unfortunate that JuliaML was listed under mentors for that project proposal, since it seems apparent that no member has the time or interest to actually mentor it.

Concerning MLDataUtils: as I see it, a DataDeps package is conceptually completely different, since its role should probably be similar to BinDeps.

americast commented 7 years ago

@Evizero Thanks a lot for your response! It's true that DataDeps is conceptually completely different. I was wrong in saying that its purpose is the same as that of MLDataUtils. But I guess DataDeps should have many of the features of MLDataUtils, although they are to be used in an entirely different manner. Hence, I was wondering if starting to build the package from MLDataUtils would be a better approach. For example, features like segregating data into training, validation, and testing sets are needed in both packages.

I am looking for a mentor to guide me a bit. Can you suggest a mentor @tbreloff @Evizero @johnmyleswhite @StefanKarpinski @MikeInnes @ahwillia @bicycle1885 @malmaud @ninjin @paulhendricks @pluskid?

oxinabox commented 7 years ago

DataDeps shouldn't really do any data segregation or preprocessing, just acquisition.

It should perhaps do a little data munging to force the data into a usable (MLUtils friendly) shape.

But I'm not even sure how much of that should be done by the package, and how much should be done by the people calling the package. For example, it is out of scope for it to parse malformed SGML (which is the official format of the SemEval datasets).

As to segregation:

- Either the data has official test/valid/train splits at different URLs, which can be handled as entirely separate downloads.
- Or it has the splits as folders etc. within, say, a zip you download. This would benefit from some magic that stops the same download happening twice.
- Or there are no splits, in which case it is the user's problem.

What DataDeps should focus on is data loading centric things.

This means the ability to load data to and from sensible places, with good defaults so that the user does not have to worry about where the data is going. E.g. first try to save/load from /usr/share/julia_data_dep/, then fall back to ~/.julia/DataDeps/data. The user should also have the option to specify the location.
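
To make the fallback order concrete, here is a rough sketch in Python (not Julia, and not code from any existing DataDeps package). The function name `resolve_data_dir` and the rule "first location that can be created and written to wins" are assumptions for illustration; only the two default paths come from the description above.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical defaults mirroring the two locations mentioned above.
DEFAULT_DIRS = ("/usr/share/julia_data_dep", "~/.julia/DataDeps/data")


def resolve_data_dir(user_dir: Optional[str] = None,
                     candidates: Iterable[str] = DEFAULT_DIRS) -> Path:
    """Return the first directory that exists (or can be created) and is writable.

    A user-specified directory, if given, is tried before the defaults.
    """
    ordered = ([user_dir] if user_dir else []) + list(candidates)
    for raw in ordered:
        path = Path(raw).expanduser()
        try:
            path.mkdir(parents=True, exist_ok=True)
        except OSError:
            continue  # e.g. no permission to create a system-wide directory
        if os.access(path, os.W_OK):
            return path
    raise RuntimeError("no writable data directory found")
```

Calling `resolve_data_dir()` with no arguments would try the system-wide location first and silently fall back to the per-user one when the former is not writable.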

It should also allow for handling mirrors. If a dataset is available on multiple mirrors, it wants to choose the closest/fastest -- or give the user an option. Or try one at random then fall back if that one is down. Possibly if a mirror is down it should automatically open an issue on the github for the package that is providing/demanding the data.
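
The "try one at random then fall back" strategy could look roughly like the Python sketch below. The names `fetch_url` and `fetch_from_mirrors` are made up for illustration, and picking the closest or fastest mirror (or opening an issue when one is down) is deliberately left out.

```python
import random
import urllib.request
from typing import Callable, Sequence


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_from_mirrors(mirrors: Sequence[str],
                       fetch: Callable[[str], bytes] = fetch_url) -> bytes:
    """Try the mirrors in random order and return the first successful download."""
    errors = []
    for url in random.sample(list(mirrors), k=len(mirrors)):
        try:
            return fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            errors.append((url, exc))
    raise RuntimeError(f"all mirrors failed: {errors}")
```

The `fetch` parameter is there so that the download step can be swapped out, for example for a stub in tests or for a function that also verifies a checksum.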

A good way to handle this kind of thing is one of the parts I am wanting for https://github.com/oxinabox/CorpusLoaders.jl/

I am not willing to mentor this project at this point in time.

juliohm commented 4 years ago

Could you please share the status of the JuliaML organization? I'd really like to get more involved with package development. In particular, I've started using LossFunctions.jl in the GeoStats.jl stack because it is awesome. I have some open issues and pull requests that are taking a bit longer to be addressed than I expected, and so I am here asking.

Can I get more involved somehow? What is the process for joining the org or at least gaining write access to some of the repositories here?

Thank you,

oxinabox commented 4 years ago

We should determine a process for adding people to the org. Several people are much less involved than they were (which is reasonable enough). And I only touch MLDataUtils stuff, which is not the whole org (and even then only minimal maintenance).

I feel like, as an ad hoc solution, adding well-known community members like @juliohm and giving them full permissions is reasonable. And if no one objects I will just do that on Friday.

oxinabox commented 4 years ago

Because no one has objected, I will be adding @juliohm to the org presently.

@juliohm: the stuff about being responsible I am sure goes without saying. Try not to merge your own PRs without review. Doesn't hugely need to be someone from the org doing the review, but someone. I strongly recommend following the Continuous Delivery practices which boil down to: if reasonable tag a release after every PR, or if not then change the Project.toml version to be suffixed with -DEV.
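
As an illustration of the second option, the relevant line of a Project.toml between releases might look like the fragment below. The package name and version number are made up, and other required fields such as `uuid` are omitted.

```toml
# Project.toml on the main branch, between tagged releases (hypothetical values)
name = "ExamplePackage"
version = "0.3.1-DEV"
```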

oxinabox commented 4 years ago

@juliohm seems you already are a member and have full permissions on the packages in the org, including LossFunctions.jl

joshday commented 4 years ago

I just added him this morning. Glad to have you @juliohm!

juliohm commented 4 years ago

Thank you guys! Happy to join! I am also very pedantic with code review and collaborative development, so you can be sure that I will follow the guidelines strictly. 👍

JuliaML / META

Contributing to JuliaML #14
