JuliaML / TableTransforms.jl

Transforms and pipelines with tabular data in Julia
https://juliaml.github.io/TableTransforms.jl/stable
MIT License
103 stars 15 forks source link

Add ProjectionPursuit (new PR) #141

Closed DianaPat closed 1 year ago

codecov-commenter commented 1 year ago

Codecov Report

Merging #141 (df7af29) into master (a075bb8) will increase coverage by 0.42%. The diff coverage is 96.80%.

@@            Coverage Diff             @@
##           master     #141      +/-   ##
==========================================
+ Coverage   94.71%   95.14%   +0.42%     
==========================================
  Files          26       26              
  Lines         795      885      +90     
==========================================
+ Hits          753      842      +89     
- Misses         42       43       +1     
Impacted Files Coverage Δ
src/transforms.jl 100.00% <ø> (ø)
src/transforms/projectionpursuit.jl 96.80% <96.80%> (ø)
src/transforms/parallel.jl 95.00% <0.00%> (-0.07%) :arrow_down:
src/transforms/identity.jl
src/transforms/scale.jl 100.00% <0.00%> (+5.55%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

DianaPat commented 1 year ago

Hi @juliohm I already changed to \cdot notation, checked that the diag function really gives a unitary vector and got rid of some outdated notation. But when I tried to change the plots in the visual test, they didn't get plotted all too well: projectionpursuit-1 projectionpursuit-2 One of the possible changes would be to modify the gr settings to change the size of the image, but even so the graphics at the diagonal keep coming out as lines (just like in the images above), so for now I just did the test with each corner separately and will continue checking the problem of the diagonals.

juliohm commented 1 year ago

I already changed to \cdot notation, checked that the diag function really gives a unitary vector and got rid of some outdated notation.

That is great @DianaPat , now we can remove the normalization of alpha inside the rscore function, and only normalize it in the neadermead step. I will propose the changes so that you can review and accept.

But when I tried to change the plots in the visual test, they didn't get plotted all too well:

Let's just plot the result of the corner plot after the transformation in one figure and the result of reverting the transformation in a separate figure. They don't need to be in the same figure. Also, you can pass the option size=(800,800) to the specific plot commands to overwrite the default size set in the gr command.

juliohm commented 1 year ago

Hi @DianaPat can you please double check that this GitHub option is enabled on this PR?

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

I fixed the problem locally and would like to push the changes to the PR.

DianaPat commented 1 year ago

It is enabled

juliohm commented 1 year ago

Hi @DianaPat it is saying that permission is denied. Can you incorporate the fixes in your local copy and regenerate the figures?

  1. Remove the option aspectratio=:equal from the runtests.jl file and place it in every existing plot command that existed before in the test files. Basically, scale.jl, zscore.jl, eigenanalysis.jl, if I recall correctly.
  2. Delete all png files in the test/data directory and rerun the tests to generate new figures.
  3. You should see all corner plots correctly plotted now. The aspectratio option was the issue.

After you make the changes you will see that the projection pursuit result is not looking quite right yet. The final distribution is not spherical visually indicating that either the number of iterations wasn't enough or that some other bug is still hidden somewhere.

juliohm commented 1 year ago

Thank you @DianaPat. As you can see the first case is not working as expected given that one of the bivariate distributions doesn't look spherical. Can you please take a closer look at what may be happening? After the apply is fixed we we need to also plot the revert.

DianaPat commented 1 year ago

Yes, I'm already looking for this bug and other ones I've found. I'll update as soon as I find something!

DianaPat commented 1 year ago

Hi @juliohm, I think the error is fixed.

In fact the function was not comparing the data to a gaussian up to perc percentile, but with 1-perc. There was also an issue with just taking a higher percentile because the optim function raised an error. That part was just fixed with the float at line 93. But then the tests stopped passing (at this point I'm wondering how did they pass before), so I relaxed the conditions.

At this point I think maybe it would be better to change all occurrences of perc to 1-perc in the code so we can keep the previous value, as it is more intuitive.

juliohm commented 1 year ago

Hi @DianaPat thank you for checking it. Can you please move the float call to within the definition of the basis function? This function should return a floating point vector always to avoid type instability.

Regarding the percentile option, you mean that it is more intuitive to use the value .99? It would be great if you could make the appropriate changes. After that, it would be interesting to print the number of structure removal iterations that are happening in both test cases. It seems that the structure is not quite Gaussian still? Visually, the resulting point cloud has some borders that differ from those of a multivariate Gaussian.

DianaPat commented 1 year ago

I already made some experiments. Without any modification of parameters it runs the structure removal cycle 23 times. By changing the parameters a little, it reached 101 iterations; the problem is that it took aprox. 300 seg, so the tests would take more time.

I saved the figures to do a comparison of improvement:

test2 test

juliohm commented 1 year ago

I am finding it strange that the the points near the "tails" of the resulting distribution are somewhat spread. If you plot the ideal standard Gaussian you will be able to see the difference. Do you know how to use the VSCode debugger? You could add a plot command inside the structure removal loop and then add a breakpoint to see the plot at each iteration. Another thing you can do is check the optimal alphas as we did in class to see if everything is working as expected. Finally, a good diagnostic would be to plot the q-q plot of the transformed columns against the standard Gaussian to see if they pass through the identity line.

DianaPat commented 1 year ago

You are right! The tails diverge from a gaussian outside the interval [-3,3]. I double checked the removal structure function but it's working as expected. I went back to the bibliography (Friedman pg 255) and found this: Tails_under When I tried increasing the deg argument, there was a slight improvement, but it came at the cost of time.

juliohm commented 1 year ago

That is a very good catch. Can you comment on the time increase? Is it 2x slower? How the plots look like with higher degrees? Can we find a good compromise by default?

Em dom., 16 de out. de 2022 19:46, DianaPat @.***> escreveu:

You are right! The tails diverge from a gaussian outside the interval [-3,3]. I double checked the removal structure function but it's working as expected. I went back to the bibliography (Friedman pg 255) and found this: [image: Tails_under] https://user-images.githubusercontent.com/105749646/196061909-119106f3-f4c8-461f-a664-4cc8d7baff4b.PNG When I tried increasing the deg argument, there was a slight improvement, but it came at the cost of time.

— Reply to this email directly, view it on GitHub https://github.com/JuliaML/TableTransforms.jl/pull/141#issuecomment-1280074599, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3MAVSGX4ULX2O5SD7TWDSAUBANCNFSM6AAAAAAQZXLHKQ . You are receiving this because you were mentioned.Message ID: @.***>

DianaPat commented 1 year ago

The time increase is ~3.5x: With deg=5 it took around 16 sec to obtain the answer (image on the left), and with deg=30 it took 352 sec (image on the right) qq1 qq30 I think the improvement is not enough to offset the cost. An idea could be to do another process post-transformation to normalize the tails.

juliohm commented 1 year ago

Thank you @DianaPat for these careful checks. I agree with you that the time cost does not pay off, and we just need to be aware of this behavior near the tails.

I will try to review the PR one more time during the day. It would be important to plot the transformed and reversed data for both data sets if that is not already done.

juliohm commented 1 year ago

That is really awesome @DianaPat. I think we finally made it! 🎉 Congrats!

One additional change we could implement to avoid depending on the heavy Optim.jl package is replace it with the NelderMead.jl package, which only implements the algorithm are currently using: https://github.com/jwscook/NelderMead.jl

You can learn more about the algorithm in this lecture: https://youtu.be/vOYlVvT3W80 It is beautiful and simple idea.

In theory, we would just need to replace the Optim.optimize function by the NelderMead.optimise function and rerun the tests. If you feel that you have some time to try this out, please go ahead. Otherwise, I invite you to review the original PR you submitted to compare how much you improved your coding skills in the process 👏🏽

These picky reviews pay off in the long run, and I am really happy you went all the way to the end! 🥇

DianaPat commented 1 year ago

Thank you, I learned a lot from this experience!

On the other hand, I tried to implement the change of package, but it seems that the autor have not registered the package, so I couldn't do it. For now will leave it as is while I look for other alternatives.

juliohm commented 1 year ago

I've opened an issue requesting registration: https://github.com/jwscook/NelderMead.jl/issues/5

juliohm commented 1 year ago

@DianaPat NelderMead.jl is now registered 👌🏽 would you like to give it a try here to see if the results are the same?