StephanSeifert / SurrogateMinimalDepth

This package is archived and is further developed under the name RFSurrogates. This R package provides functions to select important variables and to investigate variable relations using surrogate variables.
https://github.com/AGSeifert/RFSurrogates

very large p dataset #11

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi

I have a dataset with 37,634 features and a total of 1,162 rows/observations. I tried running var.select.md with ntree = 1000, and it seems to hang:

Growing trees.. Progress: 58%. Estimated remaining time: 22 seconds.
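For reference, the call was along these lines (a sketch; x and y are placeholders for my feature matrix and response vector, which are not included here):

library(SurrogateMinimalDepth)

# x: 1,162 x 37,634 feature matrix, y: response vector (placeholders)
res.md <- var.select.md(x = x, y = y, ntree = 1000)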

Is the method suitable for datasets with such a large p?

Thank you.

StephanSeifert commented 4 years ago

It should work, even for such a large p. Does it still not work with the update I made some months ago? PS: Sorry for the late response... Best regards Stephan

Sterls commented 3 years ago

I'm experiencing a similar issue, and I'm not sure whether I'm the one causing the problem.

Data: 801 rows x 20000 cols of floating point values

Code: res.smd = var.select.smd(x = df[, 1:(ncol(df) - 1)], y = y, s = 20, ntree = 10000)

Output:
Growing trees.. Progress: 25%. Estimated remaining time: 1 minute, 33 seconds.
Growing trees.. Progress: 49%. Estimated remaining time: 1 minute, 5 seconds.
Growing trees.. Progress: 73%. Estimated remaining time: 34 seconds.
Growing trees.. Progress: 98%. Estimated remaining time: 2 seconds.

This has been hanging for about 12 hours. I don't think I did anything wrong, because when I use the same data with ntree = 100, I get the expected output. However, I see that in your paper you used 10,000 trees for a dataset with 1,000 features, so I intended to gradually increase the number of trees and monitor how the output changes (as sketched below). Is this a known issue?
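For what it's worth, the scaling experiment I had in mind looks roughly like this (a sketch; the intermediate ntree values are illustrative):

# time var.select.smd while increasing ntree step by step (values illustrative)
for (nt in c(100, 500, 1000, 5000, 10000)) {
  t <- system.time(
    res <- var.select.smd(x = df[, 1:(ncol(df) - 1)], y = y, s = 20, ntree = nt)
  )
  cat("ntree =", nt, "elapsed:", t["elapsed"], "seconds\n")
}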

StephanSeifert commented 3 years ago

Hi,

thank you very much for your message. How many cores are you using, and is it a classification or regression setting? I think the determination of the surrogate variables can be slow when many samples are used and many trees are grown with a min.node.size of 1, because the surrogate variables have to be found for each node. I usually run it on a workstation with a high number of cores to reduce the running time. If you have access to one, I would be very interested to hear whether that solves the issue or whether there is a problem other than the pure running time.
Otherwise, you could try a higher minimal node size, e.g. 5 or 10, to reduce the running time. The results would probably not be as good as with 1, but since you have quite a few samples, they should still be satisfying. I would be happy to hear if one of these ideas works for you...
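For example, something along these lines (a sketch; check ?var.select.smd to confirm that min.node.size and num.threads are available in your package version):

# sketch: larger terminal nodes and explicit parallelism; parameter
# availability depends on the package version (see ?var.select.smd)
res.smd <- var.select.smd(x = df[, 1:(ncol(df) - 1)], y = y, s = 20,
                          ntree = 10000,
                          min.node.size = 5,   # instead of the default of 1
                          num.threads = 16)    # match your workstation's cores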

Best regards Stephan