garydoranjr / misvm

Multiple-Instance Support Vector Machines
BSD 3-Clause "New" or "Revised" License
234 stars 81 forks source link

Overflow- INT_MAX reached #21

Open RomainCendre opened 4 years ago

RomainCendre commented 4 years ago

Hi everyone, I'm actually running some of my work on your library ( pretty weel designed by the way), and I'm facing some issue during fit step. I got arround 100 000 instances split on 5000 bags.

Problem is encountered in quadprog.convert() method at line P = cvxmat(H), where H is a sample*sample matrix. I got some matrix that contains arround 10 000 000 000 values, and by the ways exceed the authorized number of values (limited to INT_MAX values).

Do someone have some clues on how to solve that? Best regards

garydoranjr commented 4 years ago

Hi @RomainCendre,

I don't think there's a straightforward way to overcome this issue. Fundamentally, kernel methods are O(n^2) in memory complexity without doing some tricks that would require re-writing the existing MIL algorithms to incorporate these tricks (e.g., only using a subset of the instances as candidate support vectors). Even if you didn't hit this INT_MAX limit, you'd run out of memory storing such a large matrix.

One potential workaround for a large MIL problem like this is to use a bag-level classifier rather than an instance-level classifier (if your problem permits this). For example, an approach like the NSK or STK will only construct a kernel matrix that is O(n_bags^2) instead of O(n_instances^2), which for 5,000 bags is more doable. The disadvantage is that with these methods it's harder to get good instance-level labels on the test set (they only work well at classifying an entire bag at a time).

Hope this helps, -Gary

RomainCendre commented 4 years ago

Hi @garydoranjr, Thx for the time you took to answer and for the clarity. As you said, I will first take a look at NSK / STK and I will tell you about it. Take Care, Best regards

Le jeu. 28 nov. 2019 à 18:16, Gary Doran notifications@github.com a écrit :

Hi @RomainCendre https://github.com/RomainCendre,

I don't think there's a straightforward way to overcome this issue. Fundamentally, kernel methods are O(n^2) in memory complexity without doing some tricks that would require re-writing the existing MIL algorithms to incorporate these tricks (e.g., only using a subset of the instances as candidate support vectors). Even if you didn't hit this INT_MAX limit, you'd run out of memory storing such a large matrix.

One potential workaround for a large MIL problem like this is to use a bag-level classifier rather than an instance-level classifier (if your problem permits this). For example, an approach like the NSK or STK will only construct a kernel matrix that is O(n_bags^2) instead of O(n_instances^2), which for 5,000 bags is more doable. The disadvantage is that with these methods it's harder to get good instance-level labels on the test set (they only work well at classifying an entire bag at a time).

Hope this helps, -Gary

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/garydoranjr/misvm/issues/21?email_source=notifications&email_token=AEN47KIPOPYQAJVDDV5QJXTQV74GRA5CNFSM4JRFXU6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFNEULY#issuecomment-559565359, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEN47KLQQG7UR4INACH3NRDQV74GRANCNFSM4JRFXU6A .