accord-net / framework

Machine learning, computer vision, statistics and general scientific computing for .NET
http://accord-framework.net
GNU Lesser General Public License v2.1

Liblinear (Linear SVMs) does not train, exits with "index out of range on Math.Accord..." #330

Closed andy-soft closed 7 years ago

andy-soft commented 7 years ago

Hi there, I was trying to emulate the simple example with the a9a dataset (included). I compiled the library into train.exe with Visual Studio 2013; after a while, all OK. Then I ran it with the command-line parameters: -s 2 a9a (the 'a9a' file has been placed in the same directory). The file is read perfectly, but on calling train(problem, parameters); the program exits with this error (this is the console output):

L2RegularizedL2LossSvc
iter 1 act 1.174E+003 pre 1.161E+003 delta 5.718E-001 f 2.348E+003 |g| 6.779E+003 CG 2
cg reaches trust region boundary
iter 2 act 1.424E+002 pre 1.253E+002 delta 6.722E-001 f 1.174E+003 |g| 4.085E+001 CG 4
iter 3 act 3.261E+001 pre 2.966E+001 delta 6.722E-001 f 1.032E+003 |g| 3.819E+001 CG 6
cg reaches trust region boundary
iter 4 act 5.202E+000 pre 4.930E+000 delta 7.117E-001 f 9.989E+002 |g| 2.083E+001 CG 14
A first chance exception of type 'System.IndexOutOfRangeException' occurred in Accord.Math.dll
The program '[0x548] linear.vshost.exe' has exited with code 1 (0x1).

Any clue?

I could not debug this deeply; I don't know all the implementation tricks and issues!

Actually, I saw that all the matrix math and vector training is performed over dense matrices. Therefore I cannot load a huge sparse problem into memory; I tried, and it cannot even be read.

The LIBLINEAR C++ code handles this internally as sparse arrays (index, value) and is really very fast: it trains over a whole 99-megabyte text file (240k samples, 70 jagged parameters) in under 2 seconds. The same code in C# does not finish after several hours.

Another thing I want to know: are the 'model' files compatible between the C++ version and your C# version? Is the loading of the support vectors identical, so that I can train with the original C++ code, load the model file in C#, and just call Decide()? Am I right?

And thanks for such a good job!

I guess a sparse vector implementation may be faster and less memory-hungry than the dense one (on sparse data, of course).

I am doing lots of NLP work and I actually use C#, so I need your code. I am using some code I've developed on my own, but you program new algorithms faster than I can keep up with.

I even asked you about CRFs some time ago and you just did it! The problem is the sparse data: I have tons of training corpora, and the problem does not fit in memory (I have only a miserable 8 GB on Windows 10 x64, and sometimes I guess it needs 120 GB or more).

Also, I am thinking of using CUDA and optimized code, because training a deep belief network with more than 10k dimensions and several deep layers becomes impossible in human time (weeks of training), while with CUDA it can take a few minutes, rarely running into hours.

best regards, and hope we can find this bug, or whatever

cesarsouza commented 7 years ago

Hi Andy!

Sorry for the delay in answering! I have fixed the index-out-of-range bug and updated the sample application, so it now uses the sparse linear kernel by default. And to answer your other question, regarding whether it's possible to learn using libsvm/liblinear and then load the models in Accord.NET: the answer is definitely yes. You can save and load models in LibSVM's format using the LibSvmModel class. To create an Accord.NET SVM with the parameters from a libsvm file, first load the file using LibSvmModel.Load(filename) and then call CreateMachine().
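As a concrete sketch of that round trip (the file name and input vector below are placeholders; this assumes a model file previously written by libsvm/liblinear):

```csharp
using System;
using Accord.IO;
using Accord.MachineLearning.VectorMachines;

class LoadLibSvmExample
{
    static void Main()
    {
        // Load a model file written by libsvm/liblinear (path is a placeholder).
        LibSvmModel model = LibSvmModel.Load("a9a.model");

        // Convert it into an Accord.NET support vector machine.
        var svm = model.CreateMachine();

        // Classify a sample; this zero vector is just an illustration.
        double[] input = new double[svm.NumberOfInputs];
        bool decision = svm.Decide(input);
        Console.WriteLine(decision);
    }
}
```

The same machine can then be used exactly like an SVM trained inside Accord.NET.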

Also, whenever you are ready to start learning a machine, please be sure that you are running your application in Release mode (instead of Debug), and that the application is not being run from inside Visual Studio. If Visual Studio is attached to the process, it can impact performance a lot.

Hope it helps!

Regards, Cesar

andy-soft commented 7 years ago

Hi Cesar

Thanks a lot for the upgrade. I was using a C++-to-C# port of LIBSVM done by Matthew Johnson in 2008 (.NET 2.0, as of version 2.89), and it has big performance issues, also in training linear sets; its internal memory/class architecture is obsolete and dull. I redesigned it for better performance, but since most of your Accord libraries didn't work with sparse data, I could not use them: I am doing NLP, and my data is mostly sparse and therefore too large to fit into a dense matrix. I built a hashing algorithm to accommodate the sparseness evenly, as well as some random projections to treat the data as dense and be able to apply linear algebra like SVD. So your upgrade will help me a lot; I'm just hoping your library is as fast as the original C++ I've tested, which is really a racing car!

I am thinking of using some CRFs, but the implementation you've done also uses dense double[] arrays. Have you ever thought about lowering the resolution? Single precision is more than welcome in neural networks and gradient-descent solvers. Some ANNs are even using 2-byte floats, in a new format with a limited range (which is all the weights and calculations in any trained system usually need); they use a 12-bit mantissa, which gives about a 0.05% rounding error. They run flawlessly on CUDA hardware, though unfortunately not on x86 and its derivatives. Single precision alone allows almost double the speed and half the memory footprint.

Even in an indexed scheme like sparse matrices, the int32 indexes in .NET have a pitfall of 4 bytes each: some problems need fewer than 256 indices, few need more than 65k (16 bits), and I have never seen a problem needing a sparseness range as large as the ~4x10^9 given by uint32. With sparse matrices the only drawbacks are that the data must fit into memory and that indexed access is somewhat slower, but that is compensated by the speedup of the dot-product algorithm, which uses a hopping technique alternating two indexes inside a single for-loop; it is brilliant and resource-economic!
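The "hopping" dot product mentioned above can be sketched as follows; this is my reading of the LIBLINEAR-style two-pointer loop over sorted (index, value) pairs, not code taken from either library:

```csharp
using System;

// Two-pointer ("hopping") dot product over sparse vectors stored as
// parallel (index, value) arrays sorted by index. Only positions present
// in BOTH vectors contribute, so the loop advances whichever pointer
// currently lags behind.
static class SparseMath
{
    public static double Dot(int[] xi, double[] xv, int[] yi, double[] yv)
    {
        double sum = 0;
        int a = 0, b = 0;
        while (a < xi.Length && b < yi.Length)
        {
            if (xi[a] == yi[b])
                sum += xv[a++] * yv[b++]; // indices match: accumulate product
            else if (xi[a] < yi[b])
                a++;                      // hop the left pointer forward
            else
                b++;                      // hop the right pointer forward
        }
        return sum;
    }

    static void Main()
    {
        // x = {0:1, 2:2, 5:3}, y = {2:4, 3:1, 5:2} -> 2*4 + 3*2 = 14
        Console.WriteLine(Dot(new[] { 0, 2, 5 }, new[] { 1.0, 2.0, 3.0 },
                              new[] { 2, 3, 5 }, new[] { 4.0, 1.0, 2.0 }));
    }
}
```

The cost is proportional to the number of stored entries, not the nominal dimension, which is exactly why sparse training can handle 10k+ feature spaces.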

For some internal NLP structures, where I have to store huge dictionaries in memory, I forged some special classes like the fingle float, even using logarithmic math (for use as probabilities, at a mantissa penalty). Probabilities are rarely needed with more than 2 significant digits, but their dynamic range is huge, although still constrained by the physical problem. So you can use a badly-behaved math element with all the operations overloaded, like my crooked class, and speed things up a lot by using barely one fourth of the memory needed by a double. This lets you fit data such as a Viterbi trellis into memory far more efficiently, with no practical precision loss. I called it flog.
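To illustrate the idea (a hypothetical sketch of such a log-domain type, not the actual flog code): store round(-ln(p) * 256) in 16 bits, so that multiplying probabilities becomes a saturating integer addition and each value takes 2 bytes instead of a double's 8:

```csharp
using System;

// Hypothetical 16-bit log-domain probability. Stores round(-ln(p) * 256)
// in a ushort; multiplying probabilities becomes integer addition.
// Accuracy is roughly 2-3 significant digits, with range down to e^-255.
struct LogProb
{
    const double Scale = -256.0; // ln(p) <= 0 for p in (0,1], so this keeps bits >= 0
    readonly ushort bits;

    LogProb(ushort b) { bits = b; }

    public static LogProb FromDouble(double p)
    {
        double q = Math.Round(Math.Log(p) * Scale);
        return new LogProb((ushort)Math.Min(ushort.MaxValue, Math.Max(0.0, q)));
    }

    public double ToDouble() => Math.Exp(bits / Scale);

    // p1 * p2 in the log domain: add the quantized exponents (saturating).
    public static LogProb operator *(LogProb a, LogProb b) =>
        new LogProb((ushort)Math.Min(ushort.MaxValue, a.bits + b.bits));
}
```

With 8 fractional bits per natural-log unit, the quantization keeps the relative error around 0.2%, which is plenty for Viterbi-style probability products.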

If you read Spanish (which I guess you do, based on Portuguese), why not take a look at my academic and blog site? You might be impressed by some of my discoveries. I designed a new language, along with a compiler, for which I needed to build the compiler-compiler, scanner, lexer, and so on. I even built a syntax-highlighted intelligent EBNF editor which can modify a parser specification structure, compile it, and show how it impacts a parse, all in real time as you type. It is impressive as well as very useful, because editing grammars is tedious, complicated and boring, especially when the grammar is augmented with variables and self-extended structures! I also wrote some plugins for Visual Studio that allow setting breakpoints and debugging step by step inside my new language (which is actively parsed and accepted as a VS extension language); this is amazingly useful. If you ever need something like this, I can share it in private; I am not releasing the whole thing as open source yet. Also, if you want a closer look at my short flog structure (floating-point logarithmic struct), just ask me and I'll send you the code.

One last question: I was tempted to use sparse probabilistic Dirichlet allocation to build a uniform semantic representation of words, based on a theory of a multinomial sum of distributions of the "clues", or senses, which need to be trained from large corpora like Wikipedia (this needs everything to be sparse). Then, once trained, I want to condense it with holographic reduced representations (HRR, Plate), and I wonder if you plan to include that in the huge Accord family of machine learning libraries.

best regards, and hoping you do well


ing. Andrés T. Hohendahl director PandoraBox www.pandorabox.com.ar web.fi.uba.ar/~ahohenda

andy-soft commented 7 years ago

Hi Cézar, I could not compile the development framework. Even after recompiling the main Accord library, the Liblinear sample refuses to compile with the new additions you've made. Here are the errors:

Error CS0308 The non-generic type 'ProbabilisticNewtonMethod' cannot be used with type arguments (Train.cs, line 408)
Error CS0144 Cannot create an instance of the abstract class or interface 'LinearNewtonMethod<Linear, Sparse>' (Train.cs, line 426)
Error CS0315 The type 'Accord.Statistics.Kernels.Linear' cannot be used as type parameter 'TModel' in the generic type or method 'LinearNewtonMethod<TModel, TKernel>'. There is no boxing conversion from 'Accord.Statistics.Kernels.Linear' to 'Accord.MachineLearning.VectorMachines.SupportVectorMachine<Accord.Math.Sparse, double[]>'. (Train.cs, line 426)
Error CS0453 The type 'Sparse' must be a non-nullable value type in order to use it as parameter 'TKernel' in the generic type or method 'LinearNewtonMethod<TModel, TKernel>' (Train.cs, line 426)
Error CS0308 The non-generic type 'ProbabilisticCoordinateDescent' cannot be used with type arguments (Train.cs, line 444)
Error CS0308 The non-generic type 'ProbabilisticDualCoordinateDescent' cannot be used with type arguments (Train.cs, line 453)

All in project "Liblinear (Linear SVMs)", file D:\_install\Dictionary\Accord.NET v3.3\framework-development\Samples\MachineLearning\Liblinear (SVMs)\Train.cs

best regards


andy-soft commented 7 years ago

It compiles now! I had to detach all the NuGet packages and manually re-add the development packages, version 3.3.1. OK!

But the bug on the index still remains; here are some files on which it fails!

best regards!

Attached


cesarsouza commented 7 years ago

Oops! I am so sorry; I was about to package a new release and, going through the solved issues, I just noticed that I hadn't seen your last reply to this issue. I am deeply sorry about that.

On the other hand, I haven't received the attachment with the cases that still fail; I couldn't find it in the issue. Do you think you could post it to GitHub by going to the issue page at https://github.com/accord-net/framework/issues/330 and uploading the file from there?

Again, sorry for replying this late; I really didn't notice that this issue had been updated. The release I am about to make could have included further fixes for the issue you were facing. I hope it is still not too late for you and your application.

Regards, Cesar

andy-soft commented 7 years ago

Please don't be sorry; you work for free on open source, as do I! I have solved some of the issues but never sent you the feedback, so I'm sorry too! Cheers!


cesarsouza commented 7 years ago

Do you perhaps still have some of the files on where it failed?

andy-soft commented 7 years ago

Yes, I can search for them in a couple of days, please have patience! (I am on a project deadline right now.)

& thanks for the interest! best!


chrisobs commented 7 years ago

Hi Cesar, during some testing of SVMs with and without sparse data, I also came across the "index out of range" bug. Today I found this issue and would like to upload my code for you: Program.zip

In the attached file "Program.cs" you will find the method SVM_Without_Sparse. As the name says, this is an implementation of MulticlassSupportVectorLearning without sparse data. It works fine with my input files containing 10 and 20 thousand records. The method SVM_With_Sparse is an implementation of MulticlassSupportVectorLearning with sparse data. It works fine with the input file containing 10 thousand records, but with the file containing 20 thousand records I get an "index out of range" exception while Learn is running.
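For reference, the sparse variant Chris describes would look roughly like this in Accord.NET 3.x. This is a sketch based on the framework's generic learning API with placeholder data, not Chris's actual code; the kernel and solver choices are assumptions:

```csharp
using System;
using Accord.MachineLearning.VectorMachines.Learning;
using Accord.Math;
using Accord.Statistics.Kernels;

class SparseMulticlassExample
{
    static void Main()
    {
        // Inputs in libsvm-style sparse form; a tiny placeholder dataset.
        Sparse<double>[] inputs =
        {
            Sparse.FromDense(new[] { 1.0, 0.0, 0.0 }),
            Sparse.FromDense(new[] { 0.0, 1.0, 0.0 }),
            Sparse.FromDense(new[] { 0.0, 0.0, 1.0 }),
        };
        int[] outputs = { 0, 1, 2 };

        // One linear SVM per class pair, trained with a sparse-aware solver.
        var teacher = new MulticlassSupportVectorLearning<Linear, Sparse<double>>()
        {
            Learner = p => new LinearDualCoordinateDescent<Linear, Sparse<double>>()
        };

        var machine = teacher.Learn(inputs, outputs);
        Console.WriteLine(machine.Decide(inputs[0])); // predicted class index
    }
}
```

The dense version differs only in using double[] inputs and the non-generic solver types.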

My questions to you:

To make it easier for you, I zipped my test application. The zip-File also includes a Data directory with the input files and an Output directory with some benchmark files. Due to confidentiality rules I have to send you the zip file via eMail.

Many thanks in advance, Chris

andy-soft commented 7 years ago

I had the same problems, actually I use a win64 version of liblinear (executable-binary) to train the models, and generate the model file, then I load the model file and test my data, but training times are horrible under C# and I cannot find the difference, I used a own SVM a full implementation (open sourced and modified by me) (not sparse) and the speed is comparable with any other liblinear executable, even python, something must be wrong in the algorithms. I guess!

thanks for the feedback!


cesarsouza commented 7 years ago

Hello Chris, Andy,

Many thanks for all the details on how to reproduce the issue!

It turns out that the issue was related to the presence of zero-length sparse vectors in the training data. I will be committing a fix soon, but in the meantime, it should also be possible to work around the issue by replacing any zero-valued sparse vectors with a sparse vector containing a single 0 at position 0, such as for example:

Sparse<double> zero = new Sparse<double>(new[] { 0 }, new[] { 0.0 });
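
As a quick sketch of that workaround (this assumes the training inputs are Accord's Sparse<double> vectors and that an empty vector exposes a zero-length Indices array; please adapt it to your actual data-loading code):

```csharp
using Accord.Math;   // Sparse<double>

static class SparseWorkaround
{
    // Replace any zero-length sparse vector with one that holds a single
    // explicit 0 at position 0, so the learner never sees an empty vector.
    public static Sparse<double>[] PatchEmptyVectors(Sparse<double>[] inputs)
    {
        for (int i = 0; i < inputs.Length; i++)
        {
            if (inputs[i].Indices.Length == 0)
                inputs[i] = new Sparse<double>(new[] { 0 }, new[] { 0.0 });
        }
        return inputs;
    }
}
```

Running the training inputs through a pass like this before calling Learn should avoid the exception until the fix ships.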

Regarding performance: since you have a considerable number of classes, it might be better to use the MultilabelSupportVectorMachine instead of the Multiclass one. It should be much faster to evaluate in this case, and it's very likely that the Python implementation you are comparing against is also using a one-vs-rest approach.

I will also profile the application using the data you provided and try to make it faster. Thanks again!

Regards, Cesar

andy-soft commented 7 years ago

Hi Cesar,

Thanks for the update and for spotting the issue so fast! Because I am playing with the ugly girl (doing NLP + cognitive work), I may process very long corpora from time to time, and as I am also doing research, I sometimes don't know where or what to try to solve some problems. I saw your deep learning implementation cannot be trained for the corpus and problem sizes I am facing, so I turned to CRFs, but the training times and the memory sizes of the resulting models were so huge and impractical that I discarded them, leaving me with the most common and well-known SVMs. BTW, multiclass SVMs are also not very practical; they can be trained in two ways: one-against-all plus voting, and tree-fashioned (divide and conquer), which disambiguates by checking against binary groups of classes, refining the search with each new SVM (you did not implement this) until it reaches a 1:1 comparison and picks the best class. There are also other classifiers, such as entropy-related (information gain) ones, which also work fine but lack good implementations that can deal with tons of parameters.

I will also try the MultiLabel one and hope it trains fast; my corpus is too big! (I cannot quite grasp the difference yet.)

So, here ends the story (really?). Those were my issues with SVMs, which I just shared with you; any comment will be appreciated.

best regards

Andy


cesarsouza commented 7 years ago

Hi Andy!

You know, first I would like to thank you for all the suggestions for improvements and for pointing out situations the framework was not addressing well. As you know, the absence of sparse vectors in the framework had somewhat limited its applicability to large NLP problems.

However, now that the latest big refactoring is over and there is support for Sparse vectors in the framework, it's now possible to start filling this and other gaps that have existed in the framework for a while. In a few minutes I should commit a few changes to improve the performance of linear SVMs with sparse linear kernels.

Regarding the multi-class approach for SVMs: in my experience with computer vision I've found that whenever the number of classes becomes large, the one-vs-rest (multi-label) approach becomes preferable when training SVMs. However, I have to say that I didn't know about the second method you suggested, disambiguating against groups of classes. The framework currently implements two methods for multi-class SVMs: the voting scheme that you mentioned, and the DDAG scheme of Platt et al. (https://papers.nips.cc/paper/1773-large-margin-dags-for-multiclass-classification.pdf). Do you think you could give a reference to this other method you mention?
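
For reference, the DDAG evaluation can be sketched in a few lines. This is a toy illustration, not the framework's actual code: the pairwise decider and its centroid logic are hypothetical stand-ins for trained binary SVMs.

```csharp
using System;

public class DdagSketch
{
    // Walk Platt's decision DAG: each pairwise test eliminates either the
    // first or the last surviving candidate class, so only (k - 1) binary
    // classifiers are evaluated per sample instead of all k*(k-1)/2.
    public static int DecideDdag(Func<int, int, double, bool> prefersFirst, int k, double x)
    {
        int lo = 0, hi = k - 1;
        while (lo < hi)
        {
            if (prefersFirst(lo, hi, x))
                hi--;   // class 'hi' is eliminated
            else
                lo++;   // class 'lo' is eliminated
        }
        return lo;
    }

    public static void Main()
    {
        // Toy stand-in for trained pairwise SVMs: class i "wins" over class j
        // when x lies closer to a made-up centroid for class i.
        double[] centroids = { 0.0, 5.0, 10.0 };
        Func<int, int, double, bool> prefersFirst =
            (i, j, x) => Math.Abs(x - centroids[i]) < Math.Abs(x - centroids[j]);

        Console.WriteLine(DecideDdag(prefersFirst, 3, 9.6)); // prints 2
        Console.WriteLine(DecideDdag(prefersFirst, 3, 0.4)); // prints 0
    }
}
```

The appeal of the DAG over plain voting is that evaluation cost grows linearly with the number of classes while still using the same set of pairwise machines.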

Regards, Cesar

andy-soft commented 7 years ago

I'll search for it ASAP, but this week I have a major commercial deployment of my NLP system to a customer. I really don't remember where I saw this; as soon as I remember I'll send it to you, but no guarantee!

best!


chrisobs commented 7 years ago

Hi Cesar, thank you very much for your really fast support!

After updating the Accord.Net packages via NuGet and installing the .NET Framework 4.5.2 Developer Package I managed to run my application. It is now based on the following packages:

package id="Accord" version="3.4.0" targetFramework="net45"
package id="Accord.Controls" version="3.4.0" targetFramework="net45"
package id="Accord.IO" version="3.4.0" targetFramework="net45"
package id="Accord.MachineLearning" version="3.4.0" targetFramework="net45"
package id="Accord.Math" version="3.4.0" targetFramework="net45"
package id="Accord.Statistics" version="3.4.0" targetFramework="net45"
package id="ZedGraph" version="5.1.7" targetFramework="net45"

Test results regarding the "index out of range" bug: I am sorry, but running the method SVM_With_Sparse with the input file containing 20 thousand records still ends with an "index out of range" exception while Learn is running.

Test results regarding the performance issue: Running the method SVM_Without_Sparse with the input file containing 20 thousand records still needs about 23 seconds for prediction, so I could not see any acceleration.

Test results regarding the usage of MultilabelSupportVectorLearning: I tried this and it was quite fast. This SVM needed about 6 seconds instead of 23 seconds to predict the 20 thousand cases. But I still have to compare the prediction results, because when I tested it with 100 cases I already found one where no prediction was made (the whole boolean vector was false).
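
One possible interim workaround for that all-false case (a sketch only; it assumes you can obtain the raw per-class scores for a sample, which Accord classifiers typically expose, and simply falls back to the best-scoring class):

```csharp
using System;

public static class MultilabelFallback
{
    // Given the raw per-class scores for one sample, return the index of the
    // best-scoring class. This yields a decision even when thresholding the
    // scores would mark every class as 'false'.
    public static int ArgMax(double[] scores)
    {
        int best = 0;
        for (int i = 1; i < scores.Length; i++)
            if (scores[i] > scores[best])
                best = i;
        return best;
    }

    public static void Main()
    {
        // All scores negative: a thresholded multilabel decision would be
        // all-false, but argmax still picks the least-negative class.
        double[] scores = { -1.8, -0.2, -3.1 };
        Console.WriteLine(ArgMax(scores)); // prints 1
    }
}
```

This is essentially what a multilabel-to-multiclass conversion has to do anyway: replace per-class thresholding with a single winner-takes-all decision.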

If you need any further information feel free to contact me!

Many thanks in advance, Chris

andy-soft commented 7 years ago

Hello Cézar

Here is the promised paper of the Multi-SVM classes and methods. best!


cesarsouza commented 7 years ago

Hi Chris,

Sorry: although I mentioned I had committed a fix, I still haven't been able to publish a new NuGet package containing it. I should be able to generate one in the next few hours.

Regards, Cesar

cesarsouza commented 7 years ago

Hi Andy,

Have you included it as an attachment? I couldn't find it. Would you mind attaching it to the GitHub issue, or sending the link? Thanks!

Best regards, Cesar

andy-soft commented 7 years ago

Yes! here it comes again! (play it again, Sam)

;)


cesarsouza commented 7 years ago

I've just uploaded new pre-release packages (v3.4.1-alpha) to NuGet. Please check the "include pre-release" checkbox in your NuGet client inside Visual Studio to be able to download and install them. Please feel free to test the performance of both the Multiclass and Multilabel machines using these new binaries.

Regarding the remarks about the test results for MultilabelSupportVectorLearning: I've added a new method to MultilabelSupportVectorLearning that converts objects of this class into instances exposing a multi-class classifier interface. This way it should be possible to use it to solve multi-class instead of multi-label problems without the inconsistencies you found. I've added an example below showing how it can be used.

By the way, the example is also using a new assembly that I've just added: Accord.DataSets. If you would like to run this example, please be sure to also download and install this assembly from NuGet. Using this assembly, it is possible to download and unpack sparse datasets directly from the LibSVM website into a configurable local directory and subsequently use them in code.

Console.WriteLine("Downloading dataset:");
var news20 = new Accord.DataSets.News20(@"C:\Temp\");
var trainInputs = news20.Training.Item1;
var trainOutputs = news20.Training.Item2.ToMulticlass(); // converts from 1...n to 0...n-1
var testInputs = news20.Testing.Item1;
var testOutputs = news20.Testing.Item2.ToMulticlass();  // converts from 1...n to 0...n-1

Console.WriteLine(" - Training samples: {0}", trainInputs.Rows());
Console.WriteLine(" - Testing samples: {0}", testInputs.Rows());
Console.WriteLine(" - Dimensions: {0}", trainInputs.Columns());
Console.WriteLine(" - Classes: {0}", trainOutputs.DistinctCount());
Console.WriteLine();

// Create and use the learning algorithm to train a sparse linear SVM
var learn = new MultilabelSupportVectorLearning<Linear, Sparse<double>>()
{
    // using LIBLINEAR's L2-loss SVC dual for each SVM
    Learner = (p) => new LinearDualCoordinateDescent<Linear, Sparse<double>>()
    {
        Loss = Loss.L2,
        Tolerance = 1e-4
    },
};

// Display progress in the console
learn.SubproblemFinished += (sender, e) =>
{
    Console.WriteLine(" - {0} / {1} ({2:00.0%})", e.Progress, e.Maximum, e.Progress / (double)e.Maximum);
};

// Start the learning algorithm
Console.WriteLine("Learning");
Stopwatch sw = Stopwatch.StartNew();
var svm = learn.Learn(trainInputs, trainOutputs);
Console.WriteLine("Done in {0}", sw.Elapsed);
Console.WriteLine();

// Compute accuracy in the training set
Console.WriteLine("Predicting training set");
sw = Stopwatch.StartNew();
int[] trainPredicted = svm.ToMulticlass().Decide(trainInputs); // convert to Multi-class
Console.WriteLine("Done in {0}", sw.Elapsed);

double trainError = new ZeroOneLoss(trainOutputs).Loss(trainPredicted);
Console.WriteLine("Training error: {0}", trainError);
Console.WriteLine();

// Compute accuracy in the testing set
Console.WriteLine("Predicting testing set");
sw = Stopwatch.StartNew();
int[] testPredicted = svm.ToMulticlass().Decide(testInputs); // convert to Multi-class
Console.WriteLine("Done in {0}", sw.Elapsed);

double testError = new ZeroOneLoss(testOutputs).Loss(testPredicted);
Console.WriteLine("Testing error: {0}", testError);

Unfortunately I haven't been able to test the example above using the assemblies I just uploaded to NuGet, so please excuse me if it doesn't work right away.

Best regards, Cesar

chrisobs commented 7 years ago

Hi Cesar, again - thank you very much for your support!

I will start working on the SVM subject this morning and will send you my test feedback ASAP (probably today).

Cheers, Chris

chrisobs commented 7 years ago

Hi Cesar, my first comment: "great job!"

Meanwhile I did some testing and attached the results as a JPG: testresults

Test results regarding the "index out of range" bug: It seems to be fixed; I could now run the method SVM_With_Sparse with the input file containing 20 thousand records. But comparing the prediction results with SVM_Without_Sparse I detected 30 differences: for records 12520 and 12551 the prediction was 0 instead of 3, and for records 19364 to 19391 the prediction was 1 instead of 0. Note: in my test I am using the same input data for Learn and Decide! Do you have any idea what the reason might be?

Test results regarding the performance issue: The performance has improved tremendously. The prediction for 10 thousand records now needs 0:09 instead of the previous 2:26.

Test results regarding the usage of MultilabelSupportVectorLearning: I will probably test this later today.

Again - thank you very much for your support!

Regards, Chris

cesarsouza commented 7 years ago

Added in release 3.6.0.