Peteysoft / libmsci

All my scientific software thrown into one bucket.

What does the "train" file mean for the command pdf_agf? #4

shekmun closed this issue 8 years ago

shekmun commented 8 years ago

As far as I know, calculating the probability density function (pdf) using kernel density estimation (kde) does not need a training procedure.

The syntax is like this: pdf_agf [-n] [-k k] [-W Wc] train test output

The simple example explained here (http://scikit-learn.org/stable/modules/density.html) shows that the pdf is a summation of kernels. So why would I need a train file, and how can I get one?

P.S. The only feature I need is calculating the pdf using kde in one dimension. How can I do this in a simple way?

Thanks in advance.

Peteysoft commented 8 years ago

The "train" parameter is the name of a data file in binary format containing samples of points: the "training data". The "test" parameter is the name of a data file (also in binary format) containing points ("test points") at which an estimate for the density of the training data is desired. The "output" parameter is the name of the file in which to store the results.

The points comprising the training data are sometimes assumed to be distributed according to some underlying probability density function, but need not be. The density is nonetheless a meaningful quantity: in this case it is simply the approximate number of points per unit volume (per unit length in one dimension, per unit area in two), divided by the total number of points.
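As a rough illustration of that definition in one dimension, the sketch below counts the training points falling inside a small window around a location and divides by the window width and the total number of points. The function name `window_density` and the `width` parameter are made up for this example, and this is not how pdf_agf computes its estimate; it only shows what "points per unit length, divided by the total number of points" means.

```python
def window_density(train, x, width):
    """Crude 1-D density estimate at x (illustrative only).

    Counts the training points within width/2 of x, then divides by
    the window width and by the total number of training points.
    """
    count = sum(1 for t in train if abs(t - x) <= width / 2)
    return count / (len(train) * width)


# Hypothetical training samples: three clustered near 0.1, two near 1.0.
train = [0.0, 0.1, 0.2, 0.9, 1.0]

dense = window_density(train, 0.1, 0.4)   # window covers three points
sparse = window_density(train, 0.5, 0.4)  # window covers no points
print(dense, sparse)
```

A kernel estimator replaces the hard-edged window with a smooth weighting function, but the quantity being estimated is the same.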

In the example you link to, there are also two inputs, training samples and test points, just as with pdf_agf. For simplicity they are one and the same: the density is estimated at the same points as the training samples.

Input the training data (X):

kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)

Estimate the density at the test points (X); note that score_samples returns the logarithm of the density:

kde.score_samples(X)

To summarize:

train: contains a distribution of points
test: contains points where an estimate of the density of the distribution is desired
output: estimates are written here
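For the one-dimensional case asked about, here is a minimal sketch of the same train/test split in plain Python. It is an ordinary fixed-bandwidth Gaussian kernel estimator, not the algorithm pdf_agf implements, and it works on in-memory lists rather than the binary files pdf_agf reads. The function name `gaussian_kde_1d`, the sample values, and the bandwidth of 0.5 are all assumptions chosen for illustration.

```python
import math


def gaussian_kde_1d(train, test, bandwidth):
    """Fixed-bandwidth Gaussian KDE in one dimension (illustrative only).

    train: samples whose density is being estimated
    test:  locations at which the density estimate is evaluated
    Returns one density value per test point.
    """
    n = len(train)
    # Normalization so that the estimate integrates to one.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in test:
        # Sum one Gaussian kernel per training sample, evaluated at x.
        total = sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train)
        densities.append(norm * total)
    return densities


# Hypothetical training samples and an evenly spaced grid of test points.
train = [-1.2, -0.7, -0.1, 0.3, 0.4, 1.1, 1.8]
step = 0.1
test = [i * step for i in range(-60, 61)]

density = gaussian_kde_1d(train, test, bandwidth=0.5)

# Sanity check: the estimate should integrate to roughly one over the grid.
print(sum(density) * step)
```

Here the test points form a regular grid that differs from the training samples, which makes the separate roles of the two inputs explicit; passing `train` as the test argument reproduces the scikit-learn example, where both are the same array.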

Hope this helps.

shekmun commented 8 years ago

Thanks. I see what you mean.

So the train file is the original data from which the pdf is estimated, and the test file holds the exact points at which that pdf is evaluated.

Thank you for your detailed answer. I appreciate your work.