csgillespie / poweRlaw

This package implements both the discrete and continuous maximum likelihood estimators for fitting the power-law distribution to data. Additionally, a goodness-of-fit based approach is used to estimate the lower cutoff for the scaling region.

Question - input data format (raw data vs. frequencies) #7

Closed scais closed 11 years ago

scais commented 11 years ago

Hello,

I would like to ask which kind of data I should use as input when creating a displ$new(data) object. The example says "The Moby Dick dataset contains the frequency of unique words", so does that mean my data should be in frequency format before I pass it to displ$new?

To illustrate, here is an example. If I repeatedly throw a 10-sided die, I get these numbers:

3 7 5 3 2 1 10 8 4 1 1 1

If I compute frequencies from these numbers, I get:

1 - 4
2 - 1
3 - 2
4 - 1
5 - 1
6 - 0
7 - 1
8 - 1
9 - 0
10 - 1

Which numbers do I pass as the argument to the function: the "raw data" (3, 7, 5, 3, ...) or the "frequencies" (4, 1, 2, 1, ...)?

I am asking because the displ object has an internal field containing frequencies. But these seem to be "frequencies from frequencies", which made me think the input should be the "raw data".

Thank you,

Stepan

csgillespie commented 11 years ago

Hi Stepan,

In general, the input is the "raw data".

Regarding your question about Moby Dick: in this example, the data collected would have been individual words, so the raw data here is how many times each word appeared, i.e. the word frequency.

The reason I use "frequencies from frequencies" within the distribution object is for efficiency.

Does that make sense?

Cheers

Colin
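
The distinction discussed in this thread can be sketched in R. This is a minimal illustration, not code from the package: the `throws` vector is the dice example from the question, and `words` is a hypothetical toy corpus standing in for the Moby Dick text. The final `table()` call only approximates what "frequencies from frequencies" means; how the distribution object stores this internally is an assumption here.

```r
# Raw data from the question: one entry per dice throw.
throws <- c(3, 7, 5, 3, 2, 1, 10, 8, 4, 1, 1, 1)

# Tabulated counts per face (what the question calls "frequencies").
# These are NOT what should be passed to displ$new() for this data.
face_counts <- table(factor(throws, levels = 1:10))
print(as.vector(face_counts))  # 4 1 2 1 1 0 1 1 0 1

# Pass the raw observations instead (only runs if poweRlaw is installed).
if (requireNamespace("poweRlaw", quietly = TRUE)) {
  m_dice <- poweRlaw::displ$new(throws)
}

# Moby Dick analogue with a hypothetical toy corpus: the observations
# are words, and the quantity being modelled is how often each word occurs.
words <- c("the", "whale", "the", "sea", "whale", "the")
word_counts <- as.vector(table(words))  # one count per unique word

# Here the per-word counts ARE the raw data for displ$new().
if (requireNamespace("poweRlaw", quietly = TRUE)) {
  m_words <- poweRlaw::displ$new(word_counts)
}

# "Frequencies from frequencies": how many words share each count.
# Assumed to resemble the compact summary kept for efficiency.
count_of_counts <- table(word_counts)
```

In both cases the input is one number per observed unit: one value per throw for the dice, one occurrence count per unique word for the text.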