IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Organise > Column: Factor > Factor Data Frame Dialogue - with a look to climatic #2016

Open rdstern opened 7 years ago

rdstern commented 7 years ago

I couldn't find an existing issue on this dialogue. Please move this over, if I have missed it.

Thank you for adding this dialogue. The idea will become important. It is great to have this initial version working. There are one or two improvements to make and some important issues that this dialogue raises.

  1. I was confused by the label "Factor Names". Please change that to "Data Frame Name". And the default could be the name of the factor column.
  2. I then tried to use the dialogue again and to replace the data frame. I don't think this feature is working. At least I changed the name and nothing changed in my factor data frame.

Now some questions. These partly relate to future ways I expect us to use this feature (i.e. to have a data frame for the factor).

  1. I wonder about having an initial column with the integers 1, 2, 3. It could be called "Order". At least this could be an option.
  2. The general idea is that from the factor we could produce the data frame - as we do here. But also we can now add additional columns to this data frame. Then we could (if we wish) change the Levels/Labels into those from another column in this data frame.
  3. Or our summary statistics, etc, could add summaries if they are from the same factors.
  4. We can alternatively start with a (factor) data frame and then use it (perhaps via the Levels/Labels dialogue) to add the labels to a factor column in another data frame.
  5. If we can change the factor data frame, then we must be a little careful with this dialogue - when we have made changes - so we don't then simply overwrite. I assume there will be a link in our metadata, between the data data frame and the factor data frame. Perhaps, in a factor data frame, we should also note if it is changed. If so, then we can't simply overwrite it.
  6. Similarly we must be careful if we have a factor data frame, and make changes in the factor column of the data, e.g. reorder levels, or delete unused levels. Do we then give an option to reflect those changes in the factor data frame?
  7. I assume if we have 2 factor data frames for the same factor, then the second takes precedence, and the first simply becomes an ordinary data frame, i.e. a factor can only have one factor data frame linked to it.
  8. If so, then information from other related (or earlier) factor data frames can be added (through merge) to our existing factor data frame.
  9. Looking ahead to climatic, then one facility will be to read data from Climsoft.
    a) We will be able to read data from the station table, and this will give us the station information (potentially) for all stations in a country. This will include lat and long and altitude, etc.
    b) Then we will also read some climatic data from one or more stations. I assume the default will then be to have the climatic data "stacked", so "Station" can be a factor.
    c) Now we make our factor data frame - with just the stations we are analysing.
    d) Now we must ensure the names match - which is likely as they come from the same database.
    e) Now we can merge information from our station data frame.
    f) Then we are well-placed to add further "station-level" data from our data data frame.

rdstern commented 7 years ago

I have tried using this dialogue again. I have another bug to add to the list above. We need to be getting back to some of these issues soon - not full-time, but splitting work duties.

Anyway I used the survey data and did a factor sheet for Village. That worked fine. Then I did it again - gave the new one a different name and un-ticked the option to save the contrasts. I pressed OK and it ignored me. Nothing happened. So I deleted the previous factor sheet and tried again. This time I got an error, namely:

Error running R command(s)

Error in names(factor_data_frame)[2:ncol(factor_data_frame)] <- paste0("C", : 'names' attribute [2] must be the same length as the vector [1]

The error occurred in attempting to run the following R command(s):

InstatDataObject$create_factor_data_frame(data_name="survey", include_contrasts=FALSE, replace=TRUE, factor_data_frame_name="Vill", factor="Village.")
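
A possible reading of this error, sketched under assumptions: the body of `create_factor_data_frame` is not shown in this thread, so the following only reproduces the index expression quoted in the message. With `include_contrasts=FALSE` the data frame presumably has a single column, and `2:ncol(x)` then counts downwards to `c(2, 1)` instead of being empty. The data and the helper `rename_contrasts` below are hypothetical, for illustration only.

```r
# Hypothetical one-column factor data frame (levels only, no contrast columns).
factor_data_frame <- data.frame(Village = c("A", "B", "C", "D"))

# With one column, 2:ncol(x) is 2:1, i.e. c(2, 1), not an empty index.
bad_index <- 2:ncol(factor_data_frame)

# Guarded rename (hypothetical helper): only rename contrast columns if any exist.
rename_contrasts <- function(df) {
  n_contrasts <- ncol(df) - 1
  if (n_contrasts > 0) {
    # seq_len(0) is empty, so this index is safe by construction.
    names(df)[1 + seq_len(n_contrasts)] <- paste0("C", seq_len(n_contrasts))
  }
  df
}

renamed_single <- rename_contrasts(factor_data_frame)
with_contrasts <- rename_contrasts(
  data.frame(Village = c("A", "B"), x = 1:2, y = 3:4)
)
```

Using `seq_len()` rather than `2:ncol(x)` is the usual way to avoid this kind of downward-counting index in R; whether that is the right fix here depends on the actual method code.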

Lunalo commented 7 years ago

I am not sure if @lilyclements has started working on this. If not, then I can start working on it.

lilyclements commented 7 years ago

I haven't started working on it yet. If you are free to work on it then feel free to - although let me know if/when you do

dannyparsons commented 7 years ago

Fixing this needs some work in the R method I would guess. Shouldn't be much so happy for you guys to sort it out, let me know if you need any advice.

lilyclements commented 7 years ago

I just wanted to clarify something, because even if John takes this on I still want to understand this dialog. To begin with, I don't fully understand what we want this dialog to do. How can it help the user? I suppose the answer to that question can either lead to more questions, or clear up for me the list of items above to add to this dialog.

Lunalo commented 7 years ago

@lilyclements

I think what the dialog does:-

First, it gets the levels of the factor column.

Second, it converts the output into a data frame.

Lastly, it binds the contrast matrix to this already created data frame.
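
The three steps above can be sketched in base R roughly as follows. This is not the R-Instat method itself: `make_factor_data_frame` and the column name `Level` are hypothetical, and the contrast columns are named `C1`, `C2`, ... only because the error message earlier in the thread suggests a `paste0("C", ...)` naming.

```r
# Hypothetical sketch of the three steps; not the actual R-Instat code.
make_factor_data_frame <- function(x, include_contrasts = TRUE) {
  stopifnot(is.factor(x))
  # Step 1: get the levels of the factor column.
  lv <- levels(x)
  # Step 2: turn the levels into a data frame.
  out <- data.frame(Level = lv, stringsAsFactors = FALSE)
  # Step 3: bind the contrast matrix (needs at least two levels).
  if (include_contrasts && length(lv) > 1) {
    cm <- contrasts(x)  # one row per level, nlevels - 1 columns
    colnames(cm) <- paste0("C", seq_len(ncol(cm)))
    out <- cbind(out, as.data.frame(cm))
    rownames(out) <- NULL
  }
  out
}

village <- factor(c("a", "b", "c", "a"))
with_contrasts <- make_factor_data_frame(village)
levels_only <- make_factor_data_frame(village, include_contrasts = FALSE)
```

Here `with_contrasts` has one row per level and the columns `Level`, `C1`, `C2`, while `levels_only` keeps just the `Level` column.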

Lunalo commented 7 years ago

@rdstern

Apart from understanding what the method is doing, I am yet to understand how it will help the user achieve something. I therefore request more explanation of its use.

Once I understand that part, I can edit the method.

Thanks.

rdstern commented 7 years ago

OK, take the simple survey data that we often use - as an example. Look at the first column, which is village.

Often the data will arrive already at 2 levels. There could be the "Village level" with information about the villages. In our case this could just have 4 rows and give the lat, long and altitude of the village. There may be other information, such as the number of households in the village.

Now, if we had that information - with the village name, we could also have a number perhaps associated with each village - say 1, 2, 3, 4!

Now we do a sample survey where we just enter the village number and all the data. We currently have 36 observations - at the household level. Then we make village into a factor (in this household file), and the village names could then be attached, using (an extended version of) our Levels/Labels dialogue. But then the links between these 2 data frames would enable us to also do much more, e.g. we could show the villages on a map, because we have the additional village information.
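
As a rough illustration of that link in base R (all names and values below are made up, not the real survey data): the village-level data frame supplies the labels for the village numbers, and a merge then brings the village positions alongside the household rows.

```r
# Hypothetical village-level data frame: one row per village.
villages <- data.frame(
  number = 1:4,
  name = c("Village A", "Village B", "Village C", "Village D"),
  lat = c(-1.10, -1.25, -1.31, -1.42),
  long = c(36.80, 36.95, 37.02, 37.10),
  stringsAsFactors = FALSE
)

# Hypothetical household-level data, with the village entered as a number.
households <- data.frame(
  village = c(1, 3, 2, 4, 1, 2),
  yield = c(30, 42, 25, 38, 51, 47)
)

# Attach the village names as factor labels, taken from the village table.
households$village <- factor(
  households$village,
  levels = villages$number,
  labels = villages$name
)

# The shared names now link the two data frames, e.g. for mapping.
with_position <- merge(households, villages, by.x = "village", by.y = "name")
```

Each household row in `with_position` carries the latitude and longitude of its village, which is the kind of information a map would need.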

We can go one step further "in that direction". That's where we have the factor (or village) data frame first. We might have village information for many villages, and then take a sample of villages for our survey. Now we can still proceed in the same way. This is all part of being able to deal with what is called multi-level data. That's a big and important subject.

Now your dialogue is helping us when we wish to go in the "other direction". We have our data at the household level and we would now like to look at the Village-level data. We start with the basic columns and may then wish to add further information - e.g. geographical position.
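
A small sketch of going in this "other direction", again with made-up data and hypothetical column names: start from household-level rows, build one row per village from the factor levels, and add village-level summaries to it.

```r
# Hypothetical household-level data.
households <- data.frame(
  village = factor(c("A", "A", "B", "C", "C", "C")),
  size = c(4, 6, 5, 3, 7, 5)
)

# One row per level of the factor: the basic village-level data frame.
village_level <- data.frame(
  village = levels(households$village),
  stringsAsFactors = FALSE
)

# Add village-level summaries; table() and tapply() both follow level order.
village_level$n_households <- as.vector(table(households$village))
village_level$mean_size <- as.vector(
  tapply(households$size, households$village, mean)
)
```

Further columns, such as geographical position, could then be added to `village_level` by hand or through a merge on the `village` column.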

And this isn't just about village. In the survey data we have variety - which has 3 levels. In most studies there is additional variety information, e.g. its duration, spreading or upright etc. That is data at the "variety level".

Hope this helps. The confusion is partly because the code is ahead of the help and documentation. I need to work on that.

Cheers

Roger

Lunalo commented 7 years ago

Thanks @rdstern

I now understand.

dannyparsons commented 7 years ago

Seems relevant for future discussions on factor grids.