crtahlin / medplot

Functions for drawing graphs in R visualizing medical information.

Timeline - add a profile plot #58

Closed crtahlin closed 10 years ago

crtahlin commented 10 years ago

Add a profile plot to the timeline. A profile plot is a longitudinal plot in which a line connects all measurement points of a subject. The horizontal axis should show measurement occasions.

Try using ggplot, as it should (optionally?) support faceting by symptoms.

crtahlin commented 10 years ago

I am prototyping the graph in ggplot. Measurement occasions are converted with as.factor, and missing values are omitted, which produces these warnings: Warning messages: 1: Removed 1 rows containing missing values (stat_summary). 2: Removed 1 rows containing missing values (geom_path).

The plot looks like this (showing only 4 symptoms): image

The red squares are the medians at each measurement occasion. The lines connect the values of each individual. It is not very clear how the values are moving: there is too much clutter. I think the medians are the most revealing, not the lines themselves. Any opinions? Should I implement it as is? I guess how cluttered the result is depends on the data; some other data might be quite revealing.

llaarraa commented 10 years ago

I think that we should do the following:

0 - Add a tab in which we display this graph and call it "Distribution of the variables over time", or think of a better name. Alternatively, simply add "- by measurement occasion" to all the tabs where the analysis/synthesis is done separately for each measurement occasion (I would do this).

1 - As this plot might or might not be informative, depending on the number of subjects/times, we give the user two choices, through a drop-down menu:

1b - select a random subset (n=X) of the subjects to display

1c - draw many graphs, each displaying at most X patients

My code for 1b/1c is reported below (for one symptom, no Shiny features included).

1d - draw a simple graph that includes a boxplot for each measurement time; for the symptoms it looks OK

############## code that should work, written for 1 symptom,
# except for the assignment input$file1$datapath = "C:/Users/lara/Dropbox/medplot/ForSymptoms/DataEM.txt" below:
# fix the path

input=vector("list")
input$file1=vector("list")
input$file1$datapath = "C:/Users/lara/Dropbox/medplot/ForSymptoms/DataEM.txt"

input$dateVar="Date"
input$patientIDVar="PersonID"
input$measurementVar="Measurement"
input$groupingVar="Sex"

data <-  read.csv(input$file1$datapath, header=TRUE, sep="\t")
input$selectedSymptoms=names(data)[-c(1:8)]

# transform date information into R compliant dates

data["Date"] <- as.Date(data[,"Date"], "%d.%m.%Y")

dataFiltered=data

input$selectedSymptoms

j=10

    matplot(dataFiltered[,input$measurementVar],  dataFiltered[,j], lty=2, type="n", xlab="Measurement", ylab=names(dataFiltered)[j])

for(my.id in unique(dataFiltered[,input$patientIDVar])){
    temp.data=dataFiltered[is.element(dataFiltered[,input$patientIDVar], my.id)   ,]

    #add a bit of noise on the x-axis
    j.data=jitter(temp.data[,input$measurementVar])

    matlines(j.data,  temp.data[,j], lty=2, col=1:10)
    matpoints(j.data,  temp.data[,j], lty=2, pch=1)

}

j=10

#random sample a subset of patients, say 20 
num.displayed=20
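which.t0=which(dataFiltered[, input$measurementVar]==min(dataFiltered[, input$measurementVar], na.rm=TRUE))
num.samples.t0=length(which.t0)
#sample at most num.displayed patients (sample() fails if asked for more than are available)
which.use=sample(num.samples.t0, min(num.displayed, num.samples.t0))

#empty plot; lty has no effect with type="n", and k is not defined yet at this point
matplot(dataFiltered[,input$measurementVar],  dataFiltered[,j], lty=2, type="n", xlab="Measurement", ylab=names(dataFiltered)[j])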

    k=0
for(my.id in unique(dataFiltered[,input$patientIDVar])[which.use]){

    #cycle k through 1..10 to use different colors and dashed lines - set back to k=1 to get all black and solid lines
    k=ifelse(k<10, k+1, 1)

    temp.data=dataFiltered[is.element(dataFiltered[,input$patientIDVar], my.id)   ,]
    matlines(temp.data[,input$measurementVar],  temp.data[,j], lty=k, col=k)
    matpoints(temp.data[,input$measurementVar],  temp.data[,j], lty=k, pch=1)

}
############# end of random sample example

############### select the maximum number of patients per graph, patients are grouped based on the value of the variables at t=0

j=10

which.t0=which(dataFiltered[, input$measurementVar]==min(dataFiltered[, input$measurementVar], na.rm=T))
num.samples.t0=length(which.t0)

num.patients.per.graph=10
num.graphs=ceiling(num.samples.t0/num.patients.per.graph)

#num.graphs groups need num.graphs+1 break points
my.breaks=round(seq(1, num.samples.t0+1, length.out=num.graphs+1))

#par(mfrow=c(ceiling(num.graphs/2) , 2)) #reset back, does not work in R (figure margins too large), will work in the browser if the figure is set big enough

for(i in seq_len(num.graphs)){

#   which(rank(dataFiltered[which.t0,j])<=num.samples.t0/4)
    #patients whose rank at t0 falls in [my.breaks[i], my.breaks[i+1])
    rank.t0=rank(dataFiltered[which.t0,j], ties.method="first")
    which.names.use=dataFiltered[which.t0,input$patientIDVar][rank.t0>=my.breaks[i] & rank.t0<my.breaks[i+1] ]

matplot(dataFiltered[,input$measurementVar],  dataFiltered[,j], lty=2, type="n", xlab="Measurement", ylab=names(dataFiltered)[j])

    k=0

for(my.id in which.names.use){
    #cycle k through 1..10 (colors and line types)
    k=ifelse(k<10, k+1, 1)
    temp.data=dataFiltered[is.element(dataFiltered[,input$patientIDVar], my.id) ,]
    j.data=jitter(temp.data[,input$measurementVar])
    j.data.y=jitter(temp.data[,j])
    #matlines(j.data,  temp.data[,j], lty=k, col=k)
    #matpoints(j.data,  temp.data[,j], pch=1)
    matlines(j.data,  j.data.y, lty=k, col=k)
    matpoints(j.data,  j.data.y, pch=1)

}#end for my.id

}#end for i
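
The grouping step for option 1c (at most X patients per graph, with patients ordered by their value at the first measurement occasion) can also be sketched more compactly. This is only an illustration, not the code used in medplot: the helper name `group_by_baseline`, its arguments, and the toy data are assumptions made up for the example, and it chunks the sorted patients directly instead of using rounded break points.

```r
# Hypothetical helper (not part of medplot): split patients into groups of
# at most max_per_graph, ordered by their baseline value.
group_by_baseline <- function(ids, baseline, max_per_graph = 10) {
  ord <- order(baseline, na.last = TRUE)             # sort patients by baseline value
  chunk <- ceiling(seq_along(ord) / max_per_graph)   # 1,1,...,1,2,2,...
  split(ids[ord], chunk)                             # one vector of IDs per graph
}

# Toy data (made up): 23 patients with baseline scores between 0 and 10
set.seed(1)
ids <- sprintf("P%02d", 1:23)
baseline <- sample(0:10, 23, replace = TRUE)

groups <- group_by_baseline(ids, baseline, max_per_graph = 10)
lengths(groups)   # group sizes: 10, 10, 3
```

Each element of `groups` could then be passed to the inner plotting loop in place of `which.names.use`.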
crtahlin commented 10 years ago

Code for ggplot (random sample included):

# load libraries
library(ggplot2)
library(reshape2) # provides melt(), used below

# load data - Crt
dataTest <- read.csv("C:/Users/Crt Ahlin/Documents/Dropbox/medplot_shared_Crt/ForSymptoms/DataEM.txt",
         header=TRUE, sep="\t")

# load data - Lara
dataTest <- read.csv("C:/Users/lara/Dropbox/medplot/ForSymptoms/DataEM.txt",
                     header=TRUE, sep="\t")

# draw sample
sizeofSample <- 10
peopleInSample <- sample(unique(dataTest[,"PersonID"]), sizeofSample)
dataRandomSample <- dataTest[dataTest[, "PersonID"] %in% peopleInSample, ]

# prepare data
dataMelted <- melt(data=dataRandomSample, id.vars=c("Measurement", "PersonID"), measure.vars=c("Fatigue","Malaise","Headache", "Insomnia") )

# set some variables as factors
dataMelted[,"PersonID"] <- as.factor(dataMelted[,"PersonID"])
dataMelted[,"Measurement"] <- as.factor(dataMelted[,"Measurement"])

# code to draw graph
  # define x, y axis, groups, coloring
p <- ggplot(data=dataMelted, aes(x=Measurement, y=value, group=PersonID, colour=PersonID)) +
  # draw points, draw lines, facet by symptom, use black & white theme
  geom_point() + geom_line() +  facet_grid(variable~.) + theme_bw() +
  # add summary statistics at each point
  stat_summary(aes(group=1), geom="point", fun=median, shape=15, size=5, colour="red") # ggplot2 >= 3.3.0; older versions call this argument fun.y
# plot
print(p)  

image

crtahlin commented 10 years ago

Answers:

> 0 - add a tab in which we display this graph and call it: Distribution of the variables over time - or think a better name. or simply add ( - by measurement occasion in all the tabs where the analyses/synthesis is done for each time occasion separately - would do this)

Ok. Almost all tabs (except Timeline and Distribution: by grouping variable) are actually by measurement occasion. So I will add ": by measurement occasion" to all of them, and name this new one "Distribution of the variables: over time", to keep things consistent.

> 1 - as this plot might or might not be informative, depending on the number of subjects/times, we give the user two choices, through a drop down menu: 1b - select a random subset (n=X) of the subjects to display 1c- do many graphs, in which at most X patients are displayed my code for 1b/c is reported below (for 1 symtp. , no shiny features included )

Ok, I can do this relatively quickly in ggplot (as I have the concept how to do it in my head). I also can relatively quickly implement #63 and any other additional faceting in ggplot, so I am rooting for the ggplot solution.

  1. should the size of the subset (X) be user selectable?
  2. I guess the horizontal axis should show measurement occasions (not days)? With measurement occasions it makes more sense to connect the points with lines than with days (with days on the horizontal axis it looks as if we are interpolating).

> 1d - do a simple graph that includes a boxplot for each measurement time- in the case of the sympt. it looks ok

Great. This will probably be a lot less cluttered and more helpful. Should it go on the same (new) tab as the profile plot?
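
For reference, a minimal sketch of the boxplot idea (1d) in base R. The data are simulated and the column names `Measurement` and `Fatigue` merely mirror the example data set mentioned in this thread; this is an assumption-laden illustration, not the implementation that ended up in medplot.

```r
# Toy data (simulated): one symptom score for 30 patients at 4 occasions
set.seed(42)
toy <- data.frame(
  PersonID    = rep(1:30, times = 4),
  Measurement = rep(1:4, each = 30),
  Fatigue     = pmax(0, round(rnorm(120, mean = rep(c(6, 4, 3, 2), each = 30), sd = 2)))
)

# One box per measurement occasion; the occasion is treated as a factor,
# so the boxes are equally spaced regardless of the time between visits
bp <- boxplot(Fatigue ~ factor(Measurement), data = toy,
              xlab = "Measurement occasion", ylab = "Fatigue")

# boxplot() invisibly returns the plotted statistics;
# row 3 of bp$stats holds the median of each box
bp$stats[3, ]
```

Unlike the profile plot, the number of graphical elements here does not grow with the number of patients, which is why it tends to stay readable.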

llaarraa commented 10 years ago

1 - Yes, I would make it selectable (within the tab; default: displays all; if selected, displays a subset).

2 - This type of graph by dates would not make much sense; it could make sense for days since enrollment.

1d - If space permits, I would put the boxplot on the same tab.
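
As an illustration of the "days since enrollment" alternative, the sketch below derives such a variable from calendar dates. It assumes that a patient's first recorded date is the enrollment date, which the thread does not state; the data are made up and only the column names `PersonID` and `Date` and the date format follow the example data set.

```r
# Toy data (made up); dates use the same "%d.%m.%Y" format as the example file
visits <- data.frame(
  PersonID = c(1, 1, 1, 2, 2),
  Date = as.Date(c("01.03.2014", "15.03.2014", "01.04.2014",
                   "10.03.2014", "24.03.2014"), format = "%d.%m.%Y")
)

# Days elapsed since each patient's first recorded date
# (assumption: the first record is the enrollment visit)
visits$DaysSinceEnrollment <- ave(
  as.numeric(visits$Date), visits$PersonID,
  FUN = function(d) d - min(d)
)

visits$DaysSinceEnrollment   # 0 14 31 0 14
```

Plotting against this variable puts all patients on a common time origin, which calendar dates do not.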

llaarraa commented 10 years ago

In "Distribution of the variables over time" we should also add a table that summarizes the data over all measurement occasions: something like the table in "Distribution of the variables - by measurement occasion", but with all the data pooled together.
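
A rough sketch of such a pooled table in base R follows. The choice of statistics (N, missing, median, quartiles, range) is an assumption, since the thread does not list the columns of the existing per-occasion table, and the data are made up.

```r
# Toy data (made up): two symptoms scored over three measurement occasions
scores <- data.frame(
  Measurement = rep(1:3, each = 4),
  Fatigue  = c(2, 4, NA, 6, 1, 3, 5, 7, 0, 2, 2, 4),
  Headache = c(0, 1, 1, 2, 0, 0, 3, 5, 1, 1, 2, 2)
)
symptoms <- c("Fatigue", "Headache")

# One row per symptom, ignoring the Measurement column so that
# all occasions are pooled together
pooled <- t(sapply(scores[symptoms], function(x) c(
  N       = sum(!is.na(x)),
  Missing = sum(is.na(x)),
  Median  = median(x, na.rm = TRUE),
  Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
  Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
  Min     = min(x, na.rm = TRUE),
  Max     = max(x, na.rm = TRUE)
)))

pooled
```

Note that pooling treats repeated measurements of the same patient as independent observations, so the table is descriptive only.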

crtahlin commented 10 years ago

Have implemented 1, 1a, 1b in cf30eb20b781c97c1b8d06e4b8041f94718ad4d0 . Since the graph takes quite a long time to plot, I have made 1a the default choice. If needed, I can change that quickly.

1d and table yet to be implemented.

crtahlin commented 10 years ago

1d (boxplots) done in 023582795293e773eeb23c641ee7ed790d6bb046 . I have put them in a separate tab with the working name "?Variables: over time w boxplots?". We should perhaps rename all the "Distribution of variables: ..." tabs into "Variables: ...", to take less space?

Table remains to be implemented.

crtahlin commented 10 years ago

Tables under boxplots were done in #80 , closing.