caitiecollins / treeWAS

treeWAS: A Phylogenetic Tree-Based Tool for Genome-Wide Association Studies in Microbes

General question about phenotype data #33

Closed WallyL closed 5 years ago

WallyL commented 5 years ago

I have a phenotype set that is numeric and continuous, e.g. running from 0 to 4.51. For example, here are some values on the lower end of the spectrum:

Sample1   0
Sample2   0
Sample3   0
Sample4   0.11
Sample5   0.55
Sample6   0.68
Sample7   1.03
Sample8   1.04
Sample9   1.09
Sample10  1.12
Sample11  1.12
Sample12  1.17
Sample13  1.21
Sample14  1.23
Sample15  1.28
Sample16  1.28
Sample17  1.29
Sample18  1.38
Sample19  1.42
Sample20  1.46
Sample21  1.48
Etc...

Are values such as these acceptable, or would I need to modify them somehow?

xavierdidelot commented 5 years ago

Hi Walt,

Yes, this is acceptable input, but you should ask yourself whether the values are meaningful in an absolute sense. For example, is a difference of 1.03 between sample1 and sample7 really ten times more 'important' than a difference of 0.11 between sample1 and sample4? If not, or in other words if you want your analysis to be robust to rescaling of the phenotype values, then you could apply the rank function to your phenotype values before using them in treeWAS.

Best wishes, Xavier
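
As a minimal sketch of the ranking suggestion above, the following applies base R's rank() to the first few values from the question. The vector name phen.raw and the choice of ties.method = "average" (the default) are assumptions for illustration, not something the thread prescribes.

```r
# First seven values copied from the example in the question
phen.raw <- c(Sample1 = 0, Sample2 = 0, Sample3 = 0, Sample4 = 0.11,
              Sample5 = 0.55, Sample6 = 0.68, Sample7 = 1.03)

# rank() keeps the names; tied values share the average of their ranks
phen.ranked <- rank(phen.raw, ties.method = "average")
print(phen.ranked)

# Ranks are unchanged by any strictly increasing rescaling of the raw values
same.after.scaling <- identical(rank(phen.raw * 10), phen.ranked)
same.after.log     <- identical(rank(log1p(phen.raw)), phen.ranked)
print(c(same.after.scaling, same.after.log))
```

Note that the three samples with value 0 end up sharing a single rank, so ranking removes the scale of the phenotype but does not break ties.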

WallyL commented 5 years ago

Hi Xavier, thank you for the response. This data belongs to another researcher, so I don't really know what the relationships among the values mean. However, I did run rank on them, and that has consolidated some of the samples, particularly those with value = 0.

I think I am almost ready to try a run; however, I am unable to get the phenotype file into the correct format (my experience with R is a bit limited). I think my SNP matrix is ready to go, judging by comparison with the test "snps":

> str(snps)
 num [1:100, 1:20003] 0 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:100] "A_1090" "B_1078" "A_1070" "B_1083" ...
  ..$ : chr [1:20003] "1.g" "1.a" "2.t" "2.g" ...
> str(snps.m3.matrix)
 int [1:85, 1:62099] 0 1 0 0 0 NA 1 0 0 NA ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:85] "PANS_1_1" "PANS_1_10" "PANS_1_2" "PANS_1_5" ...
  ..$ : chr [1:62099] "X43.12213.g" "X43.12214.a" "X43.12215.c" "X43.12220.t" ...
> is.null(rownames(snps))
[1] FALSE
> is.null(rownames(snps.m3.matrix))
[1] FALSE

From the experimental datasets, I see that the class of "phen" is factor, so I imported my ranked data as follows (I tried the two import approaches listed, but neither validates fully when compared with the test "phen").

> phen.ranked.matrix <- as.matrix(read.table(file="phen_ranked_ordered.txt", sep="\t", header=FALSE, row.names=1))
> phen.ranked.matrix2 <- as.matrix(read.table(file="phen_ranked_ordered.txt", sep="\t", header=FALSE))
> phen_ranked_factor <- as.factor(phen.ranked.matrix)
> phen_ranked_factor2 <- as.factor(phen.ranked.matrix2)
> class(phen_ranked_factor)
[1] "factor"
> class(phen_ranked_factor2)
[1] "factor"
> head(phen_ranked_factor)
[1] 56 56 22 2  4  56
52 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 ... 56
> head(phen_ranked_factor2)
[1] PANS_1_1  PANS_1_10 PANS_1_2  PANS_1_5  PANS_1_6  PANS_1_8 
137 Levels:  1 10 11 12 13 14 15 16 17 18 19  2 20 22 23 24 25 26 27 28 ... PNA_99_9
#Check test data "phen"
> is.null(names(phen))
[1] FALSE
#Check my data
> is.null(names(phen_ranked_factor))
[1] TRUE
> is.null(names(phen_ranked_factor2))
[1] TRUE
> all(names(phen_ranked_factor) %in% rownames(snps.m3.matrix))
[1] TRUE
> all(names(phen_ranked_factor2) %in% rownames(snps.m3.matrix))
[1] TRUE
> all(rownames(snps.m3.matrix) %in% names(phen_ranked_factor))
[1] FALSE
> all(rownames(snps.m3.matrix) %in% names(phen_ranked_factor2))
[1] FALSE

Any idea why the first test validates as TRUE and the third test validates as FALSE?
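
A likely explanation, sketched with small hypothetical stand-ins (snp.rows, phen.mat) for the objects in the session above: as.factor() applied to a matrix returns a factor with no names, so names() is NULL. NULL %in% x yields an empty logical vector, and all() of an empty vector is TRUE, which is why the first test passes vacuously. In the other direction, every row name is compared against nothing, so the result is FALSE. Passing the phenotype as a named numeric vector, as in the last step, is an assumption here and not something the thread confirms.

```r
# Hypothetical stand-ins for rownames(snps.m3.matrix) and the ranked phenotype
snp.rows <- c("PANS_1_1", "PANS_1_10", "PANS_1_2")
phen.mat <- matrix(c(56, 56, 22), ncol = 1, dimnames = list(snp.rows, NULL))

# as.factor() on a matrix drops the row names: the factor has no names
phen.unnamed <- as.factor(phen.mat)
is.null(names(phen.unnamed))              # TRUE

# NULL %in% x is logical(0), and all(logical(0)) is TRUE (vacuous truth)
all(names(phen.unnamed) %in% snp.rows)    # TRUE

# x %in% NULL is FALSE for every element of x
all(snp.rows %in% names(phen.unnamed))    # FALSE

# One possible fix: extract the single column so the row names become names
phen.named <- phen.mat[, 1]
all(names(phen.named) %in% snp.rows)      # TRUE
all(snp.rows %in% names(phen.named))      # TRUE
```

With the real data, the same idea would mean reading the file with row.names=1 and taking the single column of the resulting matrix, rather than converting the whole matrix to a factor.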