caitiecollins / treeWAS

treeWAS: A Phylogenetic Tree-Based Tool for Genome-Wide Association Studies in Microbes
Other
92 stars 18 forks source link

Warning when comparing columns in snps.reconstruction and snps matrix. #47

Closed arturotorreso closed 4 years ago

arturotorreso commented 4 years ago

Hi @caitiecollins @xavierdidelot ,

I am using treeWAS with my own SNP and Phenotype reconstruction, but I keep getting this warning:

"The number of columns in the provided snps.reconstruction is not equal to the number of columns in the snps matrix. Performing a new parsimonious reconstruction instead."

I get this error even when the snps matrix and the snps.reconstruction matrix have the same column number. I spotted a couple of lines in the code of treeWAS.R that might be causing this:

In line 1811 : https://github.com/caitiecollins/treeWAS/blob/53042d518104e63e075c8b5a2ce2d82680d474d9/R/treeWAS.R#L1811

The snps matrix is replaced by the snps matrix with unique patterns in the columns:

And in line 1845: https://github.com/caitiecollins/treeWAS/blob/53042d518104e63e075c8b5a2ce2d82680d474d9/R/treeWAS.R#L1845

The number of columns of snps.reconstruction matrix is compared with the snps matrix. But at the time they are compared, if there are any repeated patterns in the snps matrix they will be removed and then it will have less columns than the snps.reconstruction matrix.

I used a small tree and alignment to check this out, and indeed when I use snps and snps.reconstruction matrices with 1 pattern repeated in two columns, I get the warning. When that column is removed, the warning is not raised. To avoid this error, snps in line 1845 could be replaced by snps.complete created in line 1809.

This also got me thinking about whether the SNP matrices I give to treeWas should indeed only have unique patterns anyway, but I'm not sure this would be the right approach.

Let me know your thoughts, Arturo

caitiecollins commented 4 years ago

Hi Arturo,

I'll take a look at this and get it sorted out as soon as I can. Might you be able to send me any sample data that you know generates this error?

Thank you. Best, Caitlin.

arturotorreso commented 4 years ago

Hi Caitlin,

Thanks for your quick response. I have been checking it out using a toy tree and phenotypes. I will attach it here. Please be aware it may not make the most sense! But it helps illustrate the issue.

Column 1 and 3 have the same pattern.

You can run it as:

treeWAS( snps = sim.data$snps, snps.reconstruction = sim.data$snps.reconstruction, phen = sim.data$phen, phen.reconstruction = sim.data$phen.reconstruction, tree = sim.data$tree, phen.type = "discrete")

This will produce a warning. The following won't:

treeWAS( snps = get.unique.matrix(sim.data$snps)$unique.data, snps.reconstruction = get.unique.matrix(sim.data$snps.reconstruction)$unique.data, phen = sim.data$phen, phen.reconstruction = sim.data$phen.reconstruction, tree = sim.data$tree, phen.type = "discrete"

Changing snps by snps.complete in line 1845 solves the warning message, but of course I don't know whether this creates unintended consequences or not!

Thanks for your help, Arturo

sim.data.RData.zip

caitiecollins commented 4 years ago

Hi Arturo,

Thank you for sending me your sample data. And thanks for describing the problem so clearly. You were certainly right about where the error was coming from.

I made a few changes, so now it should be able to handle your input snps.reconstruction data without giving you that error. You’ll need to re-download and re-install the package first, of course.

Before I let you go off on your merry analytical way though, I want to make sure I mention a few important things:

First, I should note that while it may look like treeWAS is getting rid of any non-unique columns in the lines you mentioned, it’s not actually discarding these columns. There’s just no benefit to running them through the calculations multiple times if they are truly identical. treeWAS streamlines both matrices into sets of unique columns only for computational efficiency purposes. However, it also does keep a record of the original index of all columns so that all relevant information can be re-expanded (s.t. association scores are assigned to each input column) before we assess the significance of the associations present in one of the final steps of the analysis.

Also, I should say that although you can provide a snps.reconstruction as an input to treeWAS, it’s not something I would generally encourage. The reason being that treeWAS needs to compare the snps reconstruction to the reconstruction that it does for the simulated snps, so it is best that they be done in as similar a way as possible. As such, if you can use the internal snps reconstruction functions instead of providing your own reconstruction, this is better for consistency’s sake. But I acknowledge that one may be motivated to do otherwise, perhaps for speed, or analytical or personal preference. With that said, if you do prefer to use an external reconstruction approach, I’d be curious to hear which one and would be happy to discuss how to achieve max/sufficient consistency between the external and internal reconstruction approaches.

Please let me know how it goes. And don't hesitate to ask if you run into any other issues or have any questions. Happy treeWASing!

All the best, Caitlin.