fwhelan / coinfinder

A tool for the identification of coincident (associating and dissociating) genes in pangenomes.
GNU General Public License v3.0

Question: Is it best to remove highly clonal branches? What is the relationship between D value and p value? #63

Closed martinastoycheva closed 1 year ago

martinastoycheva commented 1 year ago

Hello fwhelan,

Thank you very much for a great software! I just have a couple of questions that I cannot quite find the answer to in the publication.

  1. Is it best to remove genomes from the analysis that belong to a clonal branch of the phylogeny? I ran coinfinder with the output from panaroo, and to overcome the zero edge length error I adjusted the zero-length branches by adding a small distance to the edge length variable, as in #61. However, I noticed that the network graph output is in two parts, and I believe this may be due to the influence of the large clonal branch in the phylogeny. I am therefore unsure whether removing clonal branches from the coinfinder analysis is better, and whether keeping them can negatively affect the p-value and D-value calculations.

network.pdf

  2. My second question is about the p-value and the D-value. I am unsure whether there should be a correlation between the two statistics. I seem to observe that genes that have a good D-value have a relatively low p-value.

Below I have included an example of the phylogeny used and the gene pair with the lowest (best?) p-value. I am worried, as this gene pair seems to show genetic structure and can be considered core for the two branches it belongs to. Am I interpreting this result wrong?

image

Also, is there supposed to be a correlation between low p-value and high D-value? I am not seeing one: coinfinder_p_vs_d_plot.pdf

Best Wishes, Martina

fwhelan commented 1 year ago

Hi Martina,

Thanks very much for your questions.

  1. I think this is up to you and depends on your data. I would imagine that, assuming your phylogeny was made using the core genes, you could have a case where genomes with identical core genes have different accessory gene profiles; thus nodes that appear identical in the phylogeny might still hold interesting accessory gene information, and you might not want to remove them. It's hard to tell from the network whether the clustering into 3 groups is based on the structure of the core gene phylogeny. What I would recommend instead, if you're worried about the phylogeny structuring your data, is to prune by D-value to ensure that you aren't focussing on genes that are not independently distributed across the phylogeny. Because D-value is very dependent on the phylogeny itself, coinfinder doesn't prune by D but instead expects the user to do so based on a value that makes sense for their data. As an example, we show how we decided on a D-value cutoff in Sup Fig 3 of this recent publication: https://academic.oup.com/mbe/article/38/9/3697/6272232.
  2. There shouldn't be a correlation between p-value and D, though I could imagine that genes that are vertically transmitted perfectly together might have a very good p-value and also a very strong D. In the example you show, I would expect that after you cull for lineage-dependent genes (i.e., decide on and use a good D-value cutoff), these genes would be filtered out of the analysis, since they appear to share a recent common ancestor and to be vertically inherited.
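
For anyone reading this later, here is a minimal sketch of the D-value pruning described above. It is not part of coinfinder. The column names `ID` and `Result`, the in-memory example table, the gene names, and the cutoff of `0.0` are all assumptions made for illustration, so check them against your own node output and choose a cutoff that suits your phylogeny. It also assumes the Fritz and Purvis convention, where lower D means a gene is more clumped on the tree (lineage dependent), so pairs are kept only when both genes have D at or above the cutoff.

```python
import csv
import io


def load_d_values(nodes_tsv, id_col="ID", d_col="Result"):
    """Read gene -> D from a tab-separated node table.

    The column names are assumptions; adjust them to match your file.
    """
    d_values = {}
    for row in csv.DictReader(nodes_tsv, delimiter="\t"):
        try:
            d_values[row[id_col]] = float(row[d_col])
        except ValueError:
            continue  # skip rows where D is missing or non-numeric (e.g. "NA")
    return d_values


def filter_pairs_by_d(pairs, d_values, cutoff):
    """Keep (gene_a, gene_b, p_value) pairs where both genes have D >= cutoff."""
    kept = []
    for gene_a, gene_b, p_value in pairs:
        d_a = d_values.get(gene_a)
        d_b = d_values.get(gene_b)
        if d_a is None or d_b is None:
            continue  # no D-value available for one of the genes
        if d_a >= cutoff and d_b >= cutoff:
            kept.append((gene_a, gene_b, p_value))
    return kept


# Hypothetical data: geneA and geneB are strongly lineage dependent (low D).
nodes = io.StringIO(
    "ID\tResult\n"
    "geneA\t-1.2\n"
    "geneB\t-0.9\n"
    "geneC\t0.4\n"
    "geneD\t0.7\n"
)
pairs = [
    ("geneA", "geneB", 1e-12),  # very low p-value, but both genes are clumped
    ("geneC", "geneD", 3e-5),
    ("geneA", "geneC", 0.001),
]

D_CUTOFF = 0.0  # arbitrary placeholder, not a recommended value
kept = filter_pairs_by_d(pairs, load_d_values(nodes), D_CUTOFF)
print(kept)
```

In this toy example the pair with the lowest p-value is dropped because both of its genes fall below the cutoff, which mirrors the situation in the question: a strongly significant pair that is likely vertically inherited.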

fwhelan commented 1 year ago

Closed due to inactivity but please re-open if you have any other concerns.