albertopessia / Kpax3.jl

Bayesian bi-clustering of categorical data
MIT License

How to set the parameters of Kpax3 in order to get the robust cluster results? #4

Closed virologist closed 4 years ago

virologist commented 4 years ago

Hi, Albert

Kpax3 caught my attention because it clusters proteins based on their amino acid sequences using a Bayesian approach. Researchers usually define genotypes arbitrarily from nucleotide differences, e.g. using MEGA to divide clades based on genetic distance, with a cutoff value (≥ 2.5%) used to define a new clade/genotype. Hence, I'd like to try Kpax3 for clustering antigen clades/genotypes. I ran it several times, but the number of clusters was not consistent: my dataset (450+ AA sequences) was divided into 10-12 clusters across these runs. My question is: how can I set the parameters to get robust results? Based on your tutorial, I set T = 10^6, burnin = 20% (200000), and sampled every 100 iterations. Is that ok? What about the other parameters, like popsize, maxiter, and so on?

```julia
T = 1000000, burnin = 200000, tstep = 100, op = [0.5; 0.0; 0.5],
popsize = 20, maxiter = 20000, maxgap = 5000, xrate = 0.9, mrate = 0.005);
```

I got the clade_posterior_k.csv file. Does this mean that k = 11 (posterior probability 0.70842) is preferred?

| `k` | `p(k \| x)` |
| --- | --- |
| 9 | 0.110345 |
| 10 | 0.006755 |
| 11 | 0.70842 |
| 12 | 0.17293 |
| 13 | 0.00155 |

Best regards, Yan

albertopessia commented 4 years ago

Hi Yan,

if you are only interested in the clusters, I suggest using the MAP (maximum a posteriori) estimator rather than the MCMC estimator. The MAP is more robust as a point estimate. As you noticed, the MCMC estimate largely depends on the sample (how to find the best clustering from MCMC samples is still an open problem).

However, MCMC can give you an idea of the uncertainty of your MAP estimate. Here are some strategies:

Kpax3 provides functions to plot MCMC estimates, check them out in the tutorial. They are kpax3plotk, kpax3plotp, and kpax3plotc.

Once the prior distribution is fixed, the MAP is unique and you should always get the same solution. However, the problem is difficult and you might get multiple solutions from different runs simply because the algorithm gets stuck in local optima.

As is usual in these cases, run the kpax3ga function multiple times from different starting partitions and select the solution with the highest posterior probability.

popsize, maxiter, maxgap, xrate, and mrate are tuning parameters of the genetic algorithm used to discover the MAP estimate. The default values should work fine for most applications.

Regarding MCMC, the most important parameter is the number of steps T.
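For concreteness, here is a minimal sketch of how these tuning parameters and `T` can be set together, assuming (as the snippet in the first message suggests) that they are all keyword arguments of `KSettings`. The input FASTA path and output prefix below are placeholders:

```julia
using Kpax3

# Placeholder paths: replace with your own alignment and output prefix.
# Keyword names are the ones quoted above; only pass the ones you want to
# change, the rest keep their defaults.
settings = KSettings("clades.fasta", "clade_output",
                     T = 1000000, burnin = 200000, tstep = 100,
                     op = [0.5; 0.0; 0.5],
                     popsize = 20, maxiter = 20000, maxgap = 5000,
                     xrate = 0.9, mrate = 0.005);
```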

virologist commented 4 years ago

Hi, Albert,

Thanks for your detailed reply; it's quite informative and helpful. Let me try to sum up your instructions:

1. keep the default parameter settings, except set T = 1M;
2. run the algorithm 10 times and pick the solution with the highest posterior probability.

If I got something wrong or missed anything, please correct me. Another question: how do I plot clade_posterior_R.csv? Should it look like Figure 1.B in your paper? Sorry if this tiny question annoys you; Kpax3 is the first and only Julia package I have ever run, and I am still learning. I tried to uncomment the plotting code in your tutorial, but obviously it does not work. Like this:

```julia
function kpax3plotk(k::Vector{Int}, pk::Vector{Float64})
kpax3plotk(k, pk)
Plots.savefig("clade_posterior_k.pdf")

function kpax3plotc(site::Vector{Int}, freq::Vector{Float64}, C::Matrix{Float64})
kpax3plotc(site, freq, C)
Plots.savefig("clade_posterior_C.pdf")
```

Again, Thank you very much.

Best regards, Yan

albertopessia commented 4 years ago
  1. The default value of `T` is 100_000, but the more the better. `T = 1_000_000` will take much (much) longer, but if you can wait, the approximation will be more accurate. To save time, start with the default `T = 100_000`; if the results look promising, rerun with `T = 500_000` or `T = 1_000_000` to get more accurate numbers.

  2. Yes, 10 times should be enough. Here is a possible strategy to automate the clustering initialization:

    
```julia
# Your dataset is here called `clade_data`

# First approach: use the provided function.
# See row 227 of
# https://github.com/albertopessia/Kpax3.jl/blob/master/tutorial/Kpax3_fasta_tutorial.jl
partition_a = initializepartition(settings);

map_solution_1 = kpax3ga(clade_data, partition_a, settings);
map_solution_2 = kpax3ga(clade_data, partition_a, settings);
map_solution_3 = kpax3ga(clade_data, partition_a, settings);

# If they all converged to the same solution, it is already a good hint that it
# might be the true solution.

# Second approach: build a hierarchical clustering and cut the tree at
# different heights. We need the packages Distances and Clustering:
# install them with `]add Distances Clustering`.
using Distances, Clustering;

# Our data was converted to binary format, so we use the Hamming distance
D = pairwise(Hamming(), clade_data.data, dims = 2);
H = hclust(D, linkage = :ward);

map_solution_4  = kpax3ga(clade_data, cutree(H, k = 8),  settings);
map_solution_5  = kpax3ga(clade_data, cutree(H, k = 9),  settings);
map_solution_6  = kpax3ga(clade_data, cutree(H, k = 10), settings);
map_solution_7  = kpax3ga(clade_data, cutree(H, k = 11), settings);
map_solution_8  = kpax3ga(clade_data, cutree(H, k = 12), settings);
map_solution_9  = kpax3ga(clade_data, cutree(H, k = 13), settings);
map_solution_10 = kpax3ga(clade_data, cutree(H, k = 14), settings);

# Select the best solution.
# Maybe not the most elegant way, but it does the job
logpp = [map_solution_1.logpp; map_solution_2.logpp; map_solution_3.logpp;
         map_solution_4.logpp; map_solution_5.logpp; map_solution_6.logpp;
         map_solution_7.logpp; map_solution_8.logpp; map_solution_9.logpp;
         map_solution_10.logpp];
i = findmax(logpp)[2];
map_solution_best = if i == 1
  map_solution_1
elseif i == 2
  map_solution_2
elseif i == 3
  map_solution_3
elseif i == 4
  map_solution_4
elseif i == 5
  map_solution_5
elseif i == 6
  map_solution_6
elseif i == 7
  map_solution_7
elseif i == 8
  map_solution_8
elseif i == 9
  map_solution_9
elseif i == 10
  map_solution_10
end;

writeresults(clade_data, map_solution_best, "path_to_output_file_without_extension",
             what = 4, verbose = true);

# You can now start the MCMC algorithm from the best MAP solution, either by
# directly using the vector `map_solution_best.R` or by giving the path to the
# partition file we just saved.
```


3. No need to uncomment. Here is an example:
```julia
using Plots;
gr();

# this shows the complete dataset highlighting the clusters and the important amino acids
#
# if the total number of coloured sites is less than the total number of clusters, it means
# that the clustering is not very reliable
#
# set `legend = true` to see the actual amino acids
kpax3plotd(clade_data, map_solution_best, legend = false)

# Load posterior probabilities:
(k, pk) = readposteriork(output_file);
(id, P) = readposteriorP(output_file);
(site, aa, freq, C) = readposteriorC(output_file);

# check that the chain is well mixed. Line should be randomly moving up and down
# without any clear pattern
(entropy_C, avgd_C) = traceC(output_file, maxlag = 50);
kpax3plottrace(entropy_C)

# Posterior distribution of the number of clusters
# The total number of clusters `map_solution_best.k` should be within the histogram range,
# hopefully equal to the most likely `k`.
kpax3plotk(k, pk)

# Posterior distribution of adjacency matrix
# Hopefully your clusters are well represented by black boxes, while everything else is
# almost white
kpax3plotp(map_solution_best.R, P)

# Posterior distribution of features types
# The darker the plot the better. If the protein is very long, this plot might not be
# very informative because most sites will be classified as noise. If this is the case,
# check the csv file as I suggested in the previous reply.
kpax3plotc(site, freq, C)
```

Cheers, Alberto

P.S. I hope there are no typos in my sample code, but if you find any that you cannot fix yourself, just let me know.

albertopessia commented 4 years ago

Since there have been no further updates on this issue, I am closing it. I'm always happy to reopen it if needed.