BioAlgs / MetaGen

BSD 2-Clause "Simplified" License

Problem with the number of clusters estimation #3

Open QuentinLetourneur opened 6 years ago

QuentinLetourneur commented 6 years ago

I ran MetaGen with default parameters on a simulated dataset containing 40 bacterial species but the optimal number of clusters found was very low (5).

To rule out something like a local minimum at 5, I raised the bic_min option to 22 and bic_step to 5, but the reported optimal number of clusters was still 5.

Here is the MetaGen output:

Initializing...
Initialization finished.
Selecting the number of clusters ...
The searching range for the number of clusters is from 22 to 49 with step size 5
Running time for selecting number of clusters 19.17818
The optimal number of cluster is 5
number of iterations= 9
The BIC score for 5 clusters finished
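For readers following along, the oddity in this log can be made concrete with a small sketch. It assumes the search grid is built as evenly spaced values from the lower bound to the upper bound, inclusive; that is a guess about how the range is enumerated, not something confirmed for MetaGen. The function and the `bic_max` argument name are hypothetical and used only for illustration.

```python
def candidate_cluster_counts(bic_min: int, bic_max: int, bic_step: int) -> list[int]:
    """Evenly spaced candidate cluster counts from bic_min up to bic_max (inclusive).

    Hypothetical helper: the argument names mirror the options discussed in this
    thread, but the enumeration rule is an assumption, not MetaGen's actual code.
    """
    if bic_step <= 0:
        raise ValueError("bic_step must be positive")
    return list(range(bic_min, bic_max + 1, bic_step))


# Range and step taken from the log: "from 22 to 49 with step size 5"
grid = candidate_cluster_counts(22, 49, 5)
print(grid)       # [22, 27, 32, 37, 42, 47]
print(5 in grid)  # False
```

Under that assumption, 5 is not among the candidates, so a reported optimum of 5 cannot be a value picked from the 22 to 49 range.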

I'd like to have your thoughts on the matter.

Thanks in advance,

Quentin

BioAlgs commented 6 years ago

Thank you for your feedback. My simulation setting only has 40 samples with 100 bacterial species; we do not have simulated data with 40 bacterial species.

I re-ran MetaGen on the 120x-40-100sp data, and my output is shown below. Could you specify which data you are using?

[image: Inline image 1]

Have a nice holiday!

Xin Xing

On Wed, Dec 20, 2017 at 12:07 PM, QuentinLetourneur < notifications@github.com> wrote:

-- Xin Xing Department of Statistics University of Georgia

QuentinLetourneur commented 6 years ago

Thanks for your reply,

I can't see the image you sent.

I forgot some important details about the dataset I used: it is composed of 30 samples, each containing 30 bacteria sampled from a list of 40 bacteria. The coverage is at least 50x for all genomes.

It's a simulated dataset that I created from genomes taken from NCBI.

What is confusing is that I assembled this dataset with CLC both with and without the scaffolding step. In the first case MetaGen works fine, but in the second case I get the issue I described. The metrics of these two assemblies aren't that different, so I don't think the assembly itself is the problem.

Quentin
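One way to put numbers on "not that different" when comparing two assemblies is to compute a few standard summary metrics side by side. The sketch below is only an illustration: the thread does not say which metrics were compared, the function names are hypothetical, and the contig lengths are invented rather than taken from this dataset.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def summarize(lengths: list[int]) -> dict[str, int]:
    """Basic assembly summary: contig count, total length, longest contig, N50."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


# Invented contig lengths, for illustration only (not from the dataset above).
with_scaffolding = [900, 700, 400]
without_scaffolding = [500, 400, 400, 300, 200, 200]

for name, lengths in [("with scaffolding", with_scaffolding),
                      ("without scaffolding", without_scaffolding)]:
    print(name, summarize(lengths))
```

In this toy case the total length is identical, while the contig count and N50 differ, which is the kind of gap a single headline metric can hide.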