SkyYunyun / SubDiv

A pipeline for subgenome dividing within polyploid genomes

Kmeans problem #1

Open Niohuruzh opened 1 year ago

Niohuruzh commented 1 year ago

Hi, I ran into a problem when running the R script:

$ Rscript ./bin/clustering_chrs.R distictive_kmer_and_counts cluster_center.pdf dendrogram.pdf

Loading required package: ggplot2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Error in kmeans(tscalerawdata, 2, nstart = 30) : 
  more cluster centers than distinct data points.
Execution halted

I checked my distictive_kmer_and_counts file, and it is almost empty. However, when I change the parameter "the lowest different times of subgenome-specific repeat K-mer counts within each homoeologous chromosome pair" from 2 to 1, the R script runs without error. My organism is diploid. Is it OK to change this parameter from 2 to 1?

Looking forward to your reply.
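
For context on the error message: R's `kmeans` stops with "more cluster centers than distinct data points" when the input matrix has fewer distinct rows than the number of centers requested (2 in the call above). The following is a minimal Python sketch of that precondition, not code from SubDiv. The matrices are made-up examples, and how `clustering_chrs.R` builds `tscalerawdata` is not shown in this thread.

```python
def count_distinct_rows(rows):
    """Return the number of distinct data points (rows) in a matrix."""
    return len({tuple(row) for row in rows})


def can_cluster(rows, k):
    """True if there are at least k distinct rows, which k-means needs."""
    return count_distinct_rows(rows) >= k


# Hypothetical k-mer count matrices: one row per chromosome, one column per k-mer.
nearly_empty = [[0, 0], [0, 0], [0, 0], [0, 0]]  # every chromosome looks identical
informative = [[9, 1], [8, 2], [1, 9], [2, 8]]   # profiles differ between chromosomes

print(can_cluster(nearly_empty, 2))  # only one distinct row, so k = 2 is impossible
print(can_cluster(informative, 2))   # four distinct rows, so k = 2 is possible
```

An almost empty k-mer file is one plausible way to end up in the first situation, because it leaves too few distinct chromosome profiles to form two clusters.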

SkyYunyun commented 1 year ago

Hi Niohuruzh,

Thank you for using my script. However, it may not be suitable for your purpose. The script was designed specifically to divide allopolyploid genomes into subgenomes, and it works best with the African clawed frog as input. It can separate the two subgenomes from each other based on the hypothesis that allopolyploid genomes originate from hybridization. As a result, the ancestors of the subgenomes should differ in repeat counts and composition, which makes it possible to use repeat k-mers for subgenome division.

Unfortunately, this means that the script may not be appropriate for dividing a diploid genome, because it cannot find distinguishing repeat k-mers within each set of chromosomes. This is why your distinctive_kmer_and_counts file is almost empty.

I hope this helps clarify any confusion, and please let me know if you have any further questions.

Best regards,

Yunyun Lv

Niohuruzh commented 1 year ago

Hi, but the pairs_res_file looks fine: it contains two sets of chromosomes, so your script seems to work well. I have uploaded the pairs_res_file and the PDF files produced with the parameter set to 1. Please check them. Thanks. Attached: pairs_res_file.txt, dendrogram.pdf, cluster_center.pdf

SkyYunyun commented 1 year ago

Hi Niohuruzh,

The core result file, distinctive_kmer_and_counts, is produced by selecting repeat k-mers from each pair of homologous groups. The parameter "the lowest different times of subgenome-specific repeat K-mer counts within each homoeologous chromosome pair" is crucial in determining which repeat k-mer sequences can effectively divide allopolyploid genomes into subgenomes. This parameter represents the difference in repeat counts. With a value of 1, the file simply includes k-mers that exist in each homologous pair, which may not provide meaningful results.

For example, if contig1 and contig2 in your pairs_res_file are a pair of homologous chromosomes and you want to divide them into two clusters, a result that places them in the same cluster contradicts your assumption. A better approach for dividing a diploid into its two haploid genomes would be to align it against the paternal and maternal genomes separately and use SNP density for this purpose.

Best regards, Yunyun Lv
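
To illustrate why the threshold matters, here is a small Python sketch. It is not SubDiv's implementation: it assumes the parameter acts as a minimum fold difference (larger count divided by smaller count) between the two chromosomes of a pair, and the k-mer sequences, counts, and function names are all hypothetical.

```python
def fold_difference(count_a, count_b):
    """Ratio of the larger count to the smaller; infinite if only one is zero."""
    low, high = sorted((count_a, count_b))
    if high == 0:
        return 1.0  # absent from both chromosomes: no difference
    if low == 0:
        return float("inf")
    return high / low


def select_distinctive_kmers(pair_counts, min_fold):
    """Keep k-mers whose fold difference within the pair is at least min_fold."""
    return [kmer for kmer, (a, b) in pair_counts.items()
            if fold_difference(a, b) >= min_fold]


# Hypothetical counts of three repeat k-mers on one homoeologous chromosome pair.
pair_counts = {
    "AAACCCGGGTTTAAA": (40, 10),  # 4-fold enriched on the first chromosome
    "ACGTACGTACGTACG": (12, 12),  # equally abundant on both chromosomes
    "TTTGGGCCCAAATTT": (15, 20),  # about 1.3-fold difference
}

print(select_distinctive_kmers(pair_counts, 2))  # keeps only the 4-fold k-mer
print(select_distinctive_kmers(pair_counts, 1))  # keeps all three k-mers
```

Under this assumption, a threshold of 1 admits k-mers that are equally abundant on both chromosomes of a pair, so the selected set no longer carries a subgenome-specific signal.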

Niohuruzh commented 1 year ago

Hi, I now understand that my diploid genome cannot be divided by your script. This diploid strain is probably a new species, and I do not know its parental strains. The only way to detect this is to use methods that do not rely on the parental genomes.

However, in the cluster plot, cont1 is not grouped with cont2.

Best wishes

SkyYunyun commented 1 year ago

Dear Niohuruzh,

Based on what you have shared, I am curious if there is an effective method for dividing the diploid into two haploids without relying on the parental genome. Unless the two haploids in your diploid genome demonstrate some level of stable genetic divergence, I do not believe your results make sense. This is because individuals in a sexual population generally cannot achieve stable genetic divergence on a whole-genome level between sexes.

Best regards, Yunyun Lv

Niohuruzh commented 1 year ago

Hi Yunyun Lv, thanks for your reply. My study organism is a fungus, and its related strain is haploid. I sequenced both of them with PacBio. The synteny analysis shows one-to-two syntenic relationships, and the diploid strain has two mating types. Can I therefore distinguish the subgenomes based on the synteny analysis, which agrees with the pairs_res_file results?

SkyYunyun commented 1 year ago

Hi Niohuruzh,

I have limited knowledge of repeat sequences in fungi. The scripts were designed to divide subgenomes in allopolyploid species, such as certain animals and plants, and their effectiveness on fungal genomes has not been tested yet. Therefore, I cannot guarantee that they will work for your purpose. However, if your diploid is highly heterozygous, the two haploids within it should have diverged from each other, and they may differ in repeat sequence content and composition. In that case, my scripts may work well for dividing the haploids.

I hope this information helps you.

Best regards, Yunyun Lv
