VGP / vgp-assembly

VGP repository for the genome assembly working group

genomescope evaluation #50

Closed zhoudreames closed 3 years ago

zhoudreames commented 3 years ago

In my project, genome size, heterozygosity, and repeat content were estimated with GenomeScope, but the results vary greatly. To make sure I was running it correctly, I downloaded human HG002 data to estimate the same metrics. I know the human genome size is approximately 3,100,000,000 bp, but I don't know the repeat content of the genome. On your VGP website, I see that your team assembled the human genome. Could you tell me the size or percentage of the human repeat content? Thank you so much!

zhoudreames commented 3 years ago

@Arkarachai could you help me? Thank you!

Arkarachai commented 3 years ago

I think this is more of a GenomeScope question than a question about our pipeline. Could you contact the developer? In our experience, if you use the k-mer size we recommend, the results are relatively stable.

Arkarachai commented 3 years ago

More importantly, this tool just gives you a guideline for setting the parameters of other tools, and those parameter settings are not very sensitive. For example, in Falcon, it just helps you estimate the amount of data for error correction; Falcon doesn't generate a genome size based on what you told it. For the steps where we use Meryl, we just need to know the approximate range to choose a k-mer size. Usually, mammals and birds don't vary much in genome size. I don't know what species you are trying to assemble; there might be established guidelines for it too.
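
As a rough illustration of why only the approximate genome size matters when choosing a k-mer size, here is a minimal sketch. It assumes the heuristic k = log4(G * (1 - p) / p), where G is the genome size and p is a tolerable chance-collision rate; to my knowledge this is the heuristic behind Merqury's `best_k.sh`, but treat that as an assumption. The function name `smallest_k`, the default p = 0.001, and the ~2.5 Gb pig genome size are illustrative choices that do not come from this thread.

```python
import math

def smallest_k(genome_size, collision_rate=0.001):
    """Return the (fractional) k at which a random k-mer is expected to
    collide by chance at roughly `collision_rate` in a genome of this size.
    Assumed heuristic: k = log4(G * (1 - p) / p)."""
    return math.log(genome_size * (1 - collision_rate) / collision_rate, 4)

# Genome sizes are approximate; the pig value is an assumption for illustration.
for name, size in [("human", 3.1e9), ("pig", 2.5e9)]:
    k = smallest_k(size)
    print(f"{name}: k >= {k:.2f}, round up to {math.ceil(k)}")
```

Under these assumptions both genomes round up to k = 21, which is why a size estimate in the right range is enough for this step.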

zhoudreames commented 3 years ago

My species is pig. When I set k to 21, I get a repeat rate of 27%, but when I change k to 31, the repeat rate is 18%. So I don't know which one is correct, or whether both are wrong. The results being so different is what confuses me.

Arkarachai commented 3 years ago

I see. You are more interested in the repeat content estimated by GenomeScope than in how we use GenomeScope in the VGP pipelines (which is to estimate genome size). IMO, it's not odd to get different repeat content from different k-mer sizes. Repeat content is relative: the larger the k-mer, the lower the repeat content. I personally only use it as a relative measure among species, but other people might have different opinions.
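
The effect of k on apparent repeat content can be seen on a toy sequence. The sketch below is not how GenomeScope computes repeat content (it fits a model to the k-mer histogram); it only counts the fraction of k-mer positions whose k-mer occurs more than once. The function `repeated_kmer_fraction`, the 25 bp repeat, and the random flanking sequence are all hypothetical choices for illustration.

```python
import random
from collections import Counter

def repeated_kmer_fraction(seq, k):
    """Fraction of k-mer positions in `seq` whose k-mer occurs more than once."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total

rng = random.Random(0)  # fixed seed so the toy genome is reproducible

def random_dna(n):
    """Random DNA string of length n (toy data, not real sequence)."""
    return "".join(rng.choice("ACGT") for _ in range(n))

# Toy genome: one 25 bp element present in two copies, separated by random sequence.
repeat = random_dna(25)
genome = random_dna(500) + repeat + random_dna(500) + repeat + random_dna(500)

for k in (21, 31):
    print(f"k={k}: repeated fraction = {repeated_kmer_fraction(genome, k):.4f}")
```

A 21-mer fits entirely inside the 25 bp element, so both copies are counted as repeated; a 31-mer must also include flanking sequence, which differs between the two copies unless the flanks happen to match, so the element mostly stops looking like a repeat. Real genomes have repeats of many lengths and divergences, which is why the estimate tends to drop as k grows rather than one k being "correct".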

Arkarachai commented 3 years ago

Just to clarify, by 'stable' I mean stable across different data files of the same species, and stable enough for choosing parameters for pipelines.

zhoudreames commented 3 years ago

Solved. Thank you for your help, @Arkarachai!