WGLab / RepeatHMM

a hidden Markov model to infer simple repeats from genome sequences
Other
34 stars 14 forks source link

Result interpretations! #27

Closed GraceGrace1007 closed 4 years ago

GraceGrace1007 commented 5 years ago

Hi there,

I run the RepeatHMM and got the result like the following:

p2sp ['d9s2157', 10.666666666666666, [0, 0], 'allocr:2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1]

I really confused about the number" 10.666666666666666", so what does it mean? And how can I know the detailed repeat sequence such like 'AAAAATTTC' . And another question is if RepeatHMM can detect the STR on genes which are not included in the gene list? Cause I tried some genes which are not in the list but I failed.

liuqianhn commented 5 years ago

Hi @GraceGrace1007 , sorry for late reply.

  1. The number is calculated according to the input (repeat counts in the reference genome). I will format the output of this number.
  2. RepeatHMM now gives the estimation of repeat counts for the input region of interest with repeat motifs. RepeatHMM does not output repeat sequences here.
  3. RepeatHMM needs the regions of interest and a repeat motif as input, no mater where is the region and what is the repeat motif. You can give any repeat in a region as you are interested in. I do not know your input parameters, and thus have no idea about what you mean 'failed'. It would be great if you can give some detail. Thank you.
GraceGrace1007 commented 5 years ago

Hi liuqianhn,

Thanks for your reply! Now I am trying to interpret all the detailed information in the result line.And I find some questions showed below:

1) as you said the number" 10.666666666666666" shows the repeat counts in the reference genome. There are two lines in my result: in the region X [66765160, 66765228], the result is "p2sp ['ar', 23.0, [0, 0], 'allocr:0:12, 18:2, 21:2, 25:1,', 17, 1]", in order to check the the repeat counts in the reference genome, I used samtools to check the motif in this region on the reference genome like the following:

X:66765160-66765228 CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAA the last one is CAA not CAG, but the counts is 23.0 not 22.5, so how does the RepeatHMM calculates the repeat counts in the reference genome?

2) you showed that [XX, YY] is the estimated allele size, followed by a distribution of all estimated repeat sizes. eg: p2sp ['csf1po', 19.25, [16, 17], 'allocr:0:1, 15:3, 16:5, 17:4, 18:3,', 16, 2] So in the above line, I am still confused about the meaning of [16, 17]? And What is the difference between [0,0] and [16, 17]?

3) in the result line, eg: 'p2sp ['csf1po', 19.25, [16, 17], 'allocr:0:1, 15:3, 16:5, 17:4, 18:3,', 16, 2]' what does the meaning of the last two numbers "16,2" ?

4) for the final result, do I need to do some filtration to remove the results with lower accuracy? Do you have some suggestions?

liuqianhn commented 5 years ago

Hi @GraceGrace1007 ,

  1. RepeatHMM calculates the repeat counts for the reference solely based on users' inputs: the repeat region in the reference genome and the length of repeat motif.
  2. A minimum threshold to support each estimation of repeat count is required, and there is also default value in RepeatHMM. If no estimated repeat count have enough supporting reads (not least than the threshold), [0,0] will be output; otherwise, the estimated counts is provided.
  3. '16' is the total number of reads in the distribution.
  4. Right now, I prefer to use the number of supporting reads for filtering. This is because no strong conclusion can be given if only few reads are detected for the repeat regions. Meanwhile, GMM is used to detected the distribution, and no enough reads in the distribution will also affect the estimation by GMM. Feel free to let me know if anything is not clear here.
GraceGrace1007 commented 5 years ago

Hi liuqianhn,

Sorry about my tiresome questions! I think I got what you said but I really want to double check for that.

I want to use two result line as examples: 1) p2sp ['d10s1248', 13.25, [0, 0], 'allocr:11:2, 12:2, 13:3, 14:2, 15:2, 16:1,', 12, 1] 2) p2sp ['d17s1301', 12.25, [11, 12], 'allocr:0:1, 11:4, 12:10, 13:3, 14:2, 15:1,', 21, 0]

  1. So from your reply, compared with the two lines, if I wan to do filtering, it would be better to keep the lines which have support reads such as the 2nd line with [11, 12] reads on two alleles? I am still confused about the meaning of [X,Y] and your answer about the 2nd question.

  2. In the third question, you said that '16' is the total number of reads in the distribution, what about the following number '2'? And in the above two lines, the last number are 1 and 0, what does that mean?

liuqianhn commented 5 years ago

@GraceGrace1007 , it is ok, and feel free to ask questions when using RepeatHMM.

  1. [X,Y] represents repeat counts for two alleles. The filter of supporting reads is your choice, which might depend on your data: if your data has high coverage (such as >100X), it would be better to have larger threshold, and if your data has low coverage (such as ~10X), it will be better if you can choose smaller threshold of supporting reads. Thus, there is gold-standard to have optimal threshold.
  2. 0/1 indicates the number of incorrect alignment.
GraceGrace1007 commented 5 years ago

Hi liuqianhn,

Thanks for your reply! I am sorry that I cannot put all my questions at once, cause I always find some new questions.

First of all, we talked about the minimum threshold to support each estimation of repeat count and there is also default value in RepeatHMM, so I guess that the parameter ''--MinSup'' with default=5 controls the minimum threshold, Am I right? If it is true, that means 'no estimated repeat count have more than 5 supporting reads, [0,0] should be output', but in my result, I got lines like the following, which with different reads in the distributions, the first one is '3' which is less than 5, another one is '6' which is more than 5 reads. p2sp ['ar', 23.0, [0, 0], 'allocr:0:2, 15:1,', 3, 0] p2sp ['aff2', 29.0, [0, 0], 'allocr:0:5, 23:1,', 6, 0] But both of them have the output of [0,0], so why?

And also, I found that in the final result, some lines started with gene name, like p2sp ['aff2', 29.0, [0, 0], 'allocr:0:5, 23:1,', 6, 0], AFF2 is a gene, but some lines are started with gene locus, like p2sp ['d10s1435', 15.5, [14, 15], 'allocr:10:1, 13:1, 14:2, 15:8, 16:1,', 13, 2], d10s1435 is a gene locus, so why the output has the different resolutions? And I want to transfer the gene locus to genes, Do you have some suggestions to do this?

All the best

liuqianhn commented 5 years ago

@GraceGrace1007 , feel free to post your concerns.

  1. You are right: --MinSup is used to control whether enough evidence is enough to support a repeat count estimation, but this threshold is used for each repeat count estimation not for all reads detected. In your example, 6 reads are found for aff2 , but 5 support 0 repeat counts which is noisy, and only 1 read supports 23 estimation, and 1 read is not enough evidence. ar has the similar situation as aff2
  2. My apologies for the different output of gene name and gene locus. Do not worry about this difference. This is just a name for users' understanding, and the detail of repeat is defined by --UserDefinedRepeat. That is, you can given any name for the different output (gene name or gene locus or any name you want or even aaaaa/bbbb), which does not affect the estimation, since they are something users can understand only.

If you want that output similarly always, you can specify --repeatName for a specific locus, or revise the names in RepeatHMM/bin/reference_sts/hg38/hg38.predefined.pa. Hope it answer your question.

Feel free to ask questions. All the best

GraceGrace1007 commented 5 years ago

Hi liuqianhn,

I want to add some regions which I focused on in the config file RepeatHMM/bin/reference_sts/hg38/hg38.predefined.pa, but in the *.pa file, there are 7 columns, I am not sure the meaning of the last three columns (5.strand, 6.range, 7.others), so could you please tell me how can I prepare for that ?

All the best!

liuqianhn commented 5 years ago

Hi @GraceGrace1007 ,

  1. "5.strand" indicate the pattern is in the reference or in the complementary sequence of reference. + means the repeat motif matches the reference sequence, and - mean it matches the complementary sequence of reference. The number following +/- is not used but for users' understanding.
  2. you can give any number for "6.range" and "7.others", since they are all not used in the analysis now. It might be used in improvement version of RepeatHMM. Hope your questions are answered. Please let me know if anything is not clear.
GraceGrace1007 commented 5 years ago

Hi liuqianhn,

Thanks for your reply!

All the best!

GraceGrace1007 commented 4 years ago

Hi liuqianh,

I tried to add some regions which I focused on. The following lines show the regions I added in the *.pa file,

AADACL2-AS1,3,151548071,151548094,ACAA,6,, AADACL2-AS1,3,151549641,151549655,AAAGA,3,, AADACL2-AS1,3,151552946,151552981,AGAT,9,, AADACL2-AS1,3,151554382,151554396,AAAAT,3,, AADACL2-AS1,3,151556631,151556650,AAAC,5,, AADACL2-AS1,3,151567876,151567890,TTG,5,, AADACL2-AS1,3,151580443,151580460,ACA,6,, AADACL2-AS1,3,151606794,151606813,TTTA,5,, AADACL2-AS1,3,151615223,151615238,AAAT,4,, AADACL2-AS1,3,151627851,151627865,ATTTT,3,,

But in my result, I only got one result on AADACL2-AS1, like the following so why? p2sp ['aadacl2-as1', 3.0, [0, 0], 'allocr:3:9, 9:2,', 11, 0] Also I found something wired that based on our above discussion, there are 9 reads support 3, and in the reference genome, the repeat counts are "3", so [3,3] should be the output instead of [0,0] am I right? I got many lines like this, which [0,0] is the output : eg. p2sp ['abcf1', 4.0, [0, 0], 'allocr:3:3, 4:9, ', 12, 0] p2sp ['abhd12b', 4.0, [0, 0], 'allocr:4:7, 5:1,', 8, 1]

And I really want to know the differences between the reference and my samples, so for me, I think it would be better to focus on the output with [0,0], which means that there are more reads support no estimated repeat counts which are different with the reference genome. But in my result, there are a lot of [0,0] other than [X,X].

All the best!

liuqianhn commented 4 years ago

Hi @GraceGrace1007 , the output is expected, since I set a threshold of minimum repeat count is 5. Any repeat count prediction <5 will be ignored, since they are too small. This setting is at line "176" with "minrepcount = 5;" in "myGaussianMixtureModel.py". You might set minrepcount = 2; in your prediction if you are interested in those very small repeat count.

[0,0] does not mean the existing allele are different from the reference, but means that there is no strong evidence for repeat count inference. If you are interested in the difference of repeat counts in tested samples and the reference, you might need to investigate the predicted repeat counts with the reference.

JinyanFeng0217 commented 4 years ago

Hi there,

I run the RepeatHMM and got the result like the following:

p2sp ['d9s2157', 10.666666666666666, [0, 0], 'allocr:2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1]

I really confused about the number" 10.666666666666666", so what does it mean? And how can I know the detailed repeat sequence such like 'AAAAATTTC' . And another question is if RepeatHMM can detect the STR on genes which are not included in the gene list? Cause I tried some genes which are not in the list but I failed.

Hi @GraceGrace1007 , sorry for late reply.

  1. The number is calculated according to the input (repeat counts in the reference genome). I will format the output of this number.
  2. RepeatHMM now gives the estimation of repeat counts for the input region of interest with repeat motifs. RepeatHMM does not output repeat sequences here.
  3. RepeatHMM needs the regions of interest and a repeat motif as input, no mater where is the region and what is the repeat motif. You can give any repeat in a region as you are interested in. I do not know your input parameters, and thus have no idea about what you mean 'failed'. It would be great if you can give some detail. Thank you.

hi there,

I run the RepeatHMM and got the result like the following:

p2sp ['d9s2157', 10.666666666666666, [0, 0], 'allocr: 2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1];

with so many info as 2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1; how to decide the correct repeat units? I mean whether it is true that the more support reads the more trustworthy of the repeat units?

Thanks for your reply!waiting for your comments~

Have a good day

liuqianhn commented 4 years ago

Hi @JinyanFeng0217 , more supporting reads usually suggest a reliable estimation. According to your output, it seems that the coverage is low, and the number of supporting reads is lower than a threshold for all estimated repeat counts.

If you still want to infer estimated repeat counts from lower coverage data, the repeat estimation from your ouput might be 11 and 15 since the two estimated repeat counts have 3 supporting reads. However, lower coverage might indicate less reliable estimation in your output.

JinyanFeng0217 commented 4 years ago

Hi @JinyanFeng0217 , more supporting reads usually suggest a reliable estimation. According to your output, it seems that the coverage is low, and the number of supporting reads is lower than a threshold for all estimated repeat counts.

If you still want to infer estimated repeat counts from lower coverage data, the repeat estimation from your ouput might be 11 and 15 since the two estimated repeat counts have 3 supporting reads. However, lower coverage might indicate less reliable estimation in your output.

thanks so much. I will check the result later.

liuqianhn commented 4 years ago

Closed due to no recent activities.