Closed GraceGrace1007 closed 4 years ago
Hi @GraceGrace1007 , sorry for late reply.
Hi liuqianhn,
Thanks for your reply! Now I am trying to interpret all the detailed information in the result line.And I find some questions showed below:
1) as you said the number" 10.666666666666666" shows the repeat counts in the reference genome. There are two lines in my result: in the region X [66765160, 66765228], the result is "p2sp ['ar', 23.0, [0, 0], 'allocr:0:12, 18:2, 21:2, 25:1,', 17, 1]", in order to check the the repeat counts in the reference genome, I used samtools to check the motif in this region on the reference genome like the following:
X:66765160-66765228 CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAG CAA the last one is CAA not CAG, but the counts is 23.0 not 22.5, so how does the RepeatHMM calculates the repeat counts in the reference genome?
2) you showed that [XX, YY] is the estimated allele size, followed by a distribution of all estimated repeat sizes. eg: p2sp ['csf1po', 19.25, [16, 17], 'allocr:0:1, 15:3, 16:5, 17:4, 18:3,', 16, 2] So in the above line, I am still confused about the meaning of [16, 17]? And What is the difference between [0,0] and [16, 17]?
3) in the result line, eg: 'p2sp ['csf1po', 19.25, [16, 17], 'allocr:0:1, 15:3, 16:5, 17:4, 18:3,', 16, 2]' what does the meaning of the last two numbers "16,2" ?
4) for the final result, do I need to do some filtration to remove the results with lower accuracy? Do you have some suggestions?
Hi @GraceGrace1007 ,
Hi liuqianhn,
Sorry about my tiresome questions! I think I got what you said but I really want to double check for that.
I want to use two result line as examples: 1) p2sp ['d10s1248', 13.25, [0, 0], 'allocr:11:2, 12:2, 13:3, 14:2, 15:2, 16:1,', 12, 1] 2) p2sp ['d17s1301', 12.25, [11, 12], 'allocr:0:1, 11:4, 12:10, 13:3, 14:2, 15:1,', 21, 0]
So from your reply, compared with the two lines, if I wan to do filtering, it would be better to keep the lines which have support reads such as the 2nd line with [11, 12] reads on two alleles? I am still confused about the meaning of [X,Y] and your answer about the 2nd question.
In the third question, you said that '16' is the total number of reads in the distribution, what about the following number '2'? And in the above two lines, the last number are 1 and 0, what does that mean?
@GraceGrace1007 , it is ok, and feel free to ask questions when using RepeatHMM.
Hi liuqianhn,
Thanks for your reply! I am sorry that I cannot put all my questions at once, cause I always find some new questions.
First of all, we talked about the minimum threshold to support each estimation of repeat count and there is also default value in RepeatHMM, so I guess that the parameter ''--MinSup'' with default=5 controls the minimum threshold, Am I right? If it is true, that means 'no estimated repeat count have more than 5 supporting reads, [0,0] should be output', but in my result, I got lines like the following, which with different reads in the distributions, the first one is '3' which is less than 5, another one is '6' which is more than 5 reads. p2sp ['ar', 23.0, [0, 0], 'allocr:0:2, 15:1,', 3, 0] p2sp ['aff2', 29.0, [0, 0], 'allocr:0:5, 23:1,', 6, 0] But both of them have the output of [0,0], so why?
And also, I found that in the final result, some lines started with gene name, like p2sp ['aff2', 29.0, [0, 0], 'allocr:0:5, 23:1,', 6, 0], AFF2 is a gene, but some lines are started with gene locus, like p2sp ['d10s1435', 15.5, [14, 15], 'allocr:10:1, 13:1, 14:2, 15:8, 16:1,', 13, 2], d10s1435 is a gene locus, so why the output has the different resolutions? And I want to transfer the gene locus to genes, Do you have some suggestions to do this?
All the best
@GraceGrace1007 , feel free to post your concerns.
--MinSup
is used to control whether enough evidence is enough to support a repeat count estimation, but this threshold is used for each repeat count estimation not for all reads detected. In your example, 6 reads are found for aff2
, but 5 support 0
repeat counts which is noisy, and only 1 read supports 23
estimation, and 1 read is not enough evidence. ar
has the similar situation as aff2
gene name
and gene locus
. Do not worry about this difference. This is just a name for users' understanding, and the detail of repeat is defined by --UserDefinedRepeat
. That is, you can given any name for the different output (gene name
or gene locus
or any name you want or even aaaaa
/bbbb
), which does not affect the estimation, since they are something users can understand only. If you want that output similarly always, you can specify --repeatName
for a specific locus, or revise the names in RepeatHMM/bin/reference_sts/hg38/hg38.predefined.pa
. Hope it answer your question.
Feel free to ask questions. All the best
Hi liuqianhn,
I want to add some regions which I focused on in the config file RepeatHMM/bin/reference_sts/hg38/hg38.predefined.pa, but in the *.pa file, there are 7 columns, I am not sure the meaning of the last three columns (5.strand, 6.range, 7.others), so could you please tell me how can I prepare for that ?
All the best!
Hi @GraceGrace1007 ,
+
means the repeat motif matches the reference sequence, and -
mean it matches the complementary sequence of reference. The number following +/-
is not used but for users' understanding.Hi liuqianhn,
Thanks for your reply!
All the best!
Hi liuqianh,
I tried to add some regions which I focused on. The following lines show the regions I added in the *.pa file,
AADACL2-AS1,3,151548071,151548094,ACAA,6,, AADACL2-AS1,3,151549641,151549655,AAAGA,3,, AADACL2-AS1,3,151552946,151552981,AGAT,9,, AADACL2-AS1,3,151554382,151554396,AAAAT,3,, AADACL2-AS1,3,151556631,151556650,AAAC,5,, AADACL2-AS1,3,151567876,151567890,TTG,5,, AADACL2-AS1,3,151580443,151580460,ACA,6,, AADACL2-AS1,3,151606794,151606813,TTTA,5,, AADACL2-AS1,3,151615223,151615238,AAAT,4,, AADACL2-AS1,3,151627851,151627865,ATTTT,3,,
But in my result, I only got one result on AADACL2-AS1, like the following so why? p2sp ['aadacl2-as1', 3.0, [0, 0], 'allocr:3:9, 9:2,', 11, 0] Also I found something wired that based on our above discussion, there are 9 reads support 3, and in the reference genome, the repeat counts are "3", so [3,3] should be the output instead of [0,0] am I right? I got many lines like this, which [0,0] is the output : eg. p2sp ['abcf1', 4.0, [0, 0], 'allocr:3:3, 4:9, ', 12, 0] p2sp ['abhd12b', 4.0, [0, 0], 'allocr:4:7, 5:1,', 8, 1]
And I really want to know the differences between the reference and my samples, so for me, I think it would be better to focus on the output with [0,0], which means that there are more reads support no estimated repeat counts which are different with the reference genome. But in my result, there are a lot of [0,0] other than [X,X].
All the best!
Hi @GraceGrace1007 , the output is expected, since I set a threshold of minimum repeat count is 5. Any repeat count prediction <5 will be ignored, since they are too small. This setting is at line "176" with "minrepcount = 5;" in "myGaussianMixtureModel.py". You might set minrepcount = 2;
in your prediction if you are interested in those very small repeat count.
[0,0]
does not mean the existing allele are different from the reference, but means that there is no strong evidence for repeat count inference. If you are interested in the difference of repeat counts in tested samples and the reference, you might need to investigate the predicted repeat counts with the reference.
Hi there,
I run the RepeatHMM and got the result like the following:
p2sp ['d9s2157', 10.666666666666666, [0, 0], 'allocr:2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1]
I really confused about the number" 10.666666666666666", so what does it mean? And how can I know the detailed repeat sequence such like 'AAAAATTTC' . And another question is if RepeatHMM can detect the STR on genes which are not included in the gene list? Cause I tried some genes which are not in the list but I failed.
Hi @GraceGrace1007 , sorry for late reply.
- The number is calculated according to the input (repeat counts in the reference genome). I will format the output of this number.
- RepeatHMM now gives the estimation of repeat counts for the input region of interest with repeat motifs. RepeatHMM does not output repeat sequences here.
- RepeatHMM needs the regions of interest and a repeat motif as input, no mater where is the region and what is the repeat motif. You can give any repeat in a region as you are interested in. I do not know your input parameters, and thus have no idea about what you mean 'failed'. It would be great if you can give some detail. Thank you.
hi there,
I run the RepeatHMM and got the result like the following:
p2sp ['d9s2157', 10.666666666666666, [0, 0], 'allocr: 2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1];
with so many info as 2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1; how to decide the correct repeat units? I mean whether it is true that the more support reads the more trustworthy of the repeat units?
Thanks for your reply!waiting for your comments~
Have a good day
Hi @JinyanFeng0217 , more supporting reads usually suggest a reliable estimation. According to your output, it seems that the coverage is low, and the number of supporting reads is lower than a threshold for all estimated repeat counts.
If you still want to infer estimated repeat counts from lower coverage data, the repeat estimation from your ouput might be 11 and 15 since the two estimated repeat counts have 3 supporting reads. However, lower coverage might indicate less reliable estimation in your output.
Hi @JinyanFeng0217 , more supporting reads usually suggest a reliable estimation. According to your output, it seems that the coverage is low, and the number of supporting reads is lower than a threshold for all estimated repeat counts.
If you still want to infer estimated repeat counts from lower coverage data, the repeat estimation from your ouput might be 11 and 15 since the two estimated repeat counts have 3 supporting reads. However, lower coverage might indicate less reliable estimation in your output.
thanks so much. I will check the result later.
Closed due to no recent activities.
Hi there,
I run the RepeatHMM and got the result like the following:
p2sp ['d9s2157', 10.666666666666666, [0, 0], 'allocr:2:1, 4:1, 7:1, 8:1, 9:1, 11:3, 12:1, 13:1, 14:1, 15:3, 16:2, 17:1,', 17, 1]
I really confused about the number" 10.666666666666666", so what does it mean? And how can I know the detailed repeat sequence such like 'AAAAATTTC' . And another question is if RepeatHMM can detect the STR on genes which are not included in the gene list? Cause I tried some genes which are not in the list but I failed.