WGLab / RepeatHMM

a hidden Markov model to infer simple repeats from genome sequences
Other
34 stars 14 forks source link

not enough evidence to support the estimation #21

Closed wenluo711 closed 5 years ago

wenluo711 commented 5 years ago

Hi RepeatHMM team,

We have used targeted pacbio sequencing of ~4kb on pacbio platform for patient samples in order to detect a known tandem repeat deletion with pattern GGAA and possible mixed GAAA or AGAA or GGGA instead of GGAA in some parts. Although the coverage in the 30s in repeat region, we can clearly see the deletion in some of the reads when we view the bam file in IGV. I tried options like --MinSup 3 --CompRep G/AlG/AlG/AlA but still couldn't get any output other than [0,0].

Here is my output file:

INFO: Input BAM file is sample.0.bam INFO: INFO: splitBAM1= samtools view sample.0.bam chr6:6834479-6840608>align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam INFO: patset ['AAAA', 'AAAA', 'AAAA', 'AAAA', 'TTTT', 'TTTT', 'TTTT', 'TTTT', 'GAAA', 'AAAG', 'AAGA', 'AGAA', 'TTTC', 'CTTT', 'TCTT', 'TTCT', 'AGAA', 'GAAA', 'AAAG', 'AAGA', 'TTCT', 'TTTC', 'CTTT', 'TCTT', 'GGAA', 'GAAG', 'AAGG', 'AGGA', 'TTCC', 'CTTC', 'CCTT', 'TCCT', 'AAGA', 'AGAA', 'GAAA', 'AAAG', 'TCTT', 'TTCT', 'TTTC', 'CTTT', 'GAGA', 'AGAG', 'GAGA', 'AGAG', 'TCTC', 'CTCT', 'TCTC', 'CTCT', 'AGGA', 'GGAA', 'GAAG', 'AAGG', 'TCCT', 'TTCC', 'CTTC', 'CCTT', 'GGGA', 'GGAG', 'GAGG', 'AGGG', 'TCCC', 'CTCC', 'CCTC', 'CCCT'] INFO: trf align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa 2 7 4 80 10 100 8 -h -ngs > align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa_2_7_4_80_10_100_8.res INFO: splitfn align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8.fa INFO: spfnbam align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8_sorted.bam INFO: no_repeat_id_list 416 INFO: short_repeat_id_list 0 INFO: repeat_id_list 25 INFO: reAlign bwa mem -k8 -W8 -r7 -A1 -B1 -O1 -E1 -L0 -t 1 -v 2 hg19_canonical_correct_chr_order.fa align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8.fa | samtools view -S -b | samtools sort > align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8_sorted.bam INFO: expRegionInLongRead= 0 25 INFO: nonRepeatinLongRead= 0 268 INFO: tol_info=['AAA:GAA:0:2;AAA:GGA:0:2;AAA:GAG:0:2;AAA:GGG:0:2;AAAA:AGAA:-1:2;AAAA:GGAA:-1:2;AAAA:AGGA:-1:2;AAAA:GGGA:-1:2;AAAA:AAGA:-2:1;AAAA:GAGA:-2:1;AAAA:AGGA:-2:1;AAAA:GGGA:-2:1', 167, 12] INFO: allocr: INFO: rm split_norep_fn rm align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8_split_norep.fa INFO: rm spfnbam_bai rm align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8_sorted.bam.bai INFO: rm spfnbam rm align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8_sorted.bam INFO: rm splitfn rm align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa2_7_4_80_10_100_8.fa INFO: rm samfile rm align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam INFO: rm trf_resfn rm align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa_2_7_4_80_10_100_8.res INFO: rm fafile rm align//sample.0.bam.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.sam.fa INFO: rm spfnbam_sam_interest rm align/rn7sl554p.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_compRep_Pacbio_I0.120_D0.020_S0.020.alignment.sam INFO: INFO: p2sp ['rn7sl554p', 1032.5, [0, 0], 'allocr:', 0, 293] INFO: INFO: p2sp INFO: RN7SL554P 0 0; INFO:

Is there any customization that would help in my case?

Thanks, Wen

liuqianhn commented 5 years ago

Hi @wenluo711 , I have simply checked "chr6:6,835,479-6,839,608" which is the input repeat region in the reference genome. It is ~4kb. But I did not see the repeat from UCSC genome browser (hg38). May I know where is the repeat region in reference genome?

wenluo711 commented 5 years ago

Hi @liuqianhn , Thanks for notifying me that. The repeat I'm trying to find is https://genome.ucsc.edu/cgi-bin/hgc?hgsid=727407937_NiP6MIBBAlic6yDeCOakP6PMy6AK&c=chr6&l=6836788&r=6837688&o=6837193&t=6837283&g=simpleRepeat&i=trf

After adjust the region to the UCSC repeat region, I'm able to get a non [0,0] result.

Here is my result p2sp ['rn7sl554p', 22.5, [18, 36], 'allocr:3:1, 5:1, 6:4, 9:2, 11:3, 12:7, 13:12, 14:8, 15:3, 16:20, 17:23, 18:28, 19:28, 20:9, 21:10, 22:2, 23:2, 25:1, 27:1, 28:1, 29:2, 30:1, 33:3, 34:1, 35:4, 36:13, 37:6, 38:5, 39:3, 40:2, 41:1, 43:1, 44:1, 46:1, 52:1,', 211, 43]

I get it from the help that [XX, YY] is the estimated allele size. And allocr: is the dictionary of the estimation repeat count. In 'X:Y', 'X' is the estimation repeat count, and 'Y' is the number of supporting reads.

So in my result, does that mean one allele is 18 copies of repeat and the other is 36 copies? What does the last two digit, 211, 43 mean?

I got the repeat list as following, can I concludes one haplotype depth is 202 and the other is 57? INFO: no_repeat_id_list 202 INFO: repeat_id_list 57

Also how can I determine if the repeat is on + strand or - strand? I set in my predifined pattern "+", but I'm not sure what strand is the the repeat actually on. From IGV view of the bam, we can see some deletion in some of the reads.

Thanks, Wen

wenluo711 commented 5 years ago

Hi @liuqianhn , I sort of figured out my last question, the last 2 number in the p2sp seems to be the sum of all the allocr bins. Now there is another things that puzzled us. When we used the chr6:6837194-6837283 region to call the repeat number, some samples will be able to call properly, but one which we can see a clear deletion of ~100bp in some of the reads in IGV. Then we extended out repeat region to chr6:6837194-6837385 where the clear deletion ends. we will be able to get the following result that perfectly catches what we see: p2sp ['rn7sl554p', 48.0, [13, 49], 'allocr:6:2, 7:1, 9:1, 12:2, 13:4, 15:2, 16:1, 17:2, 29:1, 44:1, 47:3, 48:1, 49:3, 50:2, 51:3, 53:1, 56:1, 61:1, 63:1,', 33, 6]

When we extended even more to the next visible deletion chr6:6837194-6837406, we got same repeat copy on the 2 alleles. p2sp ['rn7sl554p', 53.25, [53, 53], 'allocr:19:1, 21:3, 22:2, 23:1, 33:1, 38:1, 50:1, 51:3, 52:2, 53:6, 54:2, 55:2, 56:1, 57:2, 59:1, 60:1, 65:1, 66:1, 73:1,', 33, 6]

It there any suggestions in terms of the pattern region in our case? or should we just stick to the UCSC defined region even though indels out range it?

Thanks, Wen

liuqianhn commented 5 years ago

Hi @wenluo711 , I would recommend using repeat regions in well-defined format. In your case, I will suggest to use the region in UCSC defined region with smaller adjustment (smaller adjustment means the difference between an adjusted position and UCSC defined position is less than 1.5 times the length of repeat unit (in your case, repeat unit is 4).

In particular, in your case, there are many similar repeats at the downstream of the region of interest, which might affect the alignment and then affect the RepeatHMM results. Thus, you might need to pay more attention to the aligned reads with this regions since inserted repeats might not be well defined in the alignment.

Sorry to miss your previous questions.

So in my result, does that mean one allele is 18 copies of repeat and the other is 36 copies? What does the last two digit, 211, 43 mean?

Yes, the two alleles in this example is 18 copies and 36 copies. The last two digits do not help the interpretation of the alleles here.

I got the repeat list as following, can I concludes one haplotype depth is 202 and the other is 57?
INFO: no_repeat_id_list 202
INFO: repeat_id_list 57

These two numbers are for different purposes, and not related the haplotype depth.

Also how can I determine if the repeat is on + strand or - strand? I set in my predifined pattern "+", but I'm not sure what strand is the the repeat actually on. From IGV view of the bam, we can see some deletion in some of the reads.

In predefined pattern file, + strand means the pattern provided is in the template reference genome, while - strand means the pattern provided is in the reverse strand of the reference genome. For example, if a CAG repeat can be seen in the reference genome, + strand is given in the predefined pattern file. Alternatively, you can also say this repeat is in - strand, but the repeat pattern is CTG rather than CAG.

hygine commented 5 years ago

@liuqianhn what does the 22.5 mean (in the num 2 digist)?

liuqianhn commented 5 years ago

Hi @hygine , this kind of numbers usually just follows the repeat names, which is repeat information in the reference genome. For example, 22.5 means that in the reference genome, there are 22.5 repeats, which is not estimation of repeat count for the query sample.