Open Tine-dk opened 4 years ago
@Tine-dk
testall
error: my fault for last update. I have corrected it. Thanks for pointing it out.--hgfile RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa
is not correct. --hgfile
needs your reference genome. pa
file can be specified by --Patternfile
if needed.Thank you so much for your help. It is very appreciated.
Is this understood correctly:
If I want to use the predefined HTT in the .pa file, I need to set it as --Patternfile /RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa
?
I have now done the above and given my own reference genome. It solves the initial error, but now I get an error saying that the region specifies an unknown reference name. I can't seem to figure out why this happens. Looking away from this error, can the result be used (given below)?
If I interpret the results correctly it says that the allele distribution of the HTT repeat expansion is 23/23 and then there is one read with 14 repeats, [..], seven reads with 22 repeats etc. But what does the following mean: [,', 32, 8]?
How is the estimated allele size calculated? I would expect the allele distribution to be around 22/101 for this sample, so it would make sense if one of the alleles was given 23 as repeat expansion count but not both.
The input:
python repeatHMM.py BAMinput --Onebamfile G87-GM05538.bam --hg hg19 \
--hgfile /resources/b37/human_g1k_v37_decoy.fasta \
--Patternfile /RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa --repeatName HTT;
The error message:
[main_samview] region "chr4:3075604-3077666" specifies an unknown reference name. Continue anyway.
[bwt_restore_sa] SA-BWT inconsistency: primary is not the same. Abort!
[main_samview] region "chr4:3075604-3077666" specifies an unknown reference name. Continue anyway.
[main_samview] region "4:3075604-3077666" specifies an unknown reference name. Continue anyway.
[W::fai_fetch] Reference chr4:3076574-3076603 not found in FASTA file, returning empty sequence
Failed to fetch sequence in chr4:3076574-3076603
[W::fai_fetch] Reference chr4:3076604-3076666 not found in FASTA file, returning empty sequence
Failed to fetch sequence in chr4:3076604-3076666
[W::fai_fetch] Reference chr4:3076667-3076696 not found in FASTA file, returning empty sequence
Failed to fetch sequence in chr4:3076667-3076696
p2sp end---running time7 mem58
p2sp ['htt', 21.0, [23, 23], 'allocr:14:1, 15:1, 17:1, 19:1, 20:1, 21:1, 22:7, 23:6, 24:6, 25:1, 99:1, 168:1, 170:1, 174:1, 175:1, 183:1,', 32, 8]
p2sp
HTT 23 23;
('The result is in', 'logbam/RepBAM_HTT.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_I0.120_D0.020_S0.020.log')
Cheers :)
Hi @Tine-dk
--Patternfile /RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa
in your command line if you use the default. --Patternfile
can be used to specify your own defined patterns in pa format or in bed format.chr1
, chr2
... or 1
, 2
for your reference genome. RepeatHMM thus will try one possibility and then try the other possibility if nothing is obtained in the first possibility. You can check your bam or reference to double confirm this.According to your output, there are more reads to support that there is a repeat region with 23 CAG repeats. The second large repeat counts in your data are not well supported, since there is 1 read to support several repeat regions with 100+ CAG repeats. It is hard to give the conclusion of large repeat counts with less supporting reads, and thus, you might need to check the 6 reads with >100+ repeat counts to see whether the allele with large repeats has enough sequenced reads. Please let me know if anything is not clear.
Hi
Thank you so much for making things clear.
Just one more thing. The last number in the [,',38,**8**]
. What does that cover? Reads where the count estimation hasn't been possible?
@Tine-dk Sorry to miss your last comments. The last two numbers are the total reads aligned against the region of interest and reads which failed to detect any repeats.
Hi :)
There seems to be a problem with the myBAMhandler.py, as I get the following error:
If I go with a FastQ instead of BAM I get the following error:
Is there something I'm doing wrong? Sorry if it is a trivial problem, I am yet not too experienced with programming. ............................................................................................................................................................
This is what I use as options and I run this from the bin:
python repeatHMM.py BAMinput --Onebamfile G87-GM05538.bam --hg hg19 --hgfile RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa --repeatName HTT;
And the following options have been accepted by the program: