WGLab / RepeatHMM

a hidden Markov model to infer simple repeats from genome sequences
Other
34 stars 14 forks source link

NameError: global name 'testall' is not defined #33

Open Tine-dk opened 4 years ago

Tine-dk commented 4 years ago

Hi :)

There seems to be a problem with the myBAMhandler.py, as I get the following error:

Traceback (most recent call last):
  File "repeatHMM.py", line 713, in <module>
    args.func(args)
  File "repeatHMM.py", line 515, in BAMinput
    commonOptions, specifiedOptions)
  File "/RepeatHMM/bin/RepeatHMM_scripts/myBAMhandler.py", line 627, in getRepeatForKnownGene
    if (commonOptions['SplitAndReAlign'] in [0, 2]) or testall:
NameError: global name 'testall' is not defined

If I go with a FastQ instead of BAM I get the following error:

[E::fai_build_core] Different line length in sequence '(null)'
Could not load fai index of /RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa

Is there something I'm doing wrong? Sorry if it is a trivial problem, I am yet not too experienced with programming. ............................................................................................................................................................

This is what I use as options and I run this from the bin: python repeatHMM.py BAMinput --Onebamfile G87-GM05538.bam --hg hg19 --hgfile RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa --repeatName HTT;

And the following options have been accepted by the program:

The following options are used (included default):
       BWAMEMOptions    ( -k8 -W8 -r7 );
             CompRep    (0);
           MatchInfo    ([3, -2, -2, -15, -1]);
              MaxRep    (4000);
              MinSup    (5);
         Patternfile    (None);
          RepeatTime    (5);
             SeqTech    (None);
     SplitAndReAlign    (1);
          TRFOptions    (2_7_4_80_10_100);
   Tolerate_mismatch    (None);
   UserDefinedUniqID    (None);
               align    (align/);
           emissionm    (None);
                  hg    (hg19);
              hgfile    (/RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa);
        hmm_del_rate    (0.02);
     hmm_insert_rate    (0.12);
        hmm_sub_rate    (0.02);
     isGapCorrection    (1);
       minRepBWTSize    (70);
         minTailSize    (70);
              outlog    (2);
   repeatFlankLength    (30);
          repeatName    (HTT);
 specifiedRepeatInfo    (///////);
      stsBasedFolder    (reference_sts/);
         transitionm    (None);

          Onebamfile    (G87-GM05538.bam);
      SepbamfileTemp    (None);
               align    (align/);
    analysis_file_id    (_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_I0.120_D0.020_S0.020);
             bamfile    (G87-GM05538.bam);
      unique_file_id    (.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_I0.120_D0.020_S0.020);
liuqianhn commented 4 years ago

@Tine-dk

  1. testall error: my fault for last update. I have corrected it. Thanks for pointing it out.
  2. --hgfile RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa is not correct. --hgfile needs your reference genome. pa file can be specified by --Patternfile if needed.
Tine-dk commented 4 years ago

Thank you so much for your help. It is very appreciated.

Is this understood correctly: If I want to use the predefined HTT in the .pa file, I need to set it as --Patternfile /RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa?

I have now done the above and given my own reference genome. It solves the initial error, but now I get an error saying that the region specifies an unknown reference name. I can't seem to figure out why this happens. Looking away from this error, can the result be used (given below)?

If I interpret the results correctly it says that the allele distribution of the HTT repeat expansion is 23/23 and then there is one read with 14 repeats, [..], seven reads with 22 repeats etc. But what does the following mean: [,', 32, 8]?

How is the estimated allele size calculated? I would expect the allele distribution to be around 22/101 for this sample, so it would make sense if one of the alleles was given 23 as repeat expansion count but not both.

The input:

python repeatHMM.py BAMinput --Onebamfile G87-GM05538.bam --hg hg19 \
--hgfile  /resources/b37/human_g1k_v37_decoy.fasta \
--Patternfile /RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa  --repeatName HTT;

The error message:

[main_samview] region "chr4:3075604-3077666" specifies an unknown reference name. Continue anyway.
[bwt_restore_sa] SA-BWT inconsistency: primary is not the same. Abort!
[main_samview] region "chr4:3075604-3077666" specifies an unknown reference name. Continue anyway.
[main_samview] region "4:3075604-3077666" specifies an unknown reference name. Continue anyway.
[W::fai_fetch] Reference chr4:3076574-3076603 not found in FASTA file, returning empty sequence
Failed to fetch sequence in chr4:3076574-3076603
[W::fai_fetch] Reference chr4:3076604-3076666 not found in FASTA file, returning empty sequence
Failed to fetch sequence in chr4:3076604-3076666
[W::fai_fetch] Reference chr4:3076667-3076696 not found in FASTA file, returning empty sequence
Failed to fetch sequence in chr4:3076667-3076696
p2sp end---running time7 mem58

      p2sp ['htt', 21.0, [23, 23], 'allocr:14:1, 15:1, 17:1, 19:1, 20:1, 21:1, 22:7, 23:6, 24:6, 25:1, 99:1, 168:1, 170:1, 174:1, 175:1, 183:1,', 32, 8]

        p2sp
HTT      23 23;

('The result is in', 'logbam/RepBAM_HTT.gmm_GapCorrection1_FlankLength30_SplitAndReAlign1_2_7_4_80_10_100_hg19_comp_I0.120_D0.020_S0.020.log')

Cheers :)

liuqianhn commented 4 years ago

Hi @Tine-dk

  1. You do not have to use --Patternfile /RepeatHMM/bin/reference_sts/hg19/hg19.predefined.pa in your command line if you use the default. --Patternfile can be used to specify your own defined patterns in pa format or in bed format.
  2. The error output by samtools should be fine if you have the final results. The errors here happen is due to that: repeathmm do not know your bam or reference fa use chr1, chr2... or 1, 2 for your reference genome. RepeatHMM thus will try one possibility and then try the other possibility if nothing is obtained in the first possibility. You can check your bam or reference to double confirm this.
  3. You can interpret the final output as a dictionary with the detected repeat counts as the key and the number of supporting reads as the values(Please refer to result interpretation here). The first of the last two numbers are the total supporting reads, and they do not affect your final conclusion.

According to your output, there are more reads to support that there is a repeat region with 23 CAG repeats. The second large repeat counts in your data are not well supported, since there is 1 read to support several repeat regions with 100+ CAG repeats. It is hard to give the conclusion of large repeat counts with less supporting reads, and thus, you might need to check the 6 reads with >100+ repeat counts to see whether the allele with large repeats has enough sequenced reads. Please let me know if anything is not clear.

Tine-dk commented 4 years ago

Hi

Thank you so much for making things clear. Just one more thing. The last number in the [,',38,**8**]. What does that cover? Reads where the count estimation hasn't been possible?

liuqianhn commented 3 years ago

@Tine-dk Sorry to miss your last comments. The last two numbers are the total reads aligned against the region of interest and reads which failed to detect any repeats.