WGLab / RepeatHMM

a hidden Markov model to infer simple repeats from genome sequences
Other
34 stars 14 forks source link

error [main_samview] #36

Open MeiShu00 opened 3 years ago

MeiShu00 commented 3 years ago

Hi, i've jsut recently installed reapeatHMM following the installation instructions on github. I am currently trying to run repeatHMM in the repeathmmenv virtual environment created during installation and within the RepeatHMM/bin/ folder.

After i entered this command in the terminal:

python repeatHMM.py BAMinput --Onebamfile ../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --repeatName all --SeqTech Nanopore --hg hg38 --hgfile ../../NP-SMP_HMS/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

I got the following error:

[main_samview] region "X:148499606-148501692" specifies an unknown reference name. Continue anyway. ('ERROR None detection (sp)', 'aff2', ['chrX', 148500606, 148500692, 'CCG', '+15', '6-35:200+', '', ['aff2', 'chrX', '148500606', '148500692', 'CCG', '+15', '6-35:200+', '']])

Do let me know if any other information is required, thank you !

liuqianhn commented 3 years ago

Hi @MeiShu00 Could you please share the two documents below:

  1. The whole log file.
  2. the outpuf from samtools view../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam chrX:148499606-148501692orsamtools view ../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam X:148499606-148501692, which works.
MeiShu00 commented 3 years ago

Hi,

  1. Regarding the log file, where might i find it ? thanks
  2. "[main_samview] region "X:148499606-148501692" specifies an unknown reference name. Continue anyway." was the output after using samtools view

I have attached the following images for the full output for the command mentioned previously: image image

Thank you

woodoo46 commented 3 years ago

I had similiar error message, I am running on hg38; and it seems the program strips off the chr prefix from the chromosome name?

liuqianhn commented 3 years ago

@MeiShu00 sorry for the late reply.

@MeiShu00 @woodoo46 : I am able to reproduce the issue on my own data, and I fixed the issue. The error above on my data is caused by that there is no read aligned against the region of interest. I believe that your data has the similar situation. Meanwhile, it is not the issue of stripping off the chr prefix; In fact, the issue of 'main_samview' happens because RepeatHMM tries both a chromosome_id with chr and without chr so that RepeatHMM can tolerate the input of chromosome ids with and without chr prefix.

MeiShu00 commented 3 years ago

@liuqianhn

May i know how you managed to solve your issue ? Thanks !

liuqianhn commented 3 years ago

@MeiShu00 It is easy to solve the issue, but it is time-consuming to re-produce the issue. To solve the issue, I only revised the code to let RepeatHMM return None if no reads are aligned against the region of interest.

MeiShu00 commented 3 years ago

@liuqianhn

Thank you for updating the code, i am able to run it now ! However, i ran into another error where i am trying to find dinucleotide repeat by using the following command:

python repeatHMM.py BAMinput --Onebamfile ../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --repeatName D5S1417 --UserDefinedRepeat chr5/69011911/69011945/TG/+/ --SeqTech Nanopore --hgfile ../../GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

The error output was: [E::bwa_idx_load_from_disk] fail to locate the index files [main_samview] region "chr5:69010911-69012945" specifies an unknown reference name. Continue anyway. [main_samview] region "5:69010911-69012945" specifies an unknown reference name. Continue anyway. p2sp ['d5s1417', 17.5, [0, 0], 'allocr:17:2, 19:2, 20:1, 21:4, 22:4, 23:1, 25:1,', 15, 862] p2sp end---running time3 mem76

  p2sp ['d5s1417', 17.5, [0, 0], 'allocr:17:2, 19:2, 20:1, 21:4, 22:4, 23:1, 25:1,', 15, 862]

p2sp

D5S1417 0 0;

I have tried changing the strand direction but still get the same error.

I have also tried: samtools view gu_ccpp_mm2.sorted.bam chr5:69011911-69011945 The output indicates that there are reads aligned to that region.

liuqianhn commented 3 years ago

@MeiShu00 Thank you for your update. The error messages of bwa and samview do not suggest that your running failed. In fact, RepeatHMM generates the estimation of repeat counts for D5S1417.

However, since the maximum supporting after allocr is 4 for repeat counts 21 and 22, the Gaussian mixture distribution cannot output meaningful estimation and thus only 0 0 is output. With manual checking, you can find that the peaks are at 21 and 22, and thus, the estimation of repeat counts for D5S1417 is 21 and 22. I will check why there is error messages.

MeiShu00 commented 3 years ago

@liuqianhn Thank you for your prompt response! Could i check what the values ,15,862] represent in p2sp?

liuqianhn commented 3 years ago

@MeiShu00 In your running, 15 is the number of long reads which are used to estimate repeat counts, while 862 is the number of long reads which RepeatHMM failed to use.

MeiShu00 commented 3 years ago

@liuqianhn

I recently ran into another error when i try to use the --Patternfile option:

This is my command line : python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38.predefined.pa --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

The terminal output throws an error of 'None gene/repeat information is given.'. I have not edited the hg38.predefined.pa.

I have also tried wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.trf.bed.gz and used the unzipped hg38.trf.bed as the --Patternfile but it throws the same error as well.

liuqianhn commented 3 years ago

@MeiShu00 Could you please share the log file for me to check? If you want to run for repeats from "--Patternfile", you might need to specify the repeat name by --repeatName. You can use a specific name for --repeatName (for example, 'HTT') or use --repeatName all (available for bam infput).

MeiShu00 commented 3 years ago

@liuqianhn

I re-ran the command: python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38.predefined.pa --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --repeatName all and it worked.

I have another 2 more questions:

1 . Regarding the usage of hg38.trf.bed file, when i tried to run python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38.trf.bed --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --repeatName all the outputs were for the STR in the hg38.predefined.pa.

These are a few lines from hg38.trf.bed file: image

  1. i had different outputs when using --Patternfile(customized) and --UserDefinedRepeat :

I tried adding in regions of repeats which are dinucleotide repeats into the hg38.predefined.pa and the output of python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38_hms.pa --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --outFolder hg_38_hms_pa_used_log/ --repeatName all was: image

which was different from when i ran python repeatHMM.py BAMinput --Onebamfile .../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --repeatName D5S1417 --UserDefinedRepeat chr5/69011911/69011945/TG/+/ --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna and received this output: image

Thanks for your prompt reply!

liuqianhn commented 3 years ago

@MeiShu00 Sorry for the late reply because of my busy schedule last week. For 1, how many lines do you have in hg38.trf.bed, and how long does the program run? If you have the whole hg38.trf.bed, it will take much longer time. For 2, do you mind sharing the revised hg38.predefined.pa after you added new repeats? --Patternfile(customized) and --UserDefinedRepeat actually use the same processing.

MeiShu00 commented 3 years ago

@liuqianhn

I have attached the hg38.trf.bed, hg38_hms.pa and logfile for when using hg38.trf.bed and hg38_hms.pa

for your reference. For 1, there are 432,597 lines and it finished within 1 minute. The output of this command only had the STR as seen in the hg.predefined.pa

For 2, After a few trials and error i realised that i can only run the script if i had a minimum of 10 str regions from the hg38.predefined.pa file.
to send.zip

Thank you for your help !