Open MeiShu00 opened 3 years ago
Hi @MeiShu00 Could you please share the two documents below:
samtools view
../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam chrX:148499606-148501692or
samtools view ../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam X:148499606-148501692
, which works.Hi,
I have attached the following images for the full output for the command mentioned previously:
Thank you
I had similiar error message, I am running on hg38; and it seems the program strips off the chr prefix from the chromosome name?
@MeiShu00 sorry for the late reply.
@MeiShu00 @woodoo46 : I am able to reproduce the issue on my own data, and I fixed the issue. The error above on my data is caused by that there is no read aligned against the region of interest. I believe that your data has the similar situation. Meanwhile, it is not the issue of stripping off the chr prefix; In fact, the issue of 'main_samview' happens because RepeatHMM tries both a chromosome_id with chr and without chr so that RepeatHMM can tolerate the input of chromosome ids with and without chr prefix.
@liuqianhn
May i know how you managed to solve your issue ? Thanks !
@MeiShu00 It is easy to solve the issue, but it is time-consuming to re-produce the issue. To solve the issue, I only revised the code to let RepeatHMM return None if no reads are aligned against the region of interest.
@liuqianhn
Thank you for updating the code, i am able to run it now ! However, i ran into another error where i am trying to find dinucleotide repeat by using the following command:
python repeatHMM.py BAMinput --Onebamfile ../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --repeatName D5S1417 --UserDefinedRepeat chr5/69011911/69011945/TG/+/ --SeqTech Nanopore --hgfile ../../GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
The error output was: [E::bwa_idx_load_from_disk] fail to locate the index files [main_samview] region "chr5:69010911-69012945" specifies an unknown reference name. Continue anyway. [main_samview] region "5:69010911-69012945" specifies an unknown reference name. Continue anyway. p2sp ['d5s1417', 17.5, [0, 0], 'allocr:17:2, 19:2, 20:1, 21:4, 22:4, 23:1, 25:1,', 15, 862] p2sp end---running time3 mem76
p2sp ['d5s1417', 17.5, [0, 0], 'allocr:17:2, 19:2, 20:1, 21:4, 22:4, 23:1, 25:1,', 15, 862]
p2sp
D5S1417 0 0;
I have tried changing the strand direction but still get the same error.
I have also tried: samtools view gu_ccpp_mm2.sorted.bam chr5:69011911-69011945 The output indicates that there are reads aligned to that region.
@MeiShu00 Thank you for your update. The error messages of bwa and samview do not suggest that your running failed. In fact, RepeatHMM generates the estimation of repeat counts for D5S1417.
However, since the maximum supporting after allocr
is 4 for repeat counts 21 and 22, the Gaussian mixture distribution cannot output meaningful estimation and thus only 0 0
is output. With manual checking, you can find that the peaks are at 21 and 22, and thus, the estimation of repeat counts for D5S1417 is 21 and 22. I will check why there is error messages.
@liuqianhn Thank you for your prompt response! Could i check what the values ,15,862] represent in p2sp?
@MeiShu00 In your running, 15
is the number of long reads which are used to estimate repeat counts, while 862
is the number of long reads which RepeatHMM failed to use.
@liuqianhn
I recently ran into another error when i try to use the --Patternfile option:
This is my command line : python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38.predefined.pa --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
The terminal output throws an error of 'None gene/repeat information is given.'. I have not edited the hg38.predefined.pa.
I have also tried wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.trf.bed.gz
and used the unzipped hg38.trf.bed as the --Patternfile but it throws the same error as well.
@MeiShu00 Could you please share the log file for me to check?
If you want to run for repeats from "--Patternfile", you might need to specify the repeat name by --repeatName
. You can use a specific name for --repeatName
(for example, 'HTT') or use --repeatName all
(available for bam infput).
@liuqianhn
I re-ran the command: python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38.predefined.pa --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --repeatName all
and it worked.
I have another 2 more questions:
1 . Regarding the usage of hg38.trf.bed file, when i tried to run python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38.trf.bed --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --repeatName all
the outputs were for the STR in the hg38.predefined.pa.
These are a few lines from hg38.trf.bed file:
I tried adding in regions of repeats which are dinucleotide repeats into the hg38.predefined.pa and the output of python repeatHMM.py BAMinput --Onebamfile ../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --Patternfile reference_sts/hg38/hg38_hms.pa --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --outFolder hg_38_hms_pa_used_log/ --repeatName all
was:
which was different from when i ran python repeatHMM.py BAMinput --Onebamfile .../../fastq_guppy/using-trimmed-reads_used__ccpp_fastq/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --repeatName D5S1417 --UserDefinedRepeat chr5/69011911/69011945/TG/+/ --SeqTech Nanopore --hgfile ../../repeatHMM_ref_indexed/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
and received this output:
Thanks for your prompt reply!
@MeiShu00 Sorry for the late reply because of my busy schedule last week. For 1, how many lines do you have in hg38.trf.bed, and how long does the program run? If you have the whole hg38.trf.bed, it will take much longer time. For 2, do you mind sharing the revised hg38.predefined.pa after you added new repeats? --Patternfile(customized) and --UserDefinedRepeat actually use the same processing.
@liuqianhn
I have attached the hg38.trf.bed, hg38_hms.pa and logfile for when using hg38.trf.bed and hg38_hms.pa
for your reference. For 1, there are 432,597 lines and it finished within 1 minute. The output of this command only had the STR as seen in the hg.predefined.pa
For 2, After a few trials and error i realised that i can only run the script if i had a minimum of 10 str regions from the hg38.predefined.pa file.
to send.zip
Thank you for your help !
Hi, i've jsut recently installed reapeatHMM following the installation instructions on github. I am currently trying to run repeatHMM in the repeathmmenv virtual environment created during installation and within the RepeatHMM/bin/ folder.
After i entered this command in the terminal:
python repeatHMM.py BAMinput --Onebamfile ../../NP-SMP_HMS/fastq_guppy/minimap2_alignment/porechop_passed/no_polish/gu_ccpp_mm2.sorted.bam --repeatName all --SeqTech Nanopore --hg hg38 --hgfile ../../NP-SMP_HMS/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
I got the following error:
[main_samview] region "X:148499606-148501692" specifies an unknown reference name. Continue anyway. ('ERROR None detection (sp)', 'aff2', ['chrX', 148500606, 148500692, 'CCG', '+15', '6-35:200+', '', ['aff2', 'chrX', '148500606', '148500692', 'CCG', '+15', '6-35:200+', '']])
Do let me know if any other information is required, thank you !