Closed HLHsieh closed 1 year ago
Hello,
Thanks for using NanoRepeat.
I can see that samtools reported this error: region "9:9990-10482" specifies an invalid region or unknown reference.
The chromosome names in your genome.fa may be in the style chr9 instead of 9.
If so, please add the chr prefix to the chromosome names in the chr9_hg38_NanoRepeat.bed file.
Best, Li
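A minimal sketch of that fix, for anyone who hits the same samtools error. It is not part of NanoRepeat: the helper names `fasta_contig_names` and `add_chr_prefix` are hypothetical, and the BED layout (tab-separated chromosome, start, end, repeat unit) is an assumption read off the log messages in this thread. It adds the chr prefix only when the prefixed name is actually present in the reference FASTA.

```python
import io

def fasta_contig_names(fasta_lines):
    """Return the set of sequence names found in FASTA header lines."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])  # name is the first word after '>'
    return names

def add_chr_prefix(bed_lines, contigs):
    """Yield BED lines, adding 'chr' when only the prefixed name is a known contig."""
    for line in bed_lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#"):
            yield stripped  # keep blank lines and comments unchanged
            continue
        fields = stripped.split("\t")
        chrom = fields[0]
        if chrom not in contigs and "chr" + chrom in contigs:
            fields[0] = "chr" + chrom
        yield "\t".join(fields)

# Tiny in-memory stand-ins for genome.fa and the BED file (assumed layout).
fasta = io.StringIO(">chr9 example\nACGT\n")
bed = io.StringIO("9\t10000\t10472\tTAACCC\n")

contigs = fasta_contig_names(fasta)
fixed = list(add_chr_prefix(bed, contigs))
print(fixed)
```

With real files, the two `io.StringIO` objects would be replaced by open file handles, and the fixed lines written to a new BED file.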
Hello, is this issue solved?
Thanks, Li
Hi Li,
Thanks for your suggestion. I added the chr prefix in my chr9_hg38_NanoRepeat.bed file and got the following messages (excerpted here):
[03/21/2023 16:27:48] NOTICE: Input file is: ~/stimulated_test/C9ORF72_c1_simulated_reads.fasta
[03/21/2023 16:27:48] NOTICE: Input type is: fasta
[03/21/2023 16:27:48] NOTICE: Referece fasta file is: ~/Reference/Human/Genome/hg38/genome.fa
[03/21/2023 16:27:48] NOTICE: Output prefix is: ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads
[03/21/2023 16:27:48] NOTICE: Repeat region bed file is: ~/genome_TRF/chr9_hg38_NanoRepeat_chr.bed
[03/21/2023 16:59:57] NOTICE: Reading repeat region file: ~/genome_TRF/chr9_hg38_NanoRepeat_chr.bed
[03/21/2023 16:59:57] NOTICE: Reading reference fasta file: ~/Reference/Human/Genome/hg38/genome.fa
[03/21/2023 17:01:00] NOTICE: Quantifying repeat: chr9-10000-10472-TAACCC
[03/21/2023 17:01:01] NOTICE: Step 1: finding anchor location in reads
[03/21/2023 17:01:01] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 8 -x map-ont ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10000-10472-TAACCC/anchors.fasta ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10000-10472-TAACCC/chr9-10000-10472-TAACCC.fastq > ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10000-10472-TAACCC/anchor_locations.paf 2> /dev/null
[03/21/2023 17:01:01] NOTICE: Step 1 finished
[03/21/2023 17:01:01] NOTICE: Step 2: round 1 and round 2 estimation
[03/21/2023 17:01:01] NOTICE: Step 2 finished
[03/21/2023 17:01:01] NOTICE: Step 3: round 3 estimation
[03/21/2023 17:01:01] NOTICE: Step 3 finished
[03/21/2023 17:01:01] NOTICE: Writing to repeat size file...
[03/21/2023 17:01:01] NOTICE: Step 4: phasing reads using GMM
[03/21/2023 17:01:01] ERROR! No reads were found for repeat region: chr9-10000-10472-TAACCC
[03/21/2023 17:01:01] NOTICE: Quantifying repeat: chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG
[03/21/2023 17:01:01] NOTICE: Step 1: finding anchor location in reads
[03/21/2023 17:01:01] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 8 -x map-ont ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/anchors.fasta ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG.fastq > ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/anchor_locations.paf 2> /dev/null
[03/21/2023 17:01:01] NOTICE: Step 1 finished
[03/21/2023 17:01:02] NOTICE: Step 2: round 1 and round 2 estimation
[03/21/2023 17:01:02] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 8 -x map-ont ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/round1_ref.fasta ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/core_sequences.fastq > ~/hsinlun/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/round1.paf 2> /dev/null
[03/21/2023 17:01:02] NOTICE: Step 2 finished
[03/21/2023 17:01:02] NOTICE: Step 3: round 3 estimation
[03/21/2023 17:01:02] NOTICE: Step 3 finished
[03/21/2023 17:01:02] NOTICE: Writing to repeat size file...
[03/21/2023 17:01:02] NOTICE: Step 4: phasing reads using GMM
[03/21/2023 17:01:05] NOTCIE: Number of alleles=1
[03/21/2023 17:01:06] NOTICE: Writing phasing results...
[03/21/2023 17:01:06] NOTICE: Writing to output fastq files...
[03/21/2023 17:01:06] NOTICE: Writing summary file...
[03/21/2023 17:01:06] NOTICE: Plotting figures...
[03/21/2023 17:01:06] NOTICE: Quantifying repeat: chr9-11165-11194-G
I think this fixed one of my issues, since I got the following output files (excerpt):
C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.hist.png
C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.phased_reads.txt
C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.repeat_size.txt
C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.summary.txt
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.allele1.fastq
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.hist.png
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.phased_reads.txt
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.repeat_size.txt
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.summary.txt
However, in this run I still got the same error message that I mentioned in my previous post:
[03/21/2023 17:04:49] NOTICE: Quantifying repeat: chr9-106460-108380-CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA
Traceback (most recent call last):
  File "~/bin/NanoRepeat/nanoRepeat.py", line 185, in <module>
    main()
  File "~/bin/NanoRepeat/nanoRepeat.py", line 174, in main
    preprocess_fastq(input_args)
  File "~/bin/NanoRepeat/nanoRepeat.py", line 91, in preprocess_fastq
    nanoRepeat_bam.nanoRepeat_bam(input_args, in_bam_file)
  File "~/bin/NanoRepeat/nanoRepeat_bam.py", line 578, in nanoRepeat_bam
    quantify1repeat_from_bam(input_args, in_bam_file, ref_fasta_dict, repeat_region)
  File "~/bin/NanoRepeat/nanoRepeat_bam.py", line 523, in quantify1repeat_from_bam
    os.makedirs(temp_out_dir, exist_ok=True)
  File "/sw/spack/bio/pkgs/gcc-10.3.0/python/3.9.7-rv5ybzg3/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 36] File name too long: '~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-106460-108380CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA'
Is it caused by the length of the repeat unit?
I would appreciate your advice.
Sincerely, Hsin
Hi,
Here is an update.
The tool worked successfully after removing "chr9-106460-108380-CCCTGTGATATTGT..." from my chr9_hg38_NanoRepeat.bed file.
Do you have any comments on it?
Best, Hsin
Hello Hsin,
I can see this error:
OSError: [Errno 36] File name too long: '~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-106460-108380CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA
NanoRepeat inserts the repeat unit into the file name. If the repeat unit is too long, the file name can exceed the maximum length that the file system allows.
I will fix this bug later today.
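Until a fixed version is installed, the affected BED entries can be found ahead of time. The sketch below is an illustration, not NanoRepeat code: `name_max` and `too_long_entries` are hypothetical helpers, the temp-directory naming pattern is an assumption copied from the log lines in this thread, and 255 bytes is used as a fallback because that is the usual per-component limit on common Linux file systems.

```python
import os

def name_max(directory="."):
    """Return the file system's limit for one path component, or 255 as a fallback."""
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, OSError, ValueError):
        return 255  # assumed default; os.pathconf is not available everywhere

def too_long_entries(bed_lines, prefix_basename, limit):
    """Return (chrom, start, end, unit_length) for entries whose temp dir name is too long."""
    flagged = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip lines without a repeat unit column
        chrom, start, end, unit = fields[:4]
        # Naming pattern assumed from the log output shown in this thread.
        component = f"{prefix_basename}.NanoRepeat_temp_dir.{chrom}-{start}-{end}-{unit}"
        if len(component.encode("utf-8")) > limit:
            flagged.append((chrom, start, end, len(unit)))
    return flagged

# Synthetic entries: one short unit and one 320 bp unit.
bed = [
    "chr9\t10000\t10472\tTAACCC\n",
    "chr9\t106460\t108380\t" + "ACGT" * 80 + "\n",
]
flagged = too_long_entries(bed, "C9ORF72_c1_simulated_reads", 255)
print(flagged)
print("limit on this file system:", name_max())
```

Flagged entries can then be removed from the BED file or handled with a version that shortens the name.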
Hi Li,
Thanks for your prompt response. Can I take it that NanoRepeat can handle repeat units longer than 10 base pairs in this case?
Best, Hsin
Yes, I think it should work for repeat units longer than 10 bp. The algorithm itself has no length limit, but I have not tested it on repeat units that are several hundred bp long.
Good to know! I appreciate all your help.
You're welcome.
Hi @fangli80
I am just following up on whether you have fixed the file name issue. Have you considered not inserting the repeat unit into the file name, or letting users specify the name? Would that work?
Many Thanks, Hsin
Hello Hsin,
Sorry for the late update. The file name issue has been fixed in the latest version. Please use the v1.3 release (https://github.com/WGLab/NanoRepeat/releases/tag/v1.3). If the repeat unit is longer than 30 bp, only part of the sequence is inserted into the file name.
Best, Li
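One way such a shortening could look is sketched below. This is not the code in the v1.3 release: the function `repeat_label` is hypothetical, and appending a short hash of the full unit is an added assumption so that two long units sharing the same first 30 bases still get distinct names. Only the 30 bp threshold comes from the message above.

```python
import hashlib

MAX_UNIT_CHARS = 30  # threshold mentioned in this thread

def repeat_label(chrom, start, end, unit, max_unit_chars=MAX_UNIT_CHARS):
    """Build a 'chrom-start-end-unit' label, shortening long repeat units."""
    if len(unit) > max_unit_chars:
        # Keep the first bases and add a short digest of the whole unit
        # (the digest is this sketch's own choice, not NanoRepeat's behavior).
        digest = hashlib.sha256(unit.encode("ascii")).hexdigest()[:8]
        unit_part = f"{unit[:max_unit_chars]}_{digest}"
    else:
        unit_part = unit
    return f"{chrom}-{start}-{end}-{unit_part}"

short_label = repeat_label("chr9", 10000, 10472, "TAACCC")
long_unit = "ACGT" * 20  # synthetic 80 bp unit
long_label = repeat_label("chr9", 106460, 108380, long_unit)
print(short_label)
print(long_label, len(long_label))
```

Short units keep their full sequence in the label, while the label for a long unit stays at a fixed length regardless of how long the unit is.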
Hello Li,
Thank you! I tried it and it went smoothly.
Thanks, Hsin
Thank you for the feedback!
Hi there,
I am very interested in this tool.
I got the REPEAT_region.bed from the Tandem Repeats Finder database and ran the following command.
python $script -i ${prefix}.fasta -t fasta -r $genome -b $predefined -c 4 --samtools $samtools_path --minimap2 $minimap2_path -o ${output_folder}/${prefix}
And then I got the following message (an excerpt here):
Is it wrong to use the bed file from Tandem Repeats Finder?
Here is my REPEAT_region.bed:
Besides, I got the error message as follows:
I guess it is because NanoRepeat is designed for short tandem repeats (STRs). Any comments on this?
Thanks, Hsin