WGLab / NanoRepeat

NanoRepeat: fast and accurate analysis of Short Tandem Repeats (STRs) from Oxford Nanopore sequencing data
MIT License

Question about repeat location and issue about OSError #4

Closed HLHsieh closed 1 year ago

HLHsieh commented 1 year ago

Hi there,

I am very interested in this tool.

I got the REPEAT_region.bed from the database of Tandem Repeat Finder and executed the following command:

python $script -i ${prefix}.fasta -t fasta -r $genome -b $predefined -c 4 --samtools $samtools_path --minimap2 $minimap2_path -o ${output_folder}/${prefix}

And then I got the following message (an excerpt here):

[03/17/2023 21:43:26] NOTICE: Input file is: ~/stimulated_test/C9ORF72_c1_simulated_reads.fasta
[03/17/2023 21:43:26] NOTICE: Input type is: fasta
[03/17/2023 21:43:26] NOTICE: Referece fasta file is: ~/Reference/Human/Genome/hg38/genome.fa
[03/17/2023 21:43:26] NOTICE: Output prefix is: ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads
[03/17/2023 21:43:26] NOTICE: Repeat region bed file is: ~/genome_TRF/chr9_hg38_NanoRepeat.bed
[03/17/2023 22:15:40] NOTICE: Reading repeat region file: ~/genome_TRF/chr9_hg38_NanoRepeat.bed
[03/17/2023 22:15:40] NOTICE: Reading reference fasta file: ~/Reference/Human/Genome/hg38/genome.fa
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-10000-10472-TAACCC
[main_samview] region "9:9990-10482" specifies an invalid region or unknown reference. Continue anyway.
[03/17/2023 22:16:43] WARNING! No reads were found in repeat region: 9-10000-10472-TAACCC
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG
[main_samview] region "9:10966-11062" specifies an invalid region or unknown reference. Continue anyway.
[03/17/2023 22:16:43] WARNING! No reads were found in repeat region: 9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-11165-11194-G
[main_samview] region "9:11155-11204" specifies an invalid region or unknown reference. Continue anyway.
[03/17/2023 22:16:43] WARNING! No reads were found in repeat region: 9-11165-11194-G
[03/17/2023 22:16:43] NOTICE: Quantifying repeat: 9-11338-11561-CGCCCCCTGCTGGCAGCTAGGGACACTGCAGGGCCCTCTTGCTCAAGGTATAGTGGTGGCA
[main_samview] region "9:11328-11571" specifies an invalid region or unknown reference. Continue anyway.

Am I wrong to use the bed file from Tandem Repeat Finder?

Here is my REPEAT_region.bed:

[image: screenshot of the REPEAT_region.bed file]

In addition, I got the following error message:

[03/17/2023 22:16:48] WARNING! No reads were found in repeat region: 9-106323-106481-TGTTCTTAATATTCAGAGAGGGAGACAATGATAATAATGTCAATATAACAGGGGACACACCAGCCCCGTGACGT
[03/17/2023 22:16:48] NOTICE: Quantifying repeat: 9-106460-108380-CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA
Traceback (most recent call last):
  File ~/bin/NanoRepeat/nanoRepeat.py, line 185, in <module>
    main()
  File ~/bin/NanoRepeat/nanoRepeat.py, line 174, in main
    preprocess_fastq(input_args)
  File ~/bin/NanoRepeat/nanoRepeat.py, line 91, in preprocess_fastq
    nanoRepeat_bam.nanoRepeat_bam(input_args, in_bam_file)
  File ~/bin/NanoRepeat/nanoRepeat_bam.py, line 578, in nanoRepeat_bam
    quantify1repeat_from_bam(input_args, in_bam_file, ref_fasta_dict, repeat_region)
  File ~/bin/NanoRepeat/nanoRepeat_bam.py, line 523, in quantify1repeat_from_bam
    os.makedirs(temp_out_dir, exist_ok=True)
  File /sw/spack/bio/pkgs/gcc-10.3.0/python/3.9.7-rv5ybzg3/lib/python3.9/os.py, line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 36] File name too long: ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.9-106460-108380-CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA

I guess this is because NanoRepeat is designed for STRs. Do you have any comments on this?

Thanks, Hsin

fangli80 commented 1 year ago

Hello, and thank you for using NanoRepeat. I can see that samtools reported this error: region "9:9990-10482" specifies an invalid region or unknown reference.

The chromosome names in your genome.fa may be in this style: chr9 instead of 9.

If so, please add the chr prefix to the chromosome names in the chr9_hg38_NanoRepeat.bed file.
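
For anyone hitting the same mismatch, here is a minimal Python sketch of that fix. It assumes the BED file is tab-delimited with the chromosome name in the first column; the function name add_chr_prefix is hypothetical and is not part of NanoRepeat.

```python
def add_chr_prefix(bed_lines):
    """Return BED lines whose chromosome names start with 'chr'."""
    fixed = []
    for line in bed_lines:
        # Leave blank lines and comment/header lines untouched.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            fixed.append(line)
            continue
        # The chromosome name is everything before the first tab.
        chrom, sep, rest = line.partition("\t")
        if not chrom.startswith("chr"):
            chrom = "chr" + chrom
        fixed.append(chrom + sep + rest)
    return fixed


# One line without the prefix, one line that already has it.
example = ["9\t10000\t10472\tTAACCC\n", "chr9\t11165\t11194\tG\n"]
print("".join(add_chr_prefix(example)), end="")
```

Lines that already carry the prefix are left unchanged, so running it twice on the same file is harmless.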

Best, Li

fangli80 commented 1 year ago

Hello, has this issue been solved?

Thanks, Li

HLHsieh commented 1 year ago

Hi Li,

Thanks for your suggestion. I added the chr prefix in my chr9_hg38_NanoRepeat.bed file and got the following messages (an excerpt):

[03/21/2023 16:27:48] NOTICE: Input file is: ~/stimulated_test/C9ORF72_c1_simulated_reads.fasta
[03/21/2023 16:27:48] NOTICE: Input type is: fasta
[03/21/2023 16:27:48] NOTICE: Referece fasta file is: ~/Reference/Human/Genome/hg38/genome.fa
[03/21/2023 16:27:48] NOTICE: Output prefix is: ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads
[03/21/2023 16:27:48] NOTICE: Repeat region bed file is: ~/genome_TRF/chr9_hg38_NanoRepeat_chr.bed
[03/21/2023 16:59:57] NOTICE: Reading repeat region file: ~/genome_TRF/chr9_hg38_NanoRepeat_chr.bed
[03/21/2023 16:59:57] NOTICE: Reading reference fasta file: ~/Reference/Human/Genome/hg38/genome.fa
[03/21/2023 17:01:00] NOTICE: Quantifying repeat: chr9-10000-10472-TAACCC
[03/21/2023 17:01:01] NOTICE: Step 1: finding anchor location in reads
[03/21/2023 17:01:01] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 8 -x map-ont ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10000-10472-TAACCC/anchors.fasta ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10000-10472-TAACCC/chr9-10000-10472-TAACCC.fastq > ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10000-10472-TAACCC/anchor_locations.paf 2> /dev/null
[03/21/2023 17:01:01] NOTICE: Step 1 finished
[03/21/2023 17:01:01] NOTICE: Step 2: round 1 and round 2 estimation
[03/21/2023 17:01:01] NOTICE: Step 2 finished
[03/21/2023 17:01:01] NOTICE: Step 3: round 3 estimation
[03/21/2023 17:01:01] NOTICE: Step 3 finished
[03/21/2023 17:01:01] NOTICE: Writing to repeat size file...
[03/21/2023 17:01:01] NOTICE: Step 4: phasing reads using GMM
[03/21/2023 17:01:01] ERROR! No reads were found for repeat region: chr9-10000-10472-TAACCC
[03/21/2023 17:01:01] NOTICE: Quantifying repeat: chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG
[03/21/2023 17:01:01] NOTICE: Step 1: finding anchor location in reads
[03/21/2023 17:01:01] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 8 -x map-ont ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/anchors.fasta ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG.fastq > ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/anchor_locations.paf 2> /dev/null
[03/21/2023 17:01:01] NOTICE: Step 1 finished
[03/21/2023 17:01:02] NOTICE: Step 2: round 1 and round 2 estimation
[03/21/2023 17:01:02] NOTICE: Running command: /sw/spack/bio/pkgs/gcc-10.3.0/minimap2/2.14-jvscyilw/bin/minimap2 -c -t 8  -x map-ont  ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/round1_ref.fasta ~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/core_sequences.fastq > ~/hsinlun/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-10976-11052-CCGGCGCAGGCGCAGAGAGGCGCGCCGCG/round1.paf 2> /dev/null
[03/21/2023 17:01:02] NOTICE: Step 2 finished
[03/21/2023 17:01:02] NOTICE: Step 3: round 3 estimation
[03/21/2023 17:01:02] NOTICE: Step 3 finished
[03/21/2023 17:01:02] NOTICE: Writing to repeat size file...
[03/21/2023 17:01:02] NOTICE: Step 4: phasing reads using GMM
[03/21/2023 17:01:05] NOTCIE: Number of alleles=1
[03/21/2023 17:01:06] NOTICE: Writing phasing results...
[03/21/2023 17:01:06] NOTICE: Writing to output fastq files...
[03/21/2023 17:01:06] NOTICE: Writing summary file...
[03/21/2023 17:01:06] NOTICE: Plotting figures...
[03/21/2023 17:01:06] NOTICE: Quantifying repeat: chr9-11165-11194-G

I think this fixed one of my issues, since I got the following output files (an excerpt):

C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.hist.png
C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.phased_reads.txt
C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.repeat_size.txt
C9ORF72_c1_simulated_reads.chr9-96950-97000-TCTCCTAATAATTACTAATAAG.summary.txt
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.allele1.fastq
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.hist.png
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.phased_reads.txt
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.repeat_size.txt
C9ORF72_c1_simulated_reads.chr9-99978-100145-GGGGCCTAAATGGTGATTGGCTCTGCTCTTGACCAATTGAACTCCATGCTCGGACT.summary.txt

However, in this run I still got the same error message that I mentioned in my previous post:

[03/21/2023 17:04:49] NOTICE: Quantifying repeat: chr9-106460-108380-CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA
Traceback (most recent call last):
  File "~/bin/NanoRepeat/nanoRepeat.py", line 185, in <module>
    main()
  File "~/bin/NanoRepeat/nanoRepeat.py", line 174, in main
    preprocess_fastq(input_args)
  File "~/bin/NanoRepeat/nanoRepeat.py", line 91, in preprocess_fastq
    nanoRepeat_bam.nanoRepeat_bam(input_args, in_bam_file)
  File "~/bin/NanoRepeat/nanoRepeat_bam.py", line 578, in nanoRepeat_bam
    quantify1repeat_from_bam(input_args, in_bam_file, ref_fasta_dict, repeat_region)
  File "~/bin/NanoRepeat/nanoRepeat_bam.py", line 523, in quantify1repeat_from_bam
    os.makedirs(temp_out_dir, exist_ok=True)
  File "/sw/spack/bio/pkgs/gcc-10.3.0/python/3.9.7-rv5ybzg3/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 36] File name too long: '~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-106460-108380CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA'

Is it caused by the length of the repeat unit?

I would appreciate it if you could advise.

Sincerely, Hsin

HLHsieh commented 1 year ago

Hi,

Here is an update.

The tool worked successfully after removing "chr9-106460-108380-CCCTGTGATATTGT..." from my chr9_hg38_NanoRepeat.bed file.

Do you have any comments on it?

Best, Hsin

fangli80 commented 1 year ago

Hello Hsin,

I can see this error:

OSError: [Errno 36] File name too long: '~/stimulated_test/C9ORF72_c1_simulated_reads_NanoRepeat/C9ORF72_c1_simulated_reads.NanoRepeat_temp_dir.chr9-106460-108380CCCTGTGATATTGTTTCTAATATCCAGAGGGGAGAAGATGATATTACTCCCAATATCAGAAGGGGTGTACACCCCTCCTGTGATATTGTTCCTAATATCCAGGGGGGGAGAGGATGATATTACTCCCAATATCGCAGGAGGTGTACACTCCCCCTGTGATATTGTTCCTAATATCCAGGGGGGAAGAGGATGATATTACTCCCAATATCGCTGGGGGTGTACACCCCCCCTGTGATATTGTTCCTAATATCCACGGGGGAGAGAGAATGATATTACTCCCAATATCGCAGGGGGTGTACACA'

NanoRepeat inserts the repeat unit into the names of its output files and temporary directories. If the repeat unit is too long, the name can exceed the maximum file name length that the file system allows.
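
To see ahead of time which entries could trigger this, here is a rough Python sketch. It is based on assumptions: the 255-byte limit is the usual per-name limit on Linux file systems and may differ on yours, temp_dir_name is a hypothetical helper that only imitates the directory names visible in the log, and "ACGT" * 80 is a stand-in for a long repeat unit, not the real sequence.

```python
NAME_MAX = 255  # assumption: typical limit for a single file or directory name


def temp_dir_name(prefix, chrom, start, end, unit):
    # Hypothetical helper: imitates the temp directory names seen in the log.
    return f"{prefix}.NanoRepeat_temp_dir.{chrom}-{start}-{end}-{unit}"


def too_long(name, limit=NAME_MAX):
    # The limit applies to the encoded length of one path component.
    return len(name.encode("utf-8")) > limit


prefix = "C9ORF72_c1_simulated_reads"
short_name = temp_dir_name(prefix, "chr9", 10000, 10472, "TAACCC")
# Stand-in 320 bp repeat unit, used only to illustrate the length problem.
long_name = temp_dir_name(prefix, "chr9", 106460, 108380, "ACGT" * 80)

print(too_long(short_name), too_long(long_name))  # prints: False True
```

The limit applies to each name in the path separately, so a shorter output folder path does not help; only a shorter repeat unit or output prefix does.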

I will fix this bug later today.

HLHsieh commented 1 year ago

Hi Li,

Thanks for your prompt response. Can I take it that NanoRepeat is able to handle repeat units longer than 10 bp, as in this case?

Best, Hsin

fangli80 commented 1 year ago

Yes, I think it should work for repeat units longer than 10 bp. The algorithm itself has no length limit, but I have not tested it on repeat units that are several hundred bp long.

HLHsieh commented 1 year ago

Good to know! I appreciate all your help.

fangli80 commented 1 year ago

You're welcome.

HLHsieh commented 1 year ago

Hi @fangli80

I am just following up to ask whether the file name issue has been fixed. Have you considered not inserting the repeat unit into the file name, or letting users specify the name themselves? Would that work?

Many Thanks, Hsin

fangli80 commented 1 year ago

Hello Hsin,

Sorry for the late update. The file name issue has been fixed in the latest version. Please use v1.3:

https://github.com/WGLab/NanoRepeat/releases/tag/v1.3

If the repeat unit is longer than 30 bp, only part of the sequence is inserted into the file name.
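
As an illustration of that idea only, here is a small Python sketch. It is not the actual v1.3 code: the function name short_repeat_id is hypothetical, and keeping the first 30 bases is an assumption, since the comment above does not say which part of the sequence is kept.

```python
MAX_UNIT_IN_NAME = 30  # threshold taken from the comment above


def short_repeat_id(chrom, start, end, unit, max_len=MAX_UNIT_IN_NAME):
    """Build a repeat ID for file names, shortening long repeat units."""
    if len(unit) > max_len:
        # Assumption: keep only the leading bases of the repeat unit.
        unit = unit[:max_len]
    return f"{chrom}-{start}-{end}-{unit}"


print(short_repeat_id("chr9", 10000, 10472, "TAACCC"))
# Stand-in 320 bp unit; the ID keeps only its first 30 bases.
print(short_repeat_id("chr9", 106460, 108380, "ACGT" * 80))
```

Because the coordinates stay in the ID, two repeats whose units share the same first 30 bases still get different names as long as their positions differ.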

Best, Li

HLHsieh commented 1 year ago

Hello Li,

Thank you! I tried it and it went smoothly.

Thanks, Hsin

fangli80 commented 1 year ago

Thank you for the feedback!