Closed milesandersonmn closed 2 years ago
Hi
The .out files can differ slightly between RepeatMasker versions. In this case, the first column seems to be the problem. Can you remove it (11 in the first line, 13 in the second, etc.) and try again? I'll try to find a way to autodetect it, but this would be a quick workaround. We don't use those numbers in SONIC anyway.
I edited the file with this awk command: awk '{$1=""}1' referenceScaffolds.fasta.out | awk -vOFS="\t" '{$1=$1}1'
Here is the file head:
perc perc perc query position in query matching repeat position in repeat
div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
8.1 0.0 10.3 Scaffold_12628_Chr2 619 650 (59402968) + (TTAAT)n Simple_repeat 129 (0) 1
26.7 2.0 2.0 Scaffold_12628_Chr2 2013 2063 (59401555) + (TTG)n Simple_repeat 1 51 (0) 2
21.2 0.0 2.2 Scaffold_12628_Chr2 5577 5622 (59397996) + A-rich Low_complexity 1 45 (0) 3
0.0 0.0 0.0 Scaffold_12628_Chr2 8956 8975 (59394643) + (AT)n Simple_repeat 1 20 (0) 4
0.0 0.0 0.0 Scaffold_12628_Chr2 19293 19309 (59384309) + (A)n Simple_repeat 1 17 (0) 5
36.4 0.0 0.0 Scaffold_12628_Chr2 20177 20256 (59383362) + (AAGAT)n Simple_repeat 180 (0) 6
17.8 4.3 2.1 Scaffold_12628_Chr2 32896 32941 (59370677) + (TAATA)n Simple_repeat 147 (0) 7
9.1 0.0 0.0 Scaffold_12628_Chr2 35207 35241 (59368377) + (GAATAG)n Simple_repeat 135 (0) 8
12.0 0.0 3.6 Scaffold_12628_Chr2 35845 35873 (59367745) + (TATAT)n Simple_repeat 128 (0) 9
25.0 6.1 0.0 Scaffold_12628_Chr2 36092 36157 (59367461) + (ATAATA)n Simple_repeat 170 (0) 10
7.8 3.2 6.7 Scaffold_12628_Chr2 36580 36610 (59367008) + (TTTAT)n Simple_repeat 130 (0) 11
1.6 0.0 0.0 Scaffold_12628_Chr2 40352 40480 (59363138) + (T)n Simple_repeat 1 129 (0) 12
The command still fails, but now with the following error message: return_value is 0 - sw_score=perc. RepeatMasker .out file has a problem. Exiting SONIC.
Thanks for the help! I'm wondering if the header might be the issue.
Hmm, it should skip the header, but somehow it didn't (see sw_score=perc, which is the first string in the header). That check was already coded but somehow failed. Removing the header should work, but I just pushed a small fix that "should" do the trick. Please pull.
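To illustrate why the error changed from sw_score=bit to sw_score=perc, here is a minimal sketch that replays the awk edit quoted above on a two-line sample (the first header line and one data row from the file head posted in this thread). The variable names are only for illustration. It shows that the edit strips the first field from the header as well, so a parser that reads the header as data would now see "perc" first.

```shell
# Sample: first header line plus one data row, copied from the file head in this thread.
sample='bit perc perc perc query position in query matching repeat position in repeat
11 8.1 0.0 10.3 Scaffold_12628_Chr2 619 650 (59402968) + (TTAAT)n Simple_repeat 1 29 (0) 1'

# Same pipeline as the edit above: blank out field 1, then re-join the fields with tabs.
stripped=$(printf '%s\n' "$sample" | awk '{$1=""}1' | awk -v OFS="\t" '{$1=$1}1')

# Print the first remaining token of each line.
first_tokens=$(printf '%s\n' "$stripped" | awk '{print $1}')
printf '%s\n' "$first_tokens"
```

The header line now starts with "perc" and the data row with "8.1", which is consistent with the header still being parsed as a data row.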
Still generates the same error code unfortunately. Worked by using "tail -n +4
Let me know if you want me to continue to test if you push new commits though. I have all the files available and it runs quite quickly.
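For anyone on a SONIC build without the later fix, a content-based filter is one alternative to dropping a fixed number of lines. This is only a sketch: `rm_data_rows` is a hypothetical helper, not part of SONIC or RepeatMasker, and it assumes that data rows have a numeric first field and at least 14 fields, while the two header lines and the blank line after them do not.

```shell
# Hypothetical helper (not part of SONIC): keep only lines that look like
# RepeatMasker data rows, i.e. a numeric first field and at least 14 fields.
# Header lines and blank lines fail these checks and are dropped.
rm_data_rows() {
    awk 'NF >= 14 && $1 ~ /^[0-9]+(\.[0-9]+)?$/' "$1"
}

# Small demonstration on a temporary copy of the file head quoted in this thread.
# The blank third line is an assumption about the usual .out layout.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
bit perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID

11 8.1 0.0 10.3 Scaffold_12628_Chr2 619 650 (59402968) + (TTAAT)n Simple_repeat 1 29 (0) 1
13 26.7 2.0 2.0 Scaffold_12628_Chr2 2013 2063 (59401555) + (TTG)n Simple_repeat 1 51 (0) 2
EOF

kept=$(rm_data_rows "$tmp" | wc -l | tr -d ' ')
echo "data rows kept: $kept"
rm -f "$tmp"
```

Filtering on row content rather than on line position keeps working if the number of header lines differs between RepeatMasker versions, at the cost of silently dropping any row that does not match the assumed shape.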
can you send me some test files so I can debug it directly? just the Scaffold_12628_Chr2 fasta file should be sufficient
Sure. I can send the chromosome 6 fasta. That's the smallest chromosome. Do you want me to run RepeatMasker on it and send the output files for it as well?
yes that would be perfect
Thanks for the test file. Please pull the latest. I tested it with the following commands, using empty gap and dup files. Do not edit the RepeatMasker file; that is no longer needed:
../sonic --ref chr6.fasta --gaps gap.bed --dups chr6.dups.bed --reps chr6.fasta.out --make-sonic chr6.sonic --info chr6-debug
Number of chromosomes: 1
Adding gap intervals to SONIC.
Read 0 BED entries.
Writing entries for chromosome 0
Wrote 0 entries.
Adding segmental duplication intervals to SONIC.
Read 0 BED entries.
Writing entries for chromosome 0
Wrote 0 entries.
Adding 19322 repeats to SONIC.
Read 19319 BED entries.
Writing entries for chromosome 0
Wrote 19319 entries.
Adding GC profile SONIC.
SONIC file chr6.sonic is ready.
Memory usage: 6.16 MB.
../sonic --test-sonic chr6.sonic
Loading SONIC file..
SONIC Info: chr6-debug Built in Tue May 10 09:11:30 2022
Number of chromosomes: 1
Loading gap intervals... 0 intervals loaded.
Loading duplication intervals... 0 intervals loaded.
Loading repeats... 19319 intervals loaded.
SONIC file loaded.
Memory usage: 1.96 MB.
[SONIC] Number of chromosomes: 1
[SONIC] Genome length: 53750732
[SONIC] The SONIC file chr6.sonic seems to be valid.
Memory usage: 1.96 MB.
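The test above used empty gap and dup files. As a minimal sketch of how such placeholder inputs could be created, assuming zero-byte files are what "empty" means here (the file names are taken from the command in this thread; the scratch directory is only for illustration):

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# ':' is the shell no-op, so each redirect creates (or truncates to) a 0-byte file.
: > gap.bed
: > chr6.dups.bed

# Count the lines in each placeholder file; both counts are 0 for empty files.
gap_lines=$(wc -l < gap.bed | tr -d ' ')
dup_lines=$(wc -l < chr6.dups.bed | tr -d ' ')
echo "gap.bed: $gap_lines lines, chr6.dups.bed: $dup_lines lines"
```

With zero lines in each file, the "Read 0 BED entries." messages for gaps and segmental duplications in the log above are what one would expect.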
Works perfect! Closing the issue now. Thanks!
I'm trying to create a SONIC file to run valor2 on 10x linked reads, but I get the following error message when building my SONIC file:
return_value is 0 - sw_score=bit. RepeatMasker .out file has a problem. Exiting SONIC.
Here is the head of my "referenceScaffolds.fasta.out" file:
bit perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
11 8.1 0.0 10.3 Scaffold_12628_Chr2 619 650 (59402968) + (TTAAT)n Simple_repeat 1 29 (0) 1
13 26.7 2.0 2.0 Scaffold_12628_Chr2 2013 2063 (59401555) + (TTG)n Simple_repeat 1 51 (0) 2
17 21.2 0.0 2.2 Scaffold_12628_Chr2 5577 5622 (59397996) + A-rich Low_complexity 1 45 (0) 3
18 0.0 0.0 0.0 Scaffold_12628_Chr2 8956 8975 (59394643) + (AT)n Simple_repeat 1 20 (0) 4
16 0.0 0.0 0.0 Scaffold_12628_Chr2 19293 19309 (59384309) + (A)n Simple_repeat 1 17 (0) 5
17 36.4 0.0 0.0 Scaffold_12628_Chr2 20177 20256 (59383362) + (AAGAT)n Simple_repeat 1 80 (0) 6
15 17.8 4.3 2.1 Scaffold_12628_Chr2 32896 32941 (59370677) + (TAATA)n Simple_repeat 1 47 (0) 7
Any ideas what could be causing the issue? I thought maybe it was the inclusion of underscores in the chromosome names, but couldn't find anything in the documentation.
Here's the command:
sonic --ref referenceScaffolds.fasta --dups segdups.bedpe --reps referenceScaffolds.fasta.out --gaps referenceScaffoldsGaps.bed --make-sonic referenceScaffolds.sonic