Closed Wenwen012345 closed 1 year ago
Hello,
Thanks for reaching out to us. Unfortunately, this appears to be an issue with RepeatMasker itself as the program is not running properly and might be related to an issue with the recent conda build (see https://github.com/rmhubley/RepeatMasker/issues/221). If you can, I would wait until you can get the official version and try that.
Nick
On Mon, Jun 26, 2023 at 1:40 PM Wenwen @.***> wrote:
Dear @panchyni https://github.com/panchyni @shinhanshiu https://github.com/shinhanshiu
Hello there! This is a very good tool that suits my needs very well and I look forward to referencing it later. I'm having a RepeatMasker problem while running the test pipeline for PseudogenePipeline, please review? I have the conda version of RepeatMasker. Due to internet speed problems I may not be able to get the official version until two days later. Anyway, I'm not quite sure what the problem is at the moment.
(base) @.***:~/tool/PseudogenePipeline-master/_test25$ python ~/tool/PseudogenePipeline-master/_scripts/PseudogenePipeline.py test_parameter_file here
Step 0: read, check, and set parameters
Check parameter file... Check names among genome/protein seq, and blast output Determining intron length using provided 95th percentile intron length 95th percentile intron legnth is 456 Set default values default: ev_t=5 default: id_t=40 default: ml_t=30 default: ml_p=0.05
Step 1. Filtering BLAST output...
Filter test_prot25_tblastn6... E:5 I:40 L:30 P:0.05
Step 2. Get pseudoexon
Pseudoexon file: test_prot25_tblastn6_parsed_G456.PE
Step 3. Get phase 1 pseudogene
Get initial phase 1 pseudogene phase1 file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1 Get pair file and subject coordinates pair file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs coordinates file:test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord Get phase 1 pseudogene sequences phase 1 seq: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord.fa
Step 4. Get Smith-Waterman alignments with fasta
Find stop and framshifts
here Program : tfasty36 Pair list: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs Fasta1 : test_prot25.fa Fasta2 : test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord.fa Fasta dir: /home/w/tool/fasta36-36.3.8/bin Working d: tmp_1687798035.136786/ Flags : -m 3 -q E thres : 1.0 Read gene pairs... 52 pairs Read fasta files... Do sw... 0 x 100
Done! smith-Waterman outputs: test_prot25_tblastn6_parsed_G456.PE_I456.PS1_pairs.sw* Read BLOSUM50 matrix... Read the sw.out file... Compare sequences: total: 52 alignments 0.0 Done! disable_count file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs.sw.out.disable_count
Step 5. Convert coordinates
Convert S-W coordinates to genome-based ones processed Disable_count file: test_prot25_tblastn6_parsed.4col Process Maker format GFF if necessary
Step 6. Determine overlap with annotated genes
Find Overlaps between predicted pseudogenes and genes output: test_prot25_tblastn6_parsed.4col.onlyoverlap overlap file: test_prot25_tblastn6_parsed.4col.onlyoverlap Do not filter pseudogene predictions overlapping with genes output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count
Step 7. Repeat masking
Repeat masking prep here Sequence to dict... Read coordinates... 0 k shortened name file: test_prot25_tblastn6_parsed.4col.true.fa.mod 2 col reference file: test_prot25_tblastn6_parsed.4col.true.fa.ref
Run RepeatMasker /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 45: =head1: not found /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 47: RepeatMasker: not found /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 49: =head1: not found /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 51: Syntax error: "(" unexpected
Parse RepeatMasker output Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/parse_RepeatMasker_gff.py", line 5, in
file1=open(sys.argv[1],'r') FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out' Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/1_3_removeCodeNamesFromRM4col.py", line 9, in four = open(sys.argv[1]) #4input coded file FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col' Get simple repeats grep: test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col.longNames: 没有那个文件或目录
Filter out repeats disable_count output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count.RMfilt 4col format output file: test_prot25_tblastn6_parsed.4col.true.RMfilt fasta format output file: test_prot25_tblastn6_parsed.4col.true.fa.RMfilt repeat Masked files end with '.RMfilt'
Step 8. Run high-confidence filter
Get genome and protein sequence sizes here Total length: 18585056 here Total length: 10253
Run High-Confidence Filter high-Confidence files: '.hiConf'
Replace gene-names here name corrected files end with '.cdnm'
Write Pseudogene GFF GFF files: test_prot25_tblastn6_parsed.4col.true.fa.ref test_prot25_tblastn6_parsed.4col.true.RMfilt.hiConf.cdnm.gff
Step 9. Clean up
Removing temporary genome and protein sequences files Moving intermediate files to _intermediate Move hiConf, codename and gff files to _results Move log files to _logs
Done!!!
log 1.txt https://github.com/ShiuLab/PseudogenePipeline/files/11872123/log.1.txt
— Reply to this email directly, view it on GitHub https://github.com/ShiuLab/PseudogenePipeline/issues/13, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANDZPKMIAL6XJB57JNKFJTXNHCR5ANCNFSM6AAAAAAZUPRQDA . You are receiving this because you were mentioned.Message ID: @.***>
Thank you for your kind reply. Just got up. I'll try that, working on installing the library now (according to http://www.repeatmasker.org/RepeatMasker/). It may consume some time as well.
Hello @panchyni ! Update, I have RepeatMakser installed but I get errors on the species. I tried to fill in some common species (e.g. Arabidopsis thaliana, Homo sapiens) but it still reports the same error and I'm a bit confused.
#### Step 7. Repeat masking
Repeat masking prep
here
Sequence to dict...
Read coordinates...
0 k
shortened name file: test_prot25_tblastn6_parsed.4col.true.fa.mod
2 col reference file: test_prot25_tblastn6_parsed.4col.true.fa.ref
Run RepeatMasker
RepeatMasker version 4.1.4
Search Engine: NCBI/RMBLAST [ 2.14.0+ ]
Using Master RepeatMasker Database: /home/w/tool/RepeatMasker-4.1.4/RepeatMasker/Libraries/RepeatMaskerLib.h5
Title :
Version :
Date :
Families :
Species "viridiplantae" is not known to RepeatMasker. There may
not be any TE families defined in the libraries for this
species/clade or there may be an error in the spelling.
Please check your entry against the NCBI Taxonomy database
and/or try using a broader clade or related species instead.
The full list of species/clades defined in the library may be
obtained using the famdb.py script.
Parse RepeatMasker output
Traceback (most recent call last):
File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/parse_RepeatMasker_gff.py", line 5, in <module>
file1=open(sys.argv[1],'r')
FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out'
Traceback (most recent call last):
File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/1_3_removeCodeNamesFromRM4col.py", line 9, in <module>
four = open(sys.argv[1]) #4input coded file
FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col'
Get simple repeats
grep: test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col.longNames: 没有那个文件或目录
Filter out repeats
disable_count output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count.RMfilt
4col format output file: test_prot25_tblastn6_parsed.4col.true.RMfilt
fasta format output file: test_prot25_tblastn6_parsed.4col.true.fa.RMfilt
repeat Masked files end with '.RMfilt'
My RepeatMasker installation process is referenced at http://www.repeatmasker.org/RepeatMasker/. And I only installed the RepBaseRepeatMaskerEdition library (because the complete Dfam 3.7 library is too big).
In addition, I just wanted to ask about the genome I am using which is already a repeat sequence masked genome (using EDTA, https://github.com/oushujun/EDTA). Is it possible to disregard this step when running my genome file and just ignore the error message?
You may need to get the full version of Dfam (see https://www.dfam.org/releases/Dfam_3.7/families/README.txt) which recommends using FamDB (https://github.com/Dfam-consortium/FamDB/). FamDB will also allow you to explore the database and check what the taxa names it uses are.
On Mon, Jun 26, 2023 at 11:42 PM Wenwen @.***> wrote:
Hello! Update, I have RepeatMakser installed but I get errors on the species. I tried to fill in some common species (e.g. Arabidopsis thaliana, Homo sapiens) but it still reports the same error and I'm a bit confused.
Step 7. Repeat masking
Repeat masking prep here Sequence to dict... Read coordinates... 0 k shortened name file: test_prot25_tblastn6_parsed.4col.true.fa.mod 2 col reference file: test_prot25_tblastn6_parsed.4col.true.fa.ref
Run RepeatMasker RepeatMasker version 4.1.4 Search Engine: NCBI/RMBLAST [ 2.14.0+ ]
Using Master RepeatMasker Database: /home/w/tool/RepeatMasker-4.1.4/RepeatMasker/Libraries/RepeatMaskerLib.h5 Title : Version : Date : Families :
Species "viridiplantae" is not known to RepeatMasker. There may not be any TE families defined in the libraries for this species/clade or there may be an error in the spelling. Please check your entry against the NCBI Taxonomy database and/or try using a broader clade or related species instead. The full list of species/clades defined in the library may be obtained using the famdb.py script.
Parse RepeatMasker output Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/parse_RepeatMasker_gff.py", line 5, in
file1=open(sys.argv[1],'r') FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out' Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/1_3_removeCodeNamesFromRM4col.py", line 9, in four = open(sys.argv[1]) #4input coded file FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col' Get simple repeats grep: test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col.longNames: 没有那个文件或目录
Filter out repeats disable_count output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count.RMfilt 4col format output file: test_prot25_tblastn6_parsed.4col.true.RMfilt fasta format output file: test_prot25_tblastn6_parsed.4col.true.fa.RMfilt repeat Masked files end with '.RMfilt'
In addition, I just wanted to ask about the genome I am using which is already a repeat sequence masked genome (using EDTA, https://github.com/oushujun/EDTA). Is it possible to ignore the error message of this step when running my genome file?
— Reply to this email directly, view it on GitHub https://github.com/ShiuLab/PseudogenePipeline/issues/13#issuecomment-1608736067, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANDZPMLDUEOD4WU4HW2KWLXNJJB3ANCNFSM6AAAAAAZUPRQDA . You are receiving this because you were mentioned.Message ID: @.***>
You may need to get the full version of Dfam (see
Thanks so much for your kind help!
@panchyni Hi, I have run through the whole process; I will check the results afterwards. By the way, the installation process of RepeatMasker is really slow and cumbersome.
That sounds about right for RepeatMasker, unfortunately. Still the best option for what we do, as far as I know.
On Fri, Jun 30, 2023 at 10:00 AM Wenwen @.***> wrote:
@panchyni https://github.com/panchyni Hi, I have run through the whole process; I will check the results afterwards. By the way, the installation process of RepeatMasker is really slow and cumbersome.
— Reply to this email directly, view it on GitHub https://github.com/ShiuLab/PseudogenePipeline/issues/13#issuecomment-1614697475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANDZPO6625R6VOWFIV5DCTXN3LYLANCNFSM6AAAAAAZUPRQDA . You are receiving this because you were mentioned.Message ID: @.***>
@panchyni Thank you very much. Yes, understood, RepeatMasker is definitely the core software for identifying repeat sequences at the moment, but its new version has some bugs that don't work with the old database RepBase (https://github.com/rmhubley/RepeatMasker/issues/199), so it keeps reporting errors. The good thing is that in the end the error has no effect on the results. I lost some patience yesterday and it was my problem.
Anyway, thank you very much for your help and for developing such a meaningful and excellent software!
Dear @panchyni @shinhanshiu
Hello there! This is a very good tool that suits my needs very well and I look forward to referencing it later. I'm having a RepeatMasker problem while running the test pipeline for PseudogenePipeline, please review? I have the conda version of RepeatMasker. Due to internet speed problems I may not be able to get the official version until two days later. Anyway, I'm not quite sure what the problem is at the moment.
test_parameter_file.txt