ShiuLab / PseudogenePipeline

Creative Commons Zero v1.0 Universal
36 stars 16 forks source link

RepeatMasker problem #13

Closed Wenwen012345 closed 1 year ago

Wenwen012345 commented 1 year ago

Dear @panchyni @shinhanshiu

Hello there! This is a very good tool that suits my needs very well and I look forward to referencing it later. I'm having a RepeatMasker problem while running the test pipeline for PseudogenePipeline, please review? I have the conda version of RepeatMasker. Due to internet speed problems I may not be able to get the official version until two days later. Anyway, I'm not quite sure what the problem is at the moment.

(base) w@w:~/tool/PseudogenePipeline-master/_test25$ python ~/tool/PseudogenePipeline-master/_scripts/PseudogenePipeline.py test_parameter_file
here

#### Step 0: read, check, and set parameters

Check parameter file...
Check names among genome/protein seq, and blast output
Determining intron length
  using provided 95th percentile intron length
  95th percentile intron legnth is 456
Set default values
  default: ev_t=5
  default: id_t=40
  default: ml_t=30
  default: ml_p=0.05

#### Step 1. Filtering BLAST output...

Filter test_prot25_tblastn6...
 E:5 I:40 L:30 P:0.05

#### Step 2. Get pseudoexon

Pseudoexon file: test_prot25_tblastn6_parsed_G456.PE

#### Step 3. Get phase 1 pseudogene

Get initial phase 1 pseudogene
  phase1 file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1
Get pair file and subject coordinates
  pair file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs
  coordinates file:test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord
Get phase 1 pseudogene sequences
  phase 1 seq: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord.fa

#### Step 4. Get Smith-Waterman alignments with fasta

Find stop and framshifts

here
Program  : tfasty36
Pair list: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs
Fasta1   : test_prot25.fa
Fasta2   : test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord.fa
Fasta dir: /home/w/tool/fasta36-36.3.8/bin
Working d: tmp_1687798035.136786/
Flags    : -m 3 -q
E thres  : 1.0
Read gene pairs...
 52 pairs
Read fasta files...
Do sw...
 0 x 100

Done!
  smith-Waterman outputs: test_prot25_tblastn6_parsed_G456.PE_I456.PS1_pairs.sw*
Read BLOSUM50 matrix...
Read the sw.out file...
Compare sequences:
 total: 52 alignments
 0.0
Done!
  disable_count file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs.sw.out.disable_count

#### Step 5. Convert coordinates

Convert S-W coordinates to genome-based ones
  processed Disable_count file: test_prot25_tblastn6_parsed.4col
Process Maker format GFF if necessary

#### Step 6. Determine overlap with annotated genes

Find Overlaps between predicted pseudogenes and genes
 output: test_prot25_tblastn6_parsed.4col.onlyoverlap
  overlap file: test_prot25_tblastn6_parsed.4col.onlyoverlap
Do not filter pseudogene predictions overlapping with genes
  output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count

#### Step 7. Repeat masking

Repeat masking prep
here
Sequence to dict...
Read coordinates...
 0 k
  shortened name file: test_prot25_tblastn6_parsed.4col.true.fa.mod
  2 col reference file: test_prot25_tblastn6_parsed.4col.true.fa.ref

Run RepeatMasker
/home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 45: =head1: not found
/home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 47: RepeatMasker: not found
/home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 49: =head1: not found
/home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 51: Syntax error: "(" unexpected

Parse RepeatMasker output
Traceback (most recent call last):
  File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/parse_RepeatMasker_gff.py", line 5, in <module>
    file1=open(sys.argv[1],'r')
FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out'
Traceback (most recent call last):
  File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/1_3_removeCodeNamesFromRM4col.py", line 9, in <module>
    four = open(sys.argv[1])     #4input coded file                                    
FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col'

Get simple repeats
grep: test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col.longNames: 没有那个文件或目录

Filter out repeats
  disable_count output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count.RMfilt
  4col format output file: test_prot25_tblastn6_parsed.4col.true.RMfilt
  fasta format output file: test_prot25_tblastn6_parsed.4col.true.fa.RMfilt
  repeat Masked files end with '.RMfilt'

#### Step 8. Run high-confidence filter

Get genome and protein sequence sizes
here
Total length: 18585056
here
Total length: 10253

Run High-Confidence Filter
  high-Confidence files: '.hiConf'

Replace gene-names
here
  name corrected files end with '.cdnm'

Write Pseudogene GFF
  GFF files: test_prot25_tblastn6_parsed.4col.true.fa.ref test_prot25_tblastn6_parsed.4col.true.RMfilt.hiConf.cdnm.gff

#### Step 9. Clean up

Removing temporary genome and protein sequences files
Moving intermediate files to _intermediate
Move hiConf, codename and gff files to _results
Move log files to _logs

Done!!!

test_parameter_file.txt

panchyni commented 1 year ago

Hello,

Thanks for reaching out to us. Unfortunately, this appears to be an issue with RepeatMasker itself as the program is not running properly and might be related to an issue with the recent conda build (see https://github.com/rmhubley/RepeatMasker/issues/221). If you can, I would wait until you can get the official version and try that.

Nick

On Mon, Jun 26, 2023 at 1:40 PM Wenwen @.***> wrote:

Dear @panchyni https://github.com/panchyni @shinhanshiu https://github.com/shinhanshiu

Hello there! This is a very good tool that suits my needs very well and I look forward to referencing it later. I'm having a RepeatMasker problem while running the test pipeline for PseudogenePipeline, please review? I have the conda version of RepeatMasker. Due to internet speed problems I may not be able to get the official version until two days later. Anyway, I'm not quite sure what the problem is at the moment.

(base) @.***:~/tool/PseudogenePipeline-master/_test25$ python ~/tool/PseudogenePipeline-master/_scripts/PseudogenePipeline.py test_parameter_file here

Step 0: read, check, and set parameters

Check parameter file... Check names among genome/protein seq, and blast output Determining intron length using provided 95th percentile intron length 95th percentile intron legnth is 456 Set default values default: ev_t=5 default: id_t=40 default: ml_t=30 default: ml_p=0.05

Step 1. Filtering BLAST output...

Filter test_prot25_tblastn6... E:5 I:40 L:30 P:0.05

Step 2. Get pseudoexon

Pseudoexon file: test_prot25_tblastn6_parsed_G456.PE

Step 3. Get phase 1 pseudogene

Get initial phase 1 pseudogene phase1 file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1 Get pair file and subject coordinates pair file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs coordinates file:test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord Get phase 1 pseudogene sequences phase 1 seq: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord.fa

Step 4. Get Smith-Waterman alignments with fasta

Find stop and framshifts

here Program : tfasty36 Pair list: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs Fasta1 : test_prot25.fa Fasta2 : test_prot25_tblastn6_parsed_G456.PE_I456.PS1.subj_coord.fa Fasta dir: /home/w/tool/fasta36-36.3.8/bin Working d: tmp_1687798035.136786/ Flags : -m 3 -q E thres : 1.0 Read gene pairs... 52 pairs Read fasta files... Do sw... 0 x 100

Done! smith-Waterman outputs: test_prot25_tblastn6_parsed_G456.PE_I456.PS1_pairs.sw* Read BLOSUM50 matrix... Read the sw.out file... Compare sequences: total: 52 alignments 0.0 Done! disable_count file: test_prot25_tblastn6_parsed_G456.PE_I456.PS1.pairs.sw.out.disable_count

Step 5. Convert coordinates

Convert S-W coordinates to genome-based ones processed Disable_count file: test_prot25_tblastn6_parsed.4col Process Maker format GFF if necessary

Step 6. Determine overlap with annotated genes

Find Overlaps between predicted pseudogenes and genes output: test_prot25_tblastn6_parsed.4col.onlyoverlap overlap file: test_prot25_tblastn6_parsed.4col.onlyoverlap Do not filter pseudogene predictions overlapping with genes output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count

Step 7. Repeat masking

Repeat masking prep here Sequence to dict... Read coordinates... 0 k shortened name file: test_prot25_tblastn6_parsed.4col.true.fa.mod 2 col reference file: test_prot25_tblastn6_parsed.4col.true.fa.ref

Run RepeatMasker /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 45: =head1: not found /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 47: RepeatMasker: not found /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 49: =head1: not found /home/w/miniconda3/pkgs/repeatmasker-4.1.5-pl5321hdfd78af_0/share/RepeatMasker/RepeatMasker: 51: Syntax error: "(" unexpected

Parse RepeatMasker output Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/parse_RepeatMasker_gff.py", line 5, in file1=open(sys.argv[1],'r') FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out' Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/1_3_removeCodeNamesFromRM4col.py", line 9, in four = open(sys.argv[1]) #4input coded file FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col'

Get simple repeats grep: test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col.longNames: 没有那个文件或目录

Filter out repeats disable_count output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count.RMfilt 4col format output file: test_prot25_tblastn6_parsed.4col.true.RMfilt fasta format output file: test_prot25_tblastn6_parsed.4col.true.fa.RMfilt repeat Masked files end with '.RMfilt'

Step 8. Run high-confidence filter

Get genome and protein sequence sizes here Total length: 18585056 here Total length: 10253

Run High-Confidence Filter high-Confidence files: '.hiConf'

Replace gene-names here name corrected files end with '.cdnm'

Write Pseudogene GFF GFF files: test_prot25_tblastn6_parsed.4col.true.fa.ref test_prot25_tblastn6_parsed.4col.true.RMfilt.hiConf.cdnm.gff

Step 9. Clean up

Removing temporary genome and protein sequences files Moving intermediate files to _intermediate Move hiConf, codename and gff files to _results Move log files to _logs

Done!!!

log 1.txt https://github.com/ShiuLab/PseudogenePipeline/files/11872123/log.1.txt

— Reply to this email directly, view it on GitHub https://github.com/ShiuLab/PseudogenePipeline/issues/13, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANDZPKMIAL6XJB57JNKFJTXNHCR5ANCNFSM6AAAAAAZUPRQDA . You are receiving this because you were mentioned.Message ID: @.***>

Wenwen012345 commented 1 year ago

Thank you for your kind reply. Just got up. I'll try that, working on installing the library now (according to http://www.repeatmasker.org/RepeatMasker/). It may consume some time as well.

Wenwen012345 commented 1 year ago

Hello @panchyni ! Update, I have RepeatMakser installed but I get errors on the species. I tried to fill in some common species (e.g. Arabidopsis thaliana, Homo sapiens) but it still reports the same error and I'm a bit confused.

#### Step 7. Repeat masking

Repeat masking prep
here
Sequence to dict...
Read coordinates...
 0 k
  shortened name file: test_prot25_tblastn6_parsed.4col.true.fa.mod
  2 col reference file: test_prot25_tblastn6_parsed.4col.true.fa.ref

Run RepeatMasker
RepeatMasker version 4.1.4
Search Engine: NCBI/RMBLAST [ 2.14.0+ ]

Using Master RepeatMasker Database: /home/w/tool/RepeatMasker-4.1.4/RepeatMasker/Libraries/RepeatMaskerLib.h5
  Title    : 
  Version  : 
  Date     : 
  Families : 

Species "viridiplantae" is not known to RepeatMasker.  There may
not be any TE families defined in the libraries for this
species/clade or there may be an error in the spelling.
Please check your entry against the NCBI Taxonomy database
and/or try using a broader clade or related species instead.
The full list of species/clades defined in the library may be
obtained using the famdb.py script.

Parse RepeatMasker output
Traceback (most recent call last):
  File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/parse_RepeatMasker_gff.py", line 5, in <module>
    file1=open(sys.argv[1],'r')
FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out'
Traceback (most recent call last):
  File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/1_3_removeCodeNamesFromRM4col.py", line 9, in <module>
    four = open(sys.argv[1])     #4input coded file                                    
FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col'

Get simple repeats
grep: test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col.longNames: 没有那个文件或目录

Filter out repeats
  disable_count output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count.RMfilt
  4col format output file: test_prot25_tblastn6_parsed.4col.true.RMfilt
  fasta format output file: test_prot25_tblastn6_parsed.4col.true.fa.RMfilt
  repeat Masked files end with '.RMfilt'

My RepeatMasker installation process is referenced at http://www.repeatmasker.org/RepeatMasker/. And I only installed the RepBaseRepeatMaskerEdition library (because the complete Dfam 3.7 library is too big).

In addition, I just wanted to ask about the genome I am using which is already a repeat sequence masked genome (using EDTA, https://github.com/oushujun/EDTA). Is it possible to disregard this step when running my genome file and just ignore the error message?

panchyni commented 1 year ago

You may need to get the full version of Dfam (see https://www.dfam.org/releases/Dfam_3.7/families/README.txt) which recommends using FamDB (https://github.com/Dfam-consortium/FamDB/). FamDB will also allow you to explore the database and check what the taxa names it uses are.

On Mon, Jun 26, 2023 at 11:42 PM Wenwen @.***> wrote:

Hello! Update, I have RepeatMakser installed but I get errors on the species. I tried to fill in some common species (e.g. Arabidopsis thaliana, Homo sapiens) but it still reports the same error and I'm a bit confused.

Step 7. Repeat masking

Repeat masking prep here Sequence to dict... Read coordinates... 0 k shortened name file: test_prot25_tblastn6_parsed.4col.true.fa.mod 2 col reference file: test_prot25_tblastn6_parsed.4col.true.fa.ref

Run RepeatMasker RepeatMasker version 4.1.4 Search Engine: NCBI/RMBLAST [ 2.14.0+ ]

Using Master RepeatMasker Database: /home/w/tool/RepeatMasker-4.1.4/RepeatMasker/Libraries/RepeatMaskerLib.h5 Title : Version : Date : Families :

Species "viridiplantae" is not known to RepeatMasker. There may not be any TE families defined in the libraries for this species/clade or there may be an error in the spelling. Please check your entry against the NCBI Taxonomy database and/or try using a broader clade or related species instead. The full list of species/clades defined in the library may be obtained using the famdb.py script.

Parse RepeatMasker output Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/parse_RepeatMasker_gff.py", line 5, in file1=open(sys.argv[1],'r') FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out' Traceback (most recent call last): File "/home/w/tool/PseudogenePipeline-master/_test25/../_scripts/1_3_removeCodeNamesFromRM4col.py", line 9, in four = open(sys.argv[1]) #4input coded file FileNotFoundError: [Errno 2] No such file or directory: 'test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col'

Get simple repeats grep: test_prot25_tblastn6_parsed.4col.true.fa.mod.mod.out.300Cutoff30.0divg.4col.longNames: 没有那个文件或目录

Filter out repeats disable_count output file: test_prot25_tblastn6_parsed.fullyFiltered.disable_count.RMfilt 4col format output file: test_prot25_tblastn6_parsed.4col.true.RMfilt fasta format output file: test_prot25_tblastn6_parsed.4col.true.fa.RMfilt repeat Masked files end with '.RMfilt'

In addition, I just wanted to ask about the genome I am using which is already a repeat sequence masked genome (using EDTA, https://github.com/oushujun/EDTA). Is it possible to ignore the error message of this step when running my genome file?

— Reply to this email directly, view it on GitHub https://github.com/ShiuLab/PseudogenePipeline/issues/13#issuecomment-1608736067, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANDZPMLDUEOD4WU4HW2KWLXNJJB3ANCNFSM6AAAAAAZUPRQDA . You are receiving this because you were mentioned.Message ID: @.***>

Wenwen012345 commented 1 year ago

You may need to get the full version of Dfam (see

Thanks so much for your kind help!

Wenwen012345 commented 1 year ago

@panchyni Hi, I have run through the whole process; I will check the results afterwards. By the way, the installation process of RepeatMasker is really slow and cumbersome.

panchyni commented 1 year ago

That sounds about right for RepeatMasker, unfortunately. Still the best option for what we do, as far as I know.

On Fri, Jun 30, 2023 at 10:00 AM Wenwen @.***> wrote:

@panchyni https://github.com/panchyni Hi, I have run through the whole process; I will check the results afterwards. By the way, the installation process of RepeatMasker is really slow and cumbersome.

— Reply to this email directly, view it on GitHub https://github.com/ShiuLab/PseudogenePipeline/issues/13#issuecomment-1614697475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANDZPO6625R6VOWFIV5DCTXN3LYLANCNFSM6AAAAAAZUPRQDA . You are receiving this because you were mentioned.Message ID: @.***>

Wenwen012345 commented 1 year ago

@panchyni Thank you very much. Yes, understood, RepeatMasker is definitely the core software for identifying repeat sequences at the moment, but its new version has some bugs that don't work with the old database RepBase (https://github.com/rmhubley/RepeatMasker/issues/199), so it keeps reporting errors. The good thing is that in the end the error has no effect on the results. I lost some patience yesterday and it was my problem.

Anyway, thank you very much for your help and for developing such a meaningful and excellent software!