DerKevinRiehl / TransposonUltimate

TransposonUltimate - a holistic set of tools for transposon identification
GNU General Public License v3.0

RepeatModeler parsing error #3

Closed fantin-mesny closed 2 years ago

fantin-mesny commented 2 years ago

Dear Kevin,

Many thanks for developing TransposonUltimate.

I have used reasonaTE to run all the annotation tools on 70+ fungal genomes, and I am now proceeding to the parsing step.

For some genomes, it works without any problem. However, for quite a few genomes the parser fails in the same way on the RepeatModeler output. Please see below:

Parse must...
Parse NCBICDD1000 outputs...
Parse repeatModeler...
Traceback (most recent call last):
  File "/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/TransposonAnnotator.py", line 114, in <module>
    parseAvailableResults(projectFolderPath)
  File "/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/AnnotationParser.py", line 1346, in parseAvailableResults
    parseRepeatModeler(pathResDir, fastaFile, targetGFFFile, targetGFFrepe, targetFastaFile)
  File "/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/AnnotationParser.py", line 1243, in parseRepeatModeler
    start  = int(transposons[0].split(":")[1].split("-")[0])
IndexError: list index out of range

I tried rerunning reasonaTE with the repeatmodel tool for the genomes that show this error during parsing, but this did not solve the issue.
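To make the traceback easier to read, here is a minimal sketch of how the failing expression behaves. The parser's source is not shown in this thread, so two things are assumptions: that `transposons` holds the whitespace-separated tokens of one line, and that the first token looks like `name:start-end`. The function name `parse_start` and the sample ID `scaffold_1:1200-1850` are made up for illustration.

```python
# Hypothetical reconstruction, NOT taken from reasonaTE:
# `transposons` is assumed to be the whitespace-separated tokens of one line.
def parse_start(line):
    transposons = line.split()
    # Same expression as reported in the traceback
    return int(transposons[0].split(":")[1].split("-")[0])


# A well-formed line (made-up ID) yields the start coordinate.
print(parse_start("scaffold_1:1200-1850 ACGTACGT"))

# An empty line, or a first token without ":", raises IndexError instead.
for bad_line in ["", "scaffold_1 ACGTACGT"]:
    try:
        parse_start(bad_line)
    except IndexError as err:
        print(repr(bad_line), "->", type(err).__name__, err)
```

Under these assumptions, any line that does not start with a `name:start-end` token is enough to trigger the `IndexError` seen above.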

Below is the content of the "repeatmodel" directory:

total 19M
drwxrws--- 7 mesny grp_hacquard 4.0K Apr 28 18:00 RM_23461.ThuApr281622292022
-rwxrwx--- 1 mesny grp_hacquard 115K Apr 28 18:00 sequence_index-families.fa
-rwxrwx--- 1 mesny grp_hacquard 9.2M Apr 28 18:00 sequence_index-families.stk
-rwxrwx--- 1 mesny grp_hacquard 6.7K Apr 28 16:22 sequence_index.nhr
-rwxrwx--- 1 mesny grp_hacquard 2.3K Apr 28 16:22 sequence_index.nin
-rwxrwx--- 1 mesny grp_hacquard 1.5K Apr 28 16:22 sequence_index.nnd
-rwxrwx--- 1 mesny grp_hacquard   52 Apr 28 16:22 sequence_index.nni
-rwxrwx--- 1 mesny grp_hacquard  784 Apr 28 16:22 sequence_index.nog
-rwxrwx--- 1 mesny grp_hacquard 9.7M Apr 28 16:22 sequence_index.nsq
-rwxrwx--- 1 mesny grp_hacquard 1.9K Apr 28 16:22 sequence_index.translation

Please let me know if you can see what I am doing wrong.

Best wishes, Fantin

DerKevinRiehl commented 2 years ago

Dear Fantin, thank you very much for your interest in our software.

I suspect the RepeatModeler output files are malformed.

As you can see in the example file in the repo, RepeatModeler should write this file in Stockholm format: https://github.com/DerKevinRiehl/transposon_annotation_reasonaTE/blob/main/workspace/testProject/repeatmodel/sequence_index-families.stk

Could you please share the "-families.stk" file from one of your problematic genomes? I can then inspect it myself and write some code to fix the lines that are causing trouble.

Best regards, Kevin

fantin-mesny commented 2 years ago

Dear Kevin,

Please find attached a problematic repeatmodel output, causing the parsing error mentioned in my previous message.

Many thanks for your help!

Best wishes, Fantin

sequence_index-families.stk.gz

DerKevinRiehl commented 2 years ago

Dear Fantin, thanks for your answer!

I figured out the problem: for some reason, your RepeatModeler run produced a Stockholm file containing empty lines (e.g. line 4878 in the stk file you sent). Please remove the empty lines from your stk files with the small program described below. Hint: make a backup copy of your files before applying the script, just to be safe.
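To check whether a given stk file is affected before cleaning it, the empty lines can be located first. This is a minimal sketch; the function name `find_empty_lines` is made up for illustration, and the `.gz` handling is only there because the file in this thread was shared gzip-compressed.

```python
import gzip


def find_empty_lines(path):
    """Return the 1-based numbers of all lines that contain nothing but a newline."""
    # Open gzip-compressed files transparently, plain text files as usual.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [number
                for number, line in enumerate(handle, start=1)
                if line.rstrip("\n") == ""]
```

For example, `find_empty_lines("sequence_index-families.stk")` returns an empty list for a file without empty lines, and a list of line numbers otherwise.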

I wrote a small script that removes the empty lines from your stk files. You can run it like this: python corrector.py FROM_FILE.stk TO_FILE.stk

Please find my script corrector.py attached. corrector.zip

Otherwise (if you are experienced with Python), just save the following code to a file "corrector.py":

# Author: Kevin Riehl for Transposon Ultimate Problems with RepeatModeler Outputs C 2022

# This code loads annotation outputs from RepeatModeler in Stockholm format and erases empty lines,
# as these cause errors in the downstream pipeline of reasonaTE.

# Usage: python corrector.py FROM_FILE.stk TO_FILE.stk

import sys

# get arguments
arguments = sys.argv
if len(arguments) == 3:
    from_file = arguments[1]
    to_file = arguments[2]

    # read the input line by line and copy only the non-empty lines to the output
    # (from_file and to_file must be different files, otherwise the input is truncated)
    with open(from_file, "r") as f1, open(to_file, "w") as f2:
        for line in f1:
            if line.rstrip("\n") != "":
                f2.write(line)
else:
    print("ERROR! Expected exactly two arguments: from_file and to_file.")
    sys.exit(1)

Please let me know if this did the trick for you. Best regards, Kevin

fantin-mesny commented 2 years ago

Dear Kevin,

Many thanks for your help! Removing the empty lines in the Stockholm files fixed the parsing problem.
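For anyone with many affected genomes, the same cleanup can be applied in bulk. The sketch below assumes the folder layout of the example file linked earlier in this thread (`<workspace>/<project>/repeatmodel/sequence_index-families.stk`); the function names `clean_stk_in_place` and `clean_all` and the `.bak` suffix are made up for illustration and are not part of reasonaTE.

```python
import shutil
from pathlib import Path


def clean_stk_in_place(stk_path):
    """Back up stk_path to <name>.bak, then rewrite it without empty lines.

    Returns the number of removed lines. An existing .bak file is treated as
    the untouched original, so running the function twice is harmless.
    """
    stk_path = Path(stk_path)
    backup = stk_path.with_name(stk_path.name + ".bak")
    if not backup.exists():
        shutil.copy2(stk_path, backup)  # keep the original, as advised above
    removed = 0
    with open(backup, "r") as src, open(stk_path, "w") as dst:
        for line in src:
            if line.rstrip("\n") == "":
                removed += 1
            else:
                dst.write(line)
    return removed


def clean_all(workspace):
    """Clean every <project>/repeatmodel/sequence_index-families.stk in workspace."""
    pattern = "*/repeatmodel/sequence_index-families.stk"
    return {str(stk): clean_stk_in_place(stk)
            for stk in sorted(Path(workspace).glob(pattern))}
```

Calling `clean_all("workspace")` returns a mapping from each stk file to the number of empty lines removed, which makes it easy to see which genomes were affected before rerunning the parsing step.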

Maybe you should implement this script in the reasonaTE programme.

Best regards. Fantin

DerKevinRiehl commented 2 years ago

Dear Fantin, thank you very much for your feedback! I am happy we could help you with that issue.

We will consider including this in our next release. However, we are also wondering why RepeatModeler behaves differently here, as it shouldn't produce empty lines.

Best regards, Kevin Riehl