DerKevinRiehl / transposon_annotation_reasonaTE

Transposon annotation tool "reasonaTE" (part of TransposonUltimate)
GNU General Public License v3.0

Parsing of repeatModeler and RepeatMasker #13

Closed by mgrew 2 years ago

mgrew commented 2 years ago

Hi Kevin,

thanks for creating this amazing pipeline. The approach to combine differently sensitive annotation tools is much needed and the detailed documentation allows a quick implementation even for computational rookies like me.

All tools ran successfully (RepeatMasker and ltrPred externally). However, during the "parseAnnotation" step, the script gets stuck on the outputs of RepeatModeler and RepeatMasker and gives the following error messages:

Parse repeatModeler...
Traceback (most recent call last):
  File "/home/mgrewoldt/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/TransposonAnnotator.py", line 114, in <module>
    parseAvailableResults(projectFolderPath)
  File "/home/mgrewoldt/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/AnnotationParser.py", line 1346, in parseAvailableResults
    parseRepeatModeler(pathResDir, fastaFile, targetGFFFile, targetGFFrepe, targetFastaFile)
  File "/home/mgrewoldt/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/AnnotationParser.py", line 1243, in parseRepeatModeler
    start  = int(transposons[0].split(":")[1].split("-")[0])
IndexError: list index out of range
Parse RepeatMasker...
Traceback (most recent call last):
  File "/home/mgrewoldt/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/TransposonAnnotator.py", line 114, in <module>
    parseAvailableResults(projectFolderPath)
  File "/home/mgrewoldt/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/AnnotationParser.py", line 1353, in parseAvailableResults
    parseRepeatMasker(pathResDir, fastaFile, targetGFFFile, targetGFFrepe, targetFastaFile)
  File "/home/mgrewoldt/miniconda3/envs/transposon_annotation_tools_env/share/TransposonAnnotator_reasonaTE/AnnotationParser.py", line 905, in parseRepeatMasker
    chrom = transposons[4].replace(">","")
IndexError: list index out of range

I conclude that the structure of the corresponding output files is not as expected, though I can't find any difference in format or folder structure between the testProject data and mine (only the first few lines are shown):

RepeatModeler (sequence_index-families.stk):

# STOCKHOLM 1.0
#=GF ID    rnd-1_family-367
#=GF DE    RepeatModeler Generated - rnd-1_family-367, RepeatScout: [ Index = R=298, RS Size = 150, Refiner Input Size = 100, Final Multiple Alignment Size = 100 ]
#=GF TP    Interspersed_Repeat;Unknown
#=GF SQ    108
#=GC RF    xxxxx.xxxxxxxxxxx....xxxxx.xxx..x...xx.xx....x...x..........x.x...x...x..x.x...x.......xx.xx.x...x..x..x.x.x....x.x.x.xx..xx...xx..x..xx....xx...x.xxx.x...xx.xxx...xx.x.xx.xxxx.xxxxx..xx.xx.x..x...xx...x.x.....x....xxxx..........xx.xxxx.x......x.x.....x.xx....x...x...x...xx...xxx..xx......x.xx.xx..x........x..xxx.....x.xx..x....xx..x....x....x..x.x.xx.xxxxxxx..xxxx....xx...xxxxxxxxxxxxx
seq2:38152860-38153033    TAAGG.GGCCGTTCATA....AATTA.CGT..A...AC.GC....A...A..........G.A...G...G..G.G...G.......GA.GG.G...G..G..G.G.G....T.A.T.GA..GG...CA..A..GC....GT...T.ACG.G...TT.CTA...AC.A.AA.ATTG.GGTCA..AA.CT.T..G...AC...C.C.....T....TTTA..........GC.GTTA.C......A.G.....A.CCCG..G...G...G...GG...GGG..GG......T.CT.GA..A........A..AAT.....G.TT..G....AT..T....T....T..T.G.CG.TTACGTA..ATTT....AT...GAACGGCCCCTAA

RepeatMasker (sequence.fasta.out)

SW score  % div.  % del.  % ins.  query sequence  pos in query: begin  end  (left)  repeat  class/family  pos in repeat: begin  end  (left)  ID

14  16.9    3.0 0.0 II  13769   13801   (58329382)  GA-rich Low_complexity  1   34  (0) 1

I could of course skip these tools, but since they do a thorough TE annotation according to your paper, I'd prefer to include at least one of them.

Do you have any idea what the problem could be? Am I overlooking something?

Thanks in advance & best wishes, Malte

DerKevinRiehl commented 2 years ago

Dear Malte, first of all thank you very much for your interest in our software.

Concerning your problem with repeatModeler: The problem sounds familiar to me, as other users have reported it as well. You may find a solution in this thread: https://github.com/DerKevinRiehl/TransposonUltimate/issues/3#issuecomment-1115257052 Does this do the trick for you?

Concerning your problem with repeatMasker: Could you share more of the files in the folder produced by repeatMasker? I would like to dig deeper into it, as you are the first user to experience this issue.

Looking forward to your answer soon, Best regards, Kevin

mgrew commented 2 years ago

Hi Kevin,

thanks for your quick reply! Your solution regarding repeatModeler indeed did the trick. Thanks for pointing me towards that easy fix.

I ran repeatMasker on Galaxy (v4.0.9) with the Dfam.h5 library, which produced the desired output files:

sequence.fasta (normal fasta format)
sequence.fasta.masked (normal fasta format with masked Ns)
sequence.fasta.cat
sequence.fasta.out
sequence.fasta.tbl

Thanks for your help! Best wishes, Malte

mgrew commented 2 years ago

Hi again, I solved it. The .fasta.out file from the Galaxy version of repeatMasker was tab-delimited, while it had to be space-delimited for the parsing script to work. In retrospect, I could have spotted this earlier. Sorry and thanks again!

Best wishes, Malte
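
For readers who hit the same IndexError, the tab-versus-space fix described above can be sketched roughly as follows. This is only an illustration, not the pipeline's own code: the helper names `tabs_to_spaces` and `convert_out_file` are hypothetical, and it assumes the parser splits each line on spaces, which is consistent with `transposons[4]` failing on a tab-delimited row but is not confirmed here. The sample row is the one quoted in the report, with tabs restored.

```python
from pathlib import Path


def tabs_to_spaces(line: str) -> str:
    """Replace every tab with a single space; other characters are untouched."""
    return line.replace("\t", " ")


def convert_out_file(src: Path, dst: Path) -> None:
    """Write a space-delimited copy of a tab-delimited RepeatMasker .out file.

    Hypothetical helper: it only swaps the delimiter and does not rebuild
    the header layout of a natively produced .out file.
    """
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            fout.write(tabs_to_spaces(line))


# Data row from the report, tab-delimited as produced on Galaxy.
row = (
    "14\t16.9\t3.0\t0.0\tII\t13769\t13801\t(58329382)\t"
    "GA-rich\tLow_complexity\t1\t34\t(0)\t1"
)

# Splitting on spaces leaves the whole row as one field, so index 4 is missing.
fields_before = row.split(" ")
print(len(fields_before))

# After conversion, index 4 holds the query sequence name.
fields_after = tabs_to_spaces(row).split(" ")
print(fields_after[4])
```

To convert a file, one could call `convert_out_file(Path("sequence.fasta.out"), Path("sequence.fasta.spaces.out"))` and then put the converted file in place of the original in the project folder; the second filename is just a placeholder.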