Closed jacau closed 7 years ago
Good point! I tried to use a simple GFF or BED at very beginning. However, changed to use default result of Repeatmasker as it contains most comprehensive annotation of repeat annotation, particularly for model system with pretty good repeat libraries. For non-model organism where repeat annotation are generated by tools such as Repeatmodeler, Repeatcount, what I recommend is to create a repeat library with clearly defined TIRs for DNA transposon or LTRs for LTR retrotransposon. Then run Repeatmasker to annotation the reference genome using the custom repeat libraries. I have not tested in non-model organism yet.
In your case, if you feel confident of these accurate reference TE annotations in reference genome. You can extract these elements to create a repeat library to run Repeatmasker.
Thanks for the suggestion, I did that and it worked well.
I have another, unrelated question, what is the criteria for the output file "ALL.all_nonref_insert.high_conf.gff"?
ALL.all_nonref_insert.high_conf.gff are these calls supported by junction reads from both ends. Usually when the sequence coverage is high (~10X), the differences between high_conf and default results are very small. It is also depended on the quality of repeat library I guess.
How can I run RelocaTE2 to keep the intermediate files? I'm having trouble with steps downstream of the blat alignment, which is also the most time consuming but every run I lose all those blat output files.
You can use --step to control at which step you want to stop RelocaTE2. The default is to run all the steps (--step 1234567). You can run with "--step 123" to finish BLAT then run "--step 4567 --verbose 4" for downstream analysis. After each run of "--step 4567 --verbose 4" you can delete all the folder generated by "--step 4567 --verbose 4" and rerun "--step 4567 --verbose 4" if you want change something. "--verbose 4" is the option to keep intermediate files.
Can you describe what kind of trouble you have in the downstream steps? I can follow these issues in future development. Thanks.
Generally, the problems might be names of fastq files and read names in your data if you can run the test data. I used regular expression to match and extract informations from file name and sequence name.
It's not an issue with the fastq file names, the problem comes up in step 4, I get this error message in the log file:
Traceback (most recent call last):
File "/ohta1/jasmina.uzunovic/RelocaTE2/scripts/relocaTE_align.py", line 631, in
It is the limit of open files by Linux. Please try " ulimit -a" to check your default and use "ulimit -n 100000" to set a large value. Check this post for details: http://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit
Hi Jinfeng, I have RelocaTE2 running well now. I'm a little confused by the output files. What is the difference between these: ALL.all_nonref_insert.all.gff ALL.all_nonref_insert.gff ALL.all_nonref_insert.txt ALL.all_nonref.insert.raw.gff
Also, shouldn't ALL.all_nonref_insert.gff be the same as ALL.all_nonref_insert.txt? Because I'm getting different file lengths between the two.
Thanks!
Glad to hear it works for your data. That's great. I am using RelocaTE2 to do more non-model organisms too.
ALL.all_nonref_insert.gff is converted from ALL.all_nonref_insert.txt. But I did more filter on ALL.all_nonref_insert.gff. So these two files are expected to be different. ALL.all_nonref_insert.all.gff and ALL.all_nonref_insert.raw.gff should be same or almost same. We want to have all candidate calls in ALL.all_nonref_insert.raw.gff. ALL.all_nonref_insert.all.gff is more clean because some false calls are removed by overlapping with repeat element on reference genome. These steps are in function "Overlap_TE_boundary" at line 54 of clean_false_positive.py.
I would use either ALL.all_nonref_insert.gff and ALL.all_nonref_insert.high_conf.gff. If you want to do your own filter you could try ALL.all_nonref_insert.raw.gff or ALL.all_nonref_insert.all.gff.
Is it possible to input a more accurate reference TE annotation in place of RepeatMasker output? This would be a more simple format of element name, chromosome, start, end.