JinfengChen / RelocaTE2

RelocaTE2
MIT License
14 stars 7 forks source link

TE annotation #4

Closed jacau closed 7 years ago

jacau commented 8 years ago

Is it possible to input a more accurate reference TE annotation in place of RepeatMasker output? This would be a more simple format of element name, chromosome, start, end.

JinfengChen commented 8 years ago

Good point! I tried to use a simple GFF or BED at very beginning. However, changed to use default result of Repeatmasker as it contains most comprehensive annotation of repeat annotation, particularly for model system with pretty good repeat libraries. For non-model organism where repeat annotation are generated by tools such as Repeatmodeler, Repeatcount, what I recommend is to create a repeat library with clearly defined TIRs for DNA transposon or LTRs for LTR retrotransposon. Then run Repeatmasker to annotation the reference genome using the custom repeat libraries. I have not tested in non-model organism yet.

In your case, if you feel confident of these accurate reference TE annotations in reference genome. You can extract these elements to create a repeat library to run Repeatmasker.

jacau commented 8 years ago

Thanks for the suggestion, I did that and it worked well.

I have another, unrelated question, what is the criteria for the output file "ALL.all_nonref_insert.high_conf.gff"?

JinfengChen commented 8 years ago

ALL.all_nonref_insert.high_conf.gff are these calls supported by junction reads from both ends. Usually when the sequence coverage is high (~10X), the differences between high_conf and default results are very small. It is also depended on the quality of repeat library I guess.

jacau commented 8 years ago

How can I run RelocaTE2 to keep the intermediate files? I'm having trouble with steps downstream of the blat alignment, which is also the most time consuming but every run I lose all those blat output files.

JinfengChen commented 8 years ago

You can use --step to control at which step you want to stop RelocaTE2. The default is to run all the steps (--step 1234567). You can run with "--step 123" to finish BLAT then run "--step 4567 --verbose 4" for downstream analysis. After each run of "--step 4567 --verbose 4" you can delete all the folder generated by "--step 4567 --verbose 4" and rerun "--step 4567 --verbose 4" if you want change something. "--verbose 4" is the option to keep intermediate files.

Can you describe what kind of trouble you have in the downstream steps? I can follow these issues in future development. Thanks.

JinfengChen commented 8 years ago

Generally, the problems might be names of fastq files and read names in your data if you can run the test data. I used regular expression to match and extract informations from file name and sequence name.

  1. pair-end fastq files _1.fq and _2.fq _1.fq.gz and _2.fq.gz
  2. read name in fastq reads1: @NNN/1 or @NNN reads2: @NNN/2 or @NNN
jacau commented 8 years ago

It's not an issue with the fastq file names, the problem comes up in step 4, I get this error message in the log file:

Traceback (most recent call last): File "/ohta1/jasmina.uzunovic/RelocaTE2/scripts/relocaTE_align.py", line 631, in main() File "/ohta1/jasmina.uzunovic/RelocaTE2/scripts/relocaTE_align.py", line 624, in main map_reads_bwa(scripts, flanking_fq, path, genome_file, fastq_dir, target, bwa, samtools, seqtk, cpu, mate_file) File "/ohta1/jasmina.uzunovic/RelocaTE2/scripts/relocaTE_align.py", line 463, in map_reads_bwa info_chr_fh[unit[2]] = open('%s/flanking_seq/%s.flankingReads.unPaired.info' %(path, unit[2]), 'w') IOError: [Errno 24] Too many open files: '/ohta1/jasmina.uzunovic/183_20X_bed/repeat/flanking_seq/S223.flankingReads.unPaired.info'

JinfengChen commented 8 years ago

It is the limit of open files by Linux. Please try " ulimit -a" to check your default and use "ulimit -n 100000" to set a large value. Check this post for details: http://askubuntu.com/questions/181215/too-many-open-files-how-to-find-the-culprit

jacau commented 8 years ago

Hi Jinfeng, I have RelocaTE2 running well now. I'm a little confused by the output files. What is the difference between these: ALL.all_nonref_insert.all.gff ALL.all_nonref_insert.gff ALL.all_nonref_insert.txt ALL.all_nonref.insert.raw.gff

Also, shouldn't ALL.all_nonref_insert.gff be the same as ALL.all_nonref_insert.txt? Because I'm getting different file lengths between the two.

Thanks!

JinfengChen commented 8 years ago

Glad to hear it works for your data. That's great. I am using RelocaTE2 to do more non-model organisms too.

ALL.all_nonref_insert.gff is converted from ALL.all_nonref_insert.txt. But I did more filter on ALL.all_nonref_insert.gff. So these two files are expected to be different. ALL.all_nonref_insert.all.gff and ALL.all_nonref_insert.raw.gff should be same or almost same. We want to have all candidate calls in ALL.all_nonref_insert.raw.gff. ALL.all_nonref_insert.all.gff is more clean because some false calls are removed by overlapping with repeat element on reference genome. These steps are in function "Overlap_TE_boundary" at line 54 of clean_false_positive.py.

I would use either ALL.all_nonref_insert.gff and ALL.all_nonref_insert.high_conf.gff. If you want to do your own filter you could try ALL.all_nonref_insert.raw.gff or ALL.all_nonref_insert.all.gff.