bergmanlab / TELR

TELR is a fast non-reference transposable element detector from long read sequencing data.
https://github.com/bergmanlab/TELR
BSD 2-Clause "Simplified" License

Issue with Contig Assembly in TELR #37

Open prakashnarayanan98 opened 7 months ago

prakashnarayanan98 commented 7 months ago

Description:

Click to expand for Sample of processing error ``` Successfully created the directory /TELR/intermediate_files/vcf_ins_repeatmask RepeatMasker version open-4.0.7 Search Engine: NCBI/RMBLAST [ 2.6.0+ ] Rebuilding RepeatMaskerLib.embl library - Read in 216 sequences from /miniconda3/envs/TELR/share/RepeatMasker/Libraries/DfamConsensus.embl RepeatMaskerLib.embl: 216 total sequences. Master RepeatMasker Database: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 ) Custom Repeat Library: /TELR/intermediate_files/LIBRARY.fasta Warning...unknown stuff < > Building general libraries in: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/dc20170127/general analyzing file /TELR/intermediate_files/Read.vcf_ins.fasta identifying matches to LIBRARY.fasta sequences in batch 1 of 11 identifying matches to LIBRARY.fasta sequences in batch 2 of 11 identifying matches to LIBRARY.fasta sequences in batch 3 of 11 identifying matches to LIBRARY.fasta sequences in batch 4 of 11 identifying matches to LIBRARY.fasta sequences in batch 5 of 11 identifying matches to LIBRARY.fasta sequences in batch 6 of 11 identifying matches to LIBRARY.fasta sequences in batch 7 of 11 identifying matches to LIBRARY.fasta sequences in batch 8 of 11 identifying matches to LIBRARY.fasta sequences in batch 9 of 11 identifying matches to LIBRARY.fasta sequences in batch 10 of 11 identifying matches to LIBRARY.fasta sequences in batch 11 of 11 processing output: cycle 1 . cycle 2 . Generating output... . 
masking done Successfully created the directory /TELR/intermediate_files/sv_reads Successfully created the directory /TELR/intermediate_files/contig_assembly assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed assembly failed Successfully created the directory /TELR/intermediate_files/vcf_seq2contig Use repeatmasker to annotate contig TE families instead of minimap2 Successfully created the directory /TELR/intermediate_files/contig_te_repeatmask RepeatMasker version open-4.0.7 Search Engine: NCBI/RMBLAST [ 2.6.0+ ] Master RepeatMasker Database: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 ) Custom Repeat Library: /TELR/intermediate_files/LIBRARY.fasta Warning...unknown stuff < > analyzing file /TELR/intermediate_files/Read.fa identifying matches to LIBRARY.fasta sequences in batch 1 of 10 identifying matches to LIBRARY.fasta sequences in batch 2 of 10 identifying matches to LIBRARY.fasta sequences in batch 3 of 10 identifying matches to LIBRARY.fasta sequences in batch 4 of 10 identifying matches to LIBRARY.fasta sequences in batch 5 of 10 identifying matches to LIBRARY.fasta sequences in batch 6 of 10 identifying matches to LIBRARY.fasta sequences in batch 7 of 10 identifying matches to LIBRARY.fasta sequences in batch 8 of 10 identifying matches to LIBRARY.fasta sequences in batch 9 of 10 identifying matches to LIBRARY.fasta sequences in batch 10 of 10 processing output: cycle 1 . cycle 2 . Generating output... . 
masking done Done Successfully created the directory /TELR/intermediate_files/telr_reads Scf_2L_22107544_22107544 no assembly Scf_2L_22734202_22734202 no assembly Scf_2R_380878_380881 no assembly Scf_2R_2670123_2670123 no assembly Scf_3L_23424019_23424021 no assembly Scf_NODE_103476_626_627 no assembly Scf_NODE_105063_6969_6970 no assembly Scf_NODE_11571_24517_24519 no assembly Scf_NODE_12809_1023_1023 no assembly Scf_NODE_18214_489_489 no assembly Scf_NODE_24465_1162_1163 no assembly Scf_NODE_26715_949_952 no assembly Scf_NODE_3168_468_468 no assembly Scf_NODE_36936_1052_1057 no assembly Scf_NODE_37551_815_815 no assembly Scf_NODE_39506_601_603 no assembly Scf_NODE_46678_5042_5042 no assembly Scf_NODE_5267_896_897 no assembly Scf_NODE_59901_2861_2861 no assembly Scf_NODE_60709_627_628 no assembly Scf_NODE_68951_87_88 no assembly Scf_NODE_69473_1091_1091 no assembly Scf_NODE_72290_1975_1976 no assembly Scf_NODE_76112_1320_1320 no assembly Scf_NODE_98642_1306_1307 no assembly Successfully created the directory /TELR/intermediate_files/ref_repeatmask RepeatMasker version open-4.0.7 Search Engine: NCBI/RMBLAST [ 2.6.0+ ] Master RepeatMasker Database: /miniconda3/envs/TELR/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 ) Custom Repeat Library: /TELR/intermediate_files/LIBRARY.fasta Warning...unknown stuff < > ```
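To see which loci are affected, the `no assembly` entries in a captured log like the one above can be pulled out with a few lines of Python. This is only a sketch: it assumes the identifiers have the form `<contig>_<start>_<end>` (inferred from the entries shown, not from TELR documentation), and `failed_loci` and `count_failed_assemblies` are hypothetical helper names.

```python
import re

# Assumption: locus IDs look like "<contig>_<start>_<end>", as in
# "Scf_2L_22107544_22107544 no assembly". This is inferred from the log only.
LOCUS_RE = re.compile(r"(\S+_\d+_\d+) no assembly")


def failed_loci(log_text):
    """Return (contig, start, end) for every '<id> no assembly' entry."""
    loci = []
    for name in LOCUS_RE.findall(log_text):
        # Split from the right so contig names containing "_" stay intact.
        contig, start, end = name.rsplit("_", 2)
        loci.append((contig, int(start), int(end)))
    return loci


def count_failed_assemblies(log_text):
    """Count how many times the assembler reported 'assembly failed'."""
    return log_text.count("assembly failed")
```

Comparing the two counts gives a quick check on whether every failed assembly corresponds to a locus later reported as `no assembly`.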

Environment:

Issue: Contig assembly in TELR fails repeatedly, so no assembly is produced for a number of candidate loci.

Observed Behavior:

Logs:

11/29/2023 06:05:31: INFO: Parsing input files...
11/29/2023 06:05:31: INFO: Raw reads are provided
11/29/2023 06:05:31: INFO: Start alignment...
11/29/2023 22:24:36: INFO: Sort and index BAM...
11/29/2023 22:41:39: INFO: First alignment finished in 16 hours 36 minutes 8 seconds
11/29/2023 22:41:39: INFO: Detecting SVs from BAM file...
11/29/2023 23:00:59: INFO: SV detection finished in 19 minutes 19 seconds
11/29/2023 23:00:59: INFO: Parse structural variant VCF...
11/29/2023 23:02:33: INFO: Perform local assembly of non-reference TE loci...
11/30/2023 00:58:32: INFO: Local assembly finished in 1 hours 55 minutes 58 seconds
11/30/2023 00:58:32: INFO: Annotate contigs...
11/30/2023 01:02:07: INFO: Estimating allele frequency...
11/30/2023 01:02:46: INFO: Perform local realignment...
11/30/2023 01:12:33: INFO: Local realignment finished in 9 minutes 46 seconds
11/30/2023 01:13:58: INFO: Allele frequency estimation finished in 11 minutes 11 seconds
11/30/2023 03:25:49: INFO: Map contigs to reference...
11/30/2023 04:10:01: INFO: Write output...
11/30/2023 04:10:09: INFO: TELR finished in 22 hours 4 minutes 37 seconds
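To see where the run time goes, the gaps between consecutive log entries can be computed from the timestamps. The sketch below assumes the `MM/DD/YYYY HH:MM:SS: INFO: message` layout shown in the log above; `step_gaps` is a hypothetical helper, not part of TELR.

```python
import re
from datetime import datetime

# Assumption: each line looks like "11/29/2023 06:05:31: INFO: Start alignment..."
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}): INFO: (.*)$")


def step_gaps(log_lines):
    """Return (message, time until the next log entry) for each parsed line but the last."""
    events = []
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
            events.append((stamp, match.group(2)))
    # Pair each entry with its successor and subtract the timestamps.
    return [(msg, nxt - stamp) for (stamp, msg), (nxt, _) in zip(events, events[1:])]
```

Each gap is the wall-clock time between one message and the next, so it approximates a step's duration only when the steps run strictly one after another.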

Additional Information:

Notes:

This issue is hindering the progress of the project. Any assistance or guidance in resolving this matter would be greatly appreciated.

shunhuahan commented 7 months ago

Hi @prakashnarayanan98,

Thanks for reporting the error. A few things:

  1. Can you describe the input FASTQ data, the library, and the reference genome? We have tested TELR on *Drosophila melanogaster* datasets but not on other species.
  2. The "assembly failed" message means the assembler was unable to produce contigs. Based on your command line, I think you are using the default wtdbg2 assembler. Can you try running `telr -i read.fastq -l library.fasta -r reference.fasta --assembler flye --polisher flye` and see whether switching to flye for assembly and polishing helps?
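For scripted reruns, the suggested command can be assembled as an argument list. This is a minimal sketch: `build_telr_command` is a hypothetical helper, the file names are the placeholders used in this thread, and only flags mentioned in this thread are used.

```python
import shlex


def build_telr_command(reads, library, reference,
                       assembler="flye", polisher="flye", keep_files=False):
    """Build the telr argument list; only flags quoted in this thread are used."""
    cmd = [
        "telr",
        "-i", reads,
        "-l", library,
        "-r", reference,
        "--assembler", assembler,
        "--polisher", polisher,
    ]
    if keep_files:
        # Keeps intermediate files so failed loci can be inspected afterwards.
        cmd.append("--keep_files")
    return cmd


if __name__ == "__main__":
    # Placeholder file names; print the command instead of running it.
    print(shlex.join(build_telr_command("read.fastq", "library.fasta", "reference.fasta")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once `telr` is on the `PATH` of the active environment.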

Thanks, Shunhua

prakashnarayanan98 commented 7 months ago

FASTQ Data:

Library: chakraborty_simulans_TE

shunhuahan commented 6 months ago

Thanks @prakashnarayanan98 for providing this info.

We haven't yet tested TELR extensively on *D. simulans* data, so there is no guarantee that the entire workflow will be issue-free for this species. Did assembly succeed for most insertion candidates? If assembly failed for only a small subset of the non-reference TE insertions, you can potentially look into rescuing those assemblies ad hoc. Below are files you can use for this purpose.

If you provide `--keep_files` when running TELR, all intermediate files will be kept under `<output_dir>/intermediate_files`.
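Once the intermediate files are kept, a quick inventory shows which steps left empty outputs. The sketch below only assumes the `<output_dir>/intermediate_files` location given above; the subdirectory names (such as `contig_assembly`) come from the log earlier in this thread, the file layout inside them is not assumed, and `summarize_intermediate_files` is a hypothetical helper.

```python
from pathlib import Path


def summarize_intermediate_files(output_dir):
    """Map each subdirectory of intermediate_files to (file count, empty file count)."""
    root = Path(output_dir) / "intermediate_files"
    summary = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        # Walk the subdirectory recursively; no file naming scheme is assumed.
        files = [f for f in sub.rglob("*") if f.is_file()]
        empty = [f for f in files if f.stat().st_size == 0]
        summary[sub.name] = (len(files), len(empty))
    return summary
```

A large share of empty files under one subdirectory is a hint about which step to examine first, though an empty file is not necessarily an error.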