@jsitarka Thanks for your interest in testing ngs_te_mapper2 on the Arabidopsis genome. We haven't yet tested our program on an Arabidopsis dataset, but we are happy to get you feedback on whether ngs_te_mapper2 can perform reasonably well on your dataset.
To fully reproduce your zero prediction results, could you send me your transformed TE library so we can start a replication analysis and figure out what's causing the issue? You can send it over to shhan@uga.edu.
Also, you might want to try out McClintock, an easy-to-use meta-pipeline for identifying TE insertions using multiple detection methods (including ngs_te_mapper2). You can find details and instructions at https://github.com/bergmanlab/mcclintock. Our group currently maintains this software, so feel free to ask McClintock-related questions on its issue page and we will get back to you.
Sorry for the late reply! I’ve had a chance to take a look at your input files and finish a test run. I was able to reproduce the zero-prediction issue you reported.
In my testing, I ran the latest ngs_te_mapper2 using one TE family sequence from the TAIR10 database as the library file. I was able to get non-zero predictions in the final output.
Let me know if that makes sense to you and if you have other questions.
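For anyone who wants to repeat this kind of single-family test, here is a minimal sketch of pulling one record out of a multi-FASTA TE library. It is not the script used in the test above: the function name `extract_record` and the IDs `familyA` and `familyB` are hypothetical placeholders, and it assumes the record ID is the first whitespace-separated word of the header line.

```python
def extract_record(lines, wanted_id):
    """Yield the lines of the FASTA record whose ID equals wanted_id.

    Assumption: the ID is the first whitespace-separated word after '>'.
    """
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            # Start keeping lines only when this header matches the wanted ID.
            keep = bool(fields) and fields[0] == wanted_id
        if keep:
            yield line


# Small in-memory demo with placeholder IDs (not real TAIR10 family names).
library = [">familyA", "ACGTACGT", "GGCC", ">familyB some description", "TTTT"]
single_family = list(extract_record(library, "familyB"))
print(single_family)
```

The resulting lines can be written to a new FASTA file and passed to ngs_te_mapper2 through the same `-l` option used in the original command.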
Hi team,
I am using your program for my master's thesis on the Arabidopsis genome, and both the .ref.bed and nonref.bed files always come out empty. Is there a required format for the short reads to get a valid result?
I ran the command:
conda activate ngs_te_mapper2
python3 /path/to/source/code/ngs_te_mapper2/sourceCode/ngs_te_mapper2.py -o output -f SRR8397878_1.fastq -r TAIR10_chr_all.fasta -l RepeatMaskerPlants_nospace.fasta
Reads were downloaded from: https://www.ebi.ac.uk/ena/browser/view/SRR8397878
Reference fasta was downloaded from: https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release%2FTAIR10_chromosome_files
I used the library that RepeatMasker uses, but I had to remove spaces and special characters like \,? etc. from the ID descriptions, because they caused problems when the intermediate directories were created. For example, for the line ">Gypsy-17 LBS-I LTR Gypsy" the corresponding folder name was only "Gypsy-17", but the program expected the full name.
The library has a structure like this:
Message from log file:
04/23/2021 23:31:16: INFO: CMD: ../ngs_te_mapper2/sourceCode/ngs_te_mapper2.py -o output -f SRR8397878_1.fastq -r TAIR10_chr_all.fasta -l RepeatMaskerPlants_nospace.fasta
04/23/2021 23:31:16: INFO: Parsing input files...
04/24/2021 09:52:42: INFO: Start alignment...
04/24/2021 11:49:47: INFO: Alignment finished in 1 hours 57 minutes 4 seconds
04/24/2021 11:49:47: INFO: Detecting insertions...
04/24/2021 14:42:50: INFO: Insertion candidate search finished in 2 hours 53 minutes 2 seconds
04/24/2021 14:43:07: INFO: ngs_te_mapper finished in 15 hours 11 minutes 49 seconds
04/24/2021 14:43:07: INFO: Number of reference TEs: 0
04/24/2021 14:43:07: INFO: Number of non-reference TEs: 0
Am I doing something wrong? Thanks a lot for your advice :)
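The header cleanup described in the post can be sketched roughly as below. This is only an illustration of the idea, not the exact procedure that produced RepeatMaskerPlants_nospace.fasta: the function names are hypothetical, and replacing each run of disallowed characters with a single underscore (rather than deleting it) is an assumption.

```python
import re


def sanitize_fasta_header(line: str) -> str:
    """Return a FASTA header that keeps only letters, digits, '_', '.' and '-'."""
    name = line[1:].strip()
    # Assumption: each run of other characters (spaces, '\', ',', '?', ...)
    # becomes one underscore, so the whole description stays part of the ID.
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", name).strip("_")
    return ">" + name


def sanitize_fasta(lines):
    """Yield FASTA lines with cleaned headers; fail if two IDs collide."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = sanitize_fasta_header(line)
            if header in seen:
                raise ValueError(f"duplicate ID after cleanup: {header}")
            seen.add(header)
            yield header
        else:
            # Sequence lines are passed through unchanged.
            yield line


# In-memory demo using the header quoted in the post.
example = [">Gypsy-17 LBS-I LTR Gypsy", "ACGT"]
cleaned = list(sanitize_fasta(example))
print(cleaned)  # ['>Gypsy-17_LBS-I_LTR_Gypsy', 'ACGT']
```

The duplicate check is there because two different descriptions can collapse to the same cleaned ID, which would again lead to clashing directory names.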