Empty results - Githubissues

jsitarka commented 3 years ago

Hi team,

I am using your program for my master's thesis on the Arabidopsis genome, and both the .ref.bed and .nonref.bed output files are always empty. Do the short reads need to be in a particular format to get valid results?

I ran these commands:

    conda activate ngs_te_mapper2
    python3 /path/to/source/code/ngs_te_mapper2/sourceCode/ngs_te_mapper2.py -o output -f SRR8397878_1.fastq -r TAIR10_chr_all.fasta -l RepeatMaskerPlants_nospace.fasta

Reads were downloaded from: https://www.ebi.ac.uk/ena/browser/view/SRR8397878

Reference fasta was downloaded from: https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release%2FTAIR10_chromosome_files

I used the library used by RepeatMasker, but I had to remove spaces and special characters (such as \,? etc.) from the sequence ID lines because they caused problems when the intermediate directories were created. For example, for the header ">Gypsy-17 LBS-I LTR Gypsy" the corresponding folder was named only "Gypsy-17", but the program expected the full name.

The library has a structure like this:

    >Gypsy-17_LBS-I_LTR_Gypsy
    aaggtggacactgtgggaaccaacagcaacctggccggcgtaacagcaga
    ...
    >BEL-154_AA-LTR_LTR_Pao
    tgtctacgaccaacaaaacctacttatccctcattactctactggtgcaa
    ...
    >Copia-1_DYa-I_LTR_Copia
    ataggttatgggcccaggagtagtaaagactttaataattgtgtgtgatc
    ...
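The header cleanup described above can be sketched in Python. This is only an illustration, not part of ngs_te_mapper2: the function names `sanitize_header` and `sanitize_fasta` and the set of characters kept (letters, digits, `_`, `.`, `-`) are assumptions chosen to reproduce the renamed IDs shown above.

```python
import re


def sanitize_header(header: str) -> str:
    """Turn a FASTA header into a single token that is safe as a directory name."""
    name = header.lstrip(">").strip()
    # Replace every run of characters outside [A-Za-z0-9_.-] with one underscore.
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
    return ">" + name.strip("_")


def sanitize_fasta(lines):
    """Yield FASTA lines with sanitized headers; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield sanitize_header(line) if line.startswith(">") else line


# Toy record built from the header quoted above and a shortened sequence.
example = [">Gypsy-17 LBS-I LTR Gypsy", "aaggtggacactgtgggaacc"]
print("\n".join(sanitize_fasta(example)))
```

Two different headers can collapse to the same ID after this kind of cleanup, so it is worth checking the sanitized names for duplicates before using the library.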

Message from log file:

    04/23/2021 23:31:16: INFO: CMD: ../ngs_te_mapper2/sourceCode/ngs_te_mapper2.py -o output -f SRR8397878_1.fastq -r TAIR10_chr_all.fasta -l RepeatMaskerPlants_nospace.fasta
    04/23/2021 23:31:16: INFO: Parsing input files...
    04/24/2021 09:52:42: INFO: Start alignment...
    04/24/2021 11:49:47: INFO: Alignment finished in 1 hours 57 minutes 4 seconds
    04/24/2021 11:49:47: INFO: Detecting insertions...
    04/24/2021 14:42:50: INFO: Insertion candidate search finished in 2 hours 53 minutes 2 seconds
    04/24/2021 14:43:07: INFO: ngs_te_mapper finished in 15 hours 11 minutes 49 seconds
    04/24/2021 14:43:07: INFO: Number of reference TEs: 0
    04/24/2021 14:43:07: INFO: Number of non-reference TEs: 0

Am I doing something wrong? Thanks a lot for your advice :)

shunhuahan commented 3 years ago

@jsitarka Thanks for your interest in testing ngs_te_mapper2 on the Arabidopsis genome. We haven't yet tested our program on an Arabidopsis dataset, but we would be happy to get your feedback on whether ngs_te_mapper2 performs reasonably well on your data.
To fully reproduce your zero prediction results, could you send me your transformed TE library so we can start a replication analysis and figure out what's causing the issue? You can send it over to shhan@uga.edu.
Also, you might want to try out McClintock, which is an easy to use meta-pipeline for identifying TE insertions using multiple detection methods (including ngs_te_mapper2). You could read details and follow instructions on this page https://github.com/bergmanlab/mcclintock. Our group is currently maintaining this software so feel free to ask questions related to McClintock under its issue page and we will get back to you.

Best, Shunhua

shunhuahan commented 3 years ago

Hi @jsitarka,

Sorry for the late reply! I've had a chance to look at your input files and finished a test run. I was able to reproduce the zero-prediction issue you reported.
There are two issues contributing to the zero-prediction results. One is a bug in the source code that causes problems when the contig names in the reference genome don't contain the chr substring. I fixed the bug in the latest commit (https://github.com/bergmanlab/ngs_te_mapper2/commit/f7281d3261cabe871fea4ad05d305391f5abd2cc).
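If you are on an older version, a quick way to see whether a reference genome is affected is to list the contig names that lack the chr substring. The sketch below is a standalone check, not code from ngs_te_mapper2. The helper names are hypothetical, and the match is case-sensitive by assumption, since the exact comparison used in the tool is not described here.

```python
def contig_names(fasta_lines):
    """Return the first whitespace-delimited token of every FASTA header."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            names.append(tokens[0] if tokens else "")
    return names


def names_without_substring(names, substring="chr"):
    """Return the names that do not contain `substring` (case-sensitive)."""
    return [name for name in names if substring not in name]


# Toy headers for illustration only; real reference headers may differ.
toy_reference = [">chr2L some description", "ACGT", ">1 CHROMOSOME", "TTGA"]
print(names_without_substring(contig_names(toy_reference)))
```

On a real file, pass an open file handle instead of the toy list, for example `contig_names(open("TAIR10_chr_all.fasta"))`.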
The other issue is with your TE library. ngs_te_mapper2, like most other TE detection methods, expects TE consensus sequences (separated by family). I believe one of the TE libraries you sent me is from https://www.arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FTAIR10_genome_release%2FTAIR10_transposable_elements, which includes the sequences of all TE insertions in the reference genome. As a result, most TE families are overrepresented in the library FASTA file, which is not recommended and might cause issues.
In my testing, I ran the latest ngs_te_mapper2 using one TE family sequence from the TAIR10 database as the library file, and I was able to get non-zero predictions in the final output.
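To check a library for the overrepresentation described above, you can count how many sequences each family contributes. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and because header layouts differ between sources, the caller supplies `family_of`, a function that extracts the family name from a header. The `<copy id>|<family>` headers in the toy data are invented for the example.

```python
from collections import Counter


def family_counts(fasta_lines, family_of):
    """Count FASTA records per family; `family_of` maps a header to a family name."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            counts[family_of(line[1:].strip())] += 1
    return counts


def overrepresented(counts, max_per_family=1):
    """Return the families that have more than `max_per_family` sequences."""
    return {family: n for family, n in counts.items() if n > max_per_family}


# Hypothetical headers of the form "<copy id>|<family>".
toy_library = [">copy1|FAM_A", "acgt", ">copy2|FAM_A", "ttga", ">copy3|FAM_B", "ggcc"]
counts = family_counts(toy_library, family_of=lambda header: header.split("|")[1])
print(overrepresented(counts))
```

A library of consensus sequences should report no overrepresented families with the default threshold of one sequence per family.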
Let me know if that makes sense to you and if you have other questions.

Shunhua

shunhuahan commented 3 years ago

To clarify what the input TE library file should include, I updated the ngs_te_mapper2 README in https://github.com/bergmanlab/ngs_te_mapper2/commit/d6ed0941cb847771ba5fa17e82f2ef5b54351bd8.
Closing this issue for now. @jsitarka Feel free to re-open this issue if the latest update doesn't work for you.

bergmanlab / ngs_te_mapper2

Empty results #2