bergmanlab / ngs_te_mapper2

Software for detecting transposable element insertions from next-generation sequencing data
BSD 2-Clause "Simplified" License

No reference TE insertion depending on format of fasta header #4

Closed afeurtey closed 2 years ago

afeurtey commented 3 years ago

Many thanks for this nice pipeline!

I am currently testing ngs_te_mapper2 on my organism of interest. After quite some debugging, it seems that the format of the consensus TE fasta headers can affect the results of the pipeline, which I assume is not intentional! When I add a "#XXX" at the end of the headers (a modification I made after looking at the provided example), the number of detected reference insertions is 0. As an example, the header ">DHX-incomp_B-G243-Map3_reversed" would be converted into ">DHX-incomp_B-G243-Map3_reversed#DHX". This was true for two different TE annotations, in which the headers have different formats. This bug seems to affect only the reference TE insertions, not the non-reference TE insertions.

I also wonder if there would be a way to add a warning when RepeatMasker fails to produce a masked genome. I had such an issue initially (fixed by changing the genome file name to remove special characters), but the pipeline seemingly ran without issues despite this. It is still possible to spot the problem by adding the --keep_files option. However, since there is currently no description (or at least none that I found) of the files that should be produced in the intermediate steps, it is a bit difficult to identify a missing file! An alternative to creating warnings (difficult for each special case!) could be a list of the relevant files produced by the pipeline?

Best, Alice

shunhuahan commented 3 years ago

Hi @afeurtey,

Thanks for the detailed and informative bug report!

Regarding your first issue: yes, this is an oversight on our end. I just implemented an update in the source code to remove substrings starting with # from the headers of the input TE library file. Let me know if the latest update still doesn't work for you.
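
For anyone who wants to clean a TE library by hand before running the pipeline, the header cleanup described above can be sketched roughly as follows. This is only an illustration, not the actual ngs_te_mapper2 code; the function names `strip_header_suffix` and `clean_te_library` are hypothetical.

```python
def strip_header_suffix(header: str) -> str:
    """Drop everything from the first '#' onward in a FASTA header line."""
    return header.split("#", 1)[0].rstrip()


def clean_te_library(lines):
    """Yield FASTA lines, removing '#...' suffixes from header lines only."""
    for line in lines:
        line = line.rstrip("\n")
        # Only lines starting with '>' are headers; sequence lines pass through.
        yield strip_header_suffix(line) if line.startswith(">") else line


# Hypothetical two-line library using the header from the report above.
example = [">DHX-incomp_B-G243-Map3_reversed#DHX\n", "ACGTACGT\n"]
for cleaned in clean_te_library(example):
    print(cleaned)
```

Headers without a # are returned unchanged, so running this on an already clean library should be harmless.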

About the second issue, I realized that a + symbol in the input sequence file name causes RepeatMasker to generate no masked genome without throwing any error message. I just made an update to address this issue. Otherwise, if no reference TE is actually present in the reference genome, you should find a "No repetitive sequences detected" message in the standard output, and ngs_te_mapper2 will use the original reference genome for subsequent processing in this case.
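
As a rough illustration of the kind of check discussed here (not the update that was actually made), one could sanitize the file name up front and verify afterwards that a masked genome exists. The helper names `safe_name` and `masked_output_exists` are hypothetical, and the sketch assumes RepeatMasker writes its output as `<input>.masked` next to the input file, which is its default when no output directory is given.

```python
import re
from pathlib import Path


def safe_name(filename: str) -> str:
    """Replace characters other than letters, digits, '.', '_' and '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", filename)


def masked_output_exists(genome: Path) -> bool:
    """Return True if '<genome>.masked' exists alongside the input genome.

    Assumption: RepeatMasker was run without an explicit output directory.
    """
    masked = genome.with_name(genome.name + ".masked")
    return masked.is_file()


# Hypothetical file name containing the problematic '+' symbol.
print(safe_name("my+genome.fasta"))

genome = Path(safe_name("my+genome.fasta"))
if not masked_output_exists(genome):
    print(f"warning: no masked genome found for {genome}")
```

A missing `.masked` file is not always an error, since it is also expected when no repetitive sequences are detected, so a warning seems more appropriate than aborting.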

I also encourage you to try out https://github.com/bergmanlab/mcclintock, which includes ngs_te_mapper2 and many other TE detection methods. The McClintock platform includes a series of preprocessing steps to ensure that most file-format-related issues are taken care of before running the detection methods.

Please let me know if you still have questions.

Best, Shunhua