Won't identify TEs that are created with EDTA for non-model organism

W-L / deviaTE

Python tool for the analysis and visualization of mobile genetic elements

GNU General Public License v3.0

Won't identify TEs that are created with EDTA for non-model organism #10

Open cahende opened 2 years ago

cahende commented 2 years ago

Hello,

I am trying to run this for a set of raw sequences for Anopheles gambiae. I used EDTA to create a TE library from the agamP4 genome assembly and then used my raw sequences as input for this pipeline to identify which TEs are present in which samples we have. Following trimming/mapping, the pipeline attempts to identify TEs but I get the following error for every TE identified by EDTA.

Starting analysis of [TE] in [RAW DATA]-final.fastq.fused.sort.bam..

No annotaions found for: [TE]

Traceback (most recent call last):
  File "/home/ch943/bin/miniconda/envs/deviaTE_env/bin/deviaTE_analyse", line 100, in <module>
    sample.write_frame(out=args.output + '.raw', insertions=ihat, command=comm, t=timestamp, norm='raw')
  File "/home/ch943/bin/miniconda/envs/deviaTE_env/lib/python3.6/site-packages/deviaTE/deviaTE_pileup.py", line 204, in write_frame
    with open(out, 'w') as outfile:
FileNotFoundError: [Errno 2] No such file or directory: '[RAW DATA]-final.fastq.[TE].raw'

Any guidance would be appreciated.

W-L commented 2 years ago

Hi! Thanks for reporting this. Looks like the code has some trouble writing the results to a file. My first guesses would be:

the actual string of [RAW DATA] or [TE] contains some symbol that turns it into an invalid filepath, e.g. / or a space? Seems odd though if this happens for all TEs
Permissions of the directory that it tries to write to could be another issue, but then I would expect a different Error.

Would you mind sharing the command used to run deviaTE? And maybe double-check that the library of TE sequences is a valid fasta file? cheers
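
To illustrate the first guess: a "/" inside a name is read by `open()` as a directory separator, so the write fails when that directory does not exist. The sketch below only mimics the output-path pattern visible in the traceback; `try_write` and the example TE names are hypothetical and not part of deviaTE.

```python
import os
import tempfile

def try_write(directory, sample, te_name):
    """Try to create '<sample>.<te_name>.raw' and report what happened."""
    # Hypothetical helper: the path pattern is copied from the traceback,
    # the rest is an assumption for illustration.
    out = os.path.join(directory, f"{sample}.{te_name}.raw")
    try:
        with open(out, "w") as outfile:
            outfile.write("test\n")
        return "ok"
    except FileNotFoundError:
        return "FileNotFoundError"

with tempfile.TemporaryDirectory() as tmp:
    # A name without a slash can be written as a plain file.
    print(try_write(tmp, "sample.fastq", "TE_example#LTR-unknown"))
    # A slash makes 'sample.fastq.TE_example#LTR' look like a missing directory.
    print(try_write(tmp, "sample.fastq", "TE_example#LTR/Gypsy"))
```

If every name in the library follows a "class/family" pattern, this would explain why the error shows up for all TEs rather than a few.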

cahende commented 2 years ago

Hi,

So the TE names that EDTA output actually had a "/" in all the names, so I think that is the issue. I corrected this in my reference library and am rerunning now, I will let you know if this issue persists.

Thanks! Cory
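
For anyone hitting the same problem, a correction like the one described above could be scripted roughly as follows. This is a minimal sketch, not the exact steps used here: the function names and the set of characters treated as safe are assumptions, and "#" is kept because names containing it ran fine later in this thread.

```python
import re

# Characters kept as-is in a sequence name; everything else becomes a dash.
# This whitelist is an assumption for illustration.
UNSAFE = re.compile(r"[^A-Za-z0-9_.#-]")

def sanitize_header(line):
    """Replace unsafe characters in the name part of a FASTA header line."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    name, sep, description = line[1:].rstrip("\n").partition(" ")
    return ">" + UNSAFE.sub("-", name) + sep + description + "\n"

def sanitize_fasta(in_path, out_path):
    """Copy a FASTA file, rewriting only the header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(sanitize_header(line))

print(sanitize_header(">TE_example#LTR/Gypsy\n"), end="")
```

After renaming, it is worth checking that no two sequences ended up with the same name, since distinct names can collapse to one once characters are replaced.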

cahende commented 2 years ago

Hi,

The naming convention was the issue; it seems to be running fine now. On a side note, I am scanning for the presence of a large list of transposable elements, and many don't have any reads mapping. Is there any way to prevent output from being produced when there are no reads mapping to a particular element?

Thank you, Cory

cahende commented 2 years ago

This is the output I get for each transposable element in my test. Does this mean there are no reads mapping to this particular TE?

**** Analysis

Starting analysis of TE_00000718_INT#LTR-unknown in SRR10235406-final.fastq.fused.sort.bam..

No annotaions found for: TE_00000718_INT#LTR-unknown

Normalization: none (values are raw abundances)

Analysis completed - output written to: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown

**** Visualization

Loading data: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown

Visualization written to: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown.pdf

W-L commented 2 years ago

Hi! Glad that your original issue was solved. To be fair, deviaTE should probably check for such situations itself. I'll implement a fix for that. Concerning your second question: If there are no reads mapping to a TE reference, then deviaTE should give a message like this:

...
******************** Analysis
Starting analysis of [TE] in [BAM-FILE]..

No reads mapped to the specified reference sequence
...

The program should then exit without producing any output. Hope this helps! Lukas
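
When scanning a large list of TEs, one option is to decide up front which references have any reads before launching an analysis per TE. The sketch below assumes the per-reference counts were produced separately with `samtools idxstats` on the indexed BAM file (tab-separated columns: reference name, length, mapped reads, unmapped reads); the function name and the example data are hypothetical.

```python
def references_with_reads(idxstats_text, min_reads=1):
    """Return reference names that have at least min_reads mapped reads."""
    kept = []
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        # The '*' row collects reads that are not assigned to any reference.
        if name != "*" and int(mapped) >= min_reads:
            kept.append(name)
    return kept

# Made-up example in idxstats layout, for illustration only.
example = "TE_A\t5000\t120\t0\nTE_B\t3200\t0\t0\n*\t0\t0\t42\n"
print(references_with_reads(example))
```

The resulting list could then be used to restrict which TE names are passed on for analysis.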

W-L commented 2 years ago

I added a check to replace invalid characters in TE names, which should prevent the original error (https://github.com/W-L/deviaTE/commit/10d2b7063b2fef7fcaa24b0a45fa655a0c4d7565). I'm not going to make a new release of the package at this point. But if you would like to make use of this change, you can replace the code file on your computer with the updated one (bin/deviaTE_analyse in this repository). In case you installed the tool via conda, it should be located somewhere along the lines of:

~/miniconda3/envs/deviaTE_env/bin/deviaTE_analyse

cahende commented 2 years ago

Thank you for creating a fix for that naming issue. I am still curious about the other issue where it said I had no annotations but I still received output, can you explain what that means?

Cory

W-L commented 2 years ago

No problem! Forgot to mention that the fix is basically replacing problematic characters with dashes, so that the analysis can proceed without issues. The message about "no annotations" refers to the optional parameter --annotation. This can be used to provide GFF3 files with annotations of the TE sequences, e.g. the location of CDS and other defined genetic elements. These will mainly be used in the visualisation, e.g. at the bottom of this one: [image: https://user-images.githubusercontent.com/16755298/161801714-24779b2b-0c4d-4aeb-82e3-e7a74214f75b.png]

cahende commented 2 years ago

Ahh, I see, thanks for clarifying! So it is working as intended, fantastic.

I also wanted to raise a broader question while I have your attention:

I am trying to identify TEs in unassembled natural genomes (the coverage is not high enough for a full assembly, especially in high-repeat regions), so the library I am using comes from TEs identified in a chromosome-level genome build of a colony population. With this method, I feel like I will miss potentially novel TEs circulating in these natural populations, and finding those is the intent of this analysis. Can you provide any ideas on how to build a more fitting library, so I can identify TEs that might not be represented in the colony genome?

Thank you, Cory

W-L commented 2 years ago

That's a tricky one. I think a two-pronged approach might be worth considering in this case.

Repository-based: Try and collect all relevant sequences from already existing TE databases for the species (and related ones) that you are studying
De-novo assembly of repeats from raw reads: There are quite a few tools that can do this, but I don't know for which species and coverage they are suitable. Some that come to my mind are RepeatExplorer (https://pubmed.ncbi.nlm.nih.gov/23376349/), dnaPipeTE (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419797/), REPdenovo (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792456/).

You could then, for example, use a combined library of TE sequences from these with deviaTE to quantify the TE content. A possibly helpful review with lots of links to databases & tools: https://www.nature.com/articles/s41576-018-0050-x#ref-CR77
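
Building the combined library could look roughly like the sketch below. It is only an outline under stated assumptions: the function names, the `lib<N>_` name prefix, and the demo sequences are hypothetical, and only exact duplicate sequences are dropped, so near-identical copies of the same TE from different sources would remain.

```python
import os
import tempfile

def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Use the first word of the header as the name.
                name, chunks = (line[1:].split() or ["unnamed"])[0], []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def merge_libraries(paths, out_path):
    """Write one FASTA, skipping exact duplicate sequences.

    Names get a per-source prefix so that entries from different
    libraries cannot collide. Returns the number of sequences written.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for index, path in enumerate(paths):
            for name, seq in read_fasta(path):
                key = seq.upper()
                if not key or key in seen:
                    continue
                seen.add(key)
                out.write(f">lib{index}_{name}\n{seq}\n")
                written += 1
    return written

# Small demo with made-up sequences.
with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, "database.fa")
    denovo = os.path.join(tmp, "denovo.fa")
    merged = os.path.join(tmp, "combined.fa")
    with open(db, "w") as fh:
        fh.write(">te1 from database\nACGTACGT\n>te2\nGGGGCCCC\n")
    with open(denovo, "w") as fh:
        fh.write(">contig1\nacgtacgt\n>contig2\nTTTTAAAA\n")
    print(merge_libraries([db, denovo], merged))
```

Collapsing similar but non-identical sequences would need a dedicated clustering step, which this sketch does not attempt.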

cahende commented 2 years ago

Thank you for the very useful information. Let me get back to you when I have had a chance to run this. I appreciate your help!

Cory
