TEdenovo error with 60bp file

pwkooij commented 4 years ago

I receive the following error message when I try to run TEdenovo on my assembly:

Fatal error: Exit code 1 ()
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Grouper_Map/genome_Blaster_Grouper_Map_consensus.fa: No such file or directory
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Piler_Map/genome_Blaster_Piler_Map_consensus.fa: No such file or directory
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Recon_Map/genome_Blaster_Recon_Map_consensus.fa: No such file or directory

The assembly has FASTA lines no longer than 60 bp, and the sequence headers are formatted as follows: >Ccostatus_symbiont_assembly_1

Any suggestions?

avkermanov commented 4 years ago

Similar problem here, only without the details:

There is no job registered for the following users: jeremy
Fatal error: Exit code 1 ()

JBerthelier commented 4 years ago

Dear pwkooij and avkermanov,

Sorry for my late answer.

It is difficult for me to help you with only these few lines of error output.

Can you share the log information?

What is the length of the genome assembly that you are studying? How many chromosomes/contigs does it contain?

pwkooij: in your case it seems that TEdenovo didn't reach the final stage, which produces the consensus. Did you fix your error?

Best regards,

Jeremy

pwkooij commented 4 years ago

Hi Jeremy,

Currently, I can't reach the computer I'm running that analysis on. However, I tried it three times and each time I received the error within a couple of minutes. Any idea?

JBerthelier commented 4 years ago

Dear pwkooij,

If the process stops after a couple of minutes, there is a good chance that it is caused by the input file.

PASTEC/TEdenovo/TEannot are capricious if there is a symbol that they don't like in the headers or sequences.

Here are the recommendations of the URGI (the PASTEC creators):

ADVICE: For your own project, verify fasta file format, each nucleic line has only 60 bps (or less). About the sequence headers, it is highly advised to write them like this : ">XX_i" with XX standing for letters and i standing for numbers. Please, avoid space (" ") or symbols such as "=", ";", ":", "|"...

For more details: https://urgi.versailles.inra.fr/Tools/PASTEClassifier/PASTEClassifier-tuto
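The advice above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of PiRATE or REPET: the function name `check_fasta` is hypothetical, and the header pattern is one strict reading of ">XX_i" (letters, one underscore, digits). Under that reading, headers such as >Ccostatus_symbiont_assembly_1 would be flagged, although whether the tools actually reject them is not confirmed here.

```python
import re

# Assumption: a strict reading of the ">XX_i" advice, i.e. letters, "_", digits.
HEADER_RE = re.compile(r"^>[A-Za-z]+_[0-9]+$")
MAX_WIDTH = 60  # maximum number of bases per sequence line, from the advice


def check_fasta(lines):
    """Return (line_number, message) pairs for lines that break the advice."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if not HEADER_RE.match(line):
                problems.append((number, "header not of the form >XX_i: " + line))
        elif len(line) > MAX_WIDTH:
            problems.append((number, "sequence line longer than 60 bp"))
    return problems
```

It accepts any iterable of lines, so an open file handle works directly, for example `check_fasta(open("genome.fa"))`, where `genome.fa` is a placeholder file name.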

However, it can also be caused by the job scheduling system: TEdenovo/PASTEC/TEannot use SGE.

This is very rare, but a failed job may be blocking the launch of new jobs.

Could you try running the command "qstat" in a console and see if a failed job appears? If nothing appears, that means it is normal.
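For those who prefer to script the qstat check, here is a small sketch. It assumes the default SGE `qstat` layout, where the fifth column is the job state and error states (for example "Eqw") contain the letter "E". The function names are hypothetical, and the column order is worth verifying against your own scheduler's output.

```python
import shutil
import subprocess


def jobs_in_error_state(qstat_output):
    """Return (job_id, state) pairs whose state column contains 'E'.

    Assumes the default SGE column order:
    job-ID, prior, name, user, state, ...
    """
    failed = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator and blank lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        state = fields[4]
        if "E" in state:
            failed.append((fields[0], state))
    return failed


def run_qstat():
    """Return qstat's text output, or None when qstat is not on the PATH."""
    if shutil.which("qstat") is None:
        return None
    result = subprocess.run(["qstat"], capture_output=True, text=True)
    return result.stdout
```

An empty result from `jobs_in_error_state(run_qstat() or "")` corresponds to the "nothing appears" case described above.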

Best regards,

Jeremy

pwkooij commented 4 years ago

Not sure why you closed this. I'm in quarantine and cannot test or check things at the moment...

JBerthelier commented 4 years ago

OK, it's reopened. Good luck with the quarantine!

Best regards,

Jeremy

pwkooij commented 4 years ago

I am now running the same dataset on my laptop at home and run into the same error. qstat doesn't show anything.

The headers are as follows: LcCcos_genome1

My assembly has 4778 contigs, total assembly length is 37.06 Mb

I used the FASTA_width formatter to have my sequences at 60bp per line

The log file shows the following:

START TEdenovo.py (2020-08-22 03:51:12)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 1
step 1 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:51:14)

START TEdenovo.py (2020-08-22 03:51:14)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 2
self-alignment with Blaster
The copy option is: False 
submitting job(s) with groupid 'genome_TEdenovo_S2_Blaster' (2020-08-22 03:51:14)
waiting for 5 job(s) with groupid 'genome_TEdenovo_S2_Blaster' (2020-08-22 03:51:15)
all jobs with groupid 'genome_TEdenovo_S2_Blaster' are finished (2020-08-22 03:52:35)
execution time of all jobs (seconds): 67.422469
execution time per job: n=5 mean=13.484 var=4.813 sd=2.194 min=9.937 med=14.449 max=15.128
submitting job(s) with groupid 'genome_TEdenovo_S2_Blaster_RmvPairAlignInChunkOverlaps' (2020-08-22 03:52:35)
waiting for 1 job(s) with groupid 'genome_TEdenovo_S2_Blaster_RmvPairAlignInChunkOverlaps' (2020-08-22 03:52:35)
execution time per job: n=1 mean=0.037 var=0.000 sd=0.000 min=0.037 med=0.037 max=0.037
submitting job(s) with groupid 'genome_TEdenovo_Blaster_FilterAlign' (2020-08-22 03:52:45)
waiting for 1 job(s) with groupid 'genome_TEdenovo_Blaster_FilterAlign' (2020-08-22 03:52:45)
execution time per job: n=1 mean=0.121 var=0.000 sd=0.000 min=0.121 med=0.121 max=0.121
step 2 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:53:00)

START TEdenovo.py (2020-08-22 03:53:00)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 3
self-alignment with Blaster
clustering with Grouper
submitting job(s) with groupid 'genome_TEdenovo_Blaster_Grouper_Matcher' (2020-08-22 03:53:00)
waiting for 1 job(s) with groupid 'genome_TEdenovo_Blaster_Grouper_Matcher' (2020-08-22 03:53:00)
execution time per job: n=1 mean=0.053 var=0.000 sd=0.000 min=0.053 med=0.053 max=0.053
submitting job(s) with groupid 'genome_TEdenovo_Blaster_Grouper_Grouper' (2020-08-22 03:53:15)
waiting for 1 job(s) with groupid 'genome_TEdenovo_Blaster_Grouper_Grouper' (2020-08-22 03:53:15)
execution time per job: n=1 mean=0.758 var=0.000 sd=0.000 min=0.758 med=0.758 max=0.758
step 3 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:53:35)

START TEdenovo.py (2020-08-22 03:53:35)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 3
self-alignment with Blaster
clustering with Piler
submitting job(s) with groupid 'genome_TEdenovo_Blaster_Piler' (2020-08-22 03:53:35)
waiting for 1 job(s) with groupid 'genome_TEdenovo_Blaster_Piler' (2020-08-22 03:53:35)
execution time per job: n=1 mean=0.496 var=0.000 sd=0.000 min=0.496 med=0.496 max=0.496
step 3 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:53:45)

START TEdenovo.py (2020-08-22 03:53:45)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 3
self-alignment with Blaster
clustering with Recon
submitting job(s) with groupid 'genome_TEdenovo_Blaster_Recon' (2020-08-22 03:53:45)
waiting for 1 job(s) with groupid 'genome_TEdenovo_Blaster_Recon' (2020-08-22 03:53:45)
execution time per job: n=1 mean=0.641 var=0.000 sd=0.000 min=0.641 med=0.641 max=0.641
step 3 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:54:00)

START TEdenovo.py (2020-08-22 03:54:00)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 4
multiple alignment with Map
WARNING: empty input file - no cluster found
step 4 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:54:00)

START TEdenovo.py (2020-08-22 03:54:00)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 4
multiple alignment with Map
WARNING: empty input file - no cluster found
step 4 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:54:00)

START TEdenovo.py (2020-08-22 03:54:01)
version 2.5
project name = genome
project directory = /home/jeremy/galaxy/tools/Pipeline/REPET/WORK
beginning of step 4
multiple alignment with Map
WARNING: empty input file - no cluster found
step 4 finished successfully
version 2.5
END TEdenovo.py (2020-08-22 03:54:01)
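A log like the one above is easier to read when reduced to one record per TEdenovo.py run. The sketch below relies only on the line patterns visible in this log ("START TEdenovo.py", "beginning of step N", "WARNING", "finished successfully"); the function name `summarize_log` is hypothetical. It makes it easy to spot runs that report success while also carrying a warning, as the step 4 runs do here.

```python
import re

STEP_RE = re.compile(r"^beginning of step (\d+)")


def summarize_log(text):
    """Return one dict per TEdenovo.py run found in the log text."""
    runs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("START TEdenovo.py"):
            # A new run begins: open a fresh record.
            current = {"step": None, "warnings": [], "ok": False}
            runs.append(current)
            continue
        if current is None:
            continue  # text before the first START line
        match = STEP_RE.match(line)
        if match:
            current["step"] = int(match.group(1))
        elif line.startswith("WARNING"):
            current["warnings"].append(line)
        elif line.endswith("finished successfully"):
            current["ok"] = True
    return runs


def suspicious_runs(runs):
    """Runs that claim success but also emitted at least one warning."""
    return [run for run in runs if run["ok"] and run["warnings"]]
```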

JBerthelier commented 4 years ago

Dear pwkooij,

I am sorry for the delay; your message got lost among the issues. I am reordering them to be more careful...

Unfortunately, I have never encountered this type of error before.

However, it seems clear that TEdenovo is not able to find repeat clusters:

"WARNING: empty input file - no cluster found",

Given this warning, I would say that this tool is not suitable for your genome assembly, because it cannot find highly similar repeated sequences, which is a bit surprising.

Was your genome assembly obtained from Illumina short-read sequencing or from PacBio/Nanopore long reads?

My guess now is that:

Either this species has very few or no highly similar repeats in its genome, which is interesting,
Or the genome assembly is missing highly similar repeat copies because they are difficult to assemble properly (I discuss this in the PiRATE paper: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4763-1#Sec2 ). In that case, this approach is not suitable for your genome assembly, and you should use the other tools to detect TEs.

That's exactly why I created PiRATE: to overcome this kind of issue. Combining approaches helps to detect TE sequences that are not repeated in the assembly, whether because of natural genome evolution (few similar copies of TEs in the genome) or because of technical artefacts (badly assembled repeat copies). If you publish your TE annotation, it could be interesting to highlight this point in your manuscript.

Did you find TEs with RepeatScout? And with the other tools?

You can also contact the creators of TEdenovo to get their advice about it: https://urgi.versailles.inra.fr/Tools/REPET/TEdenovo-tuto

I hope this helps, and sorry again for the delay.

Best regards,

Jeremy

pwkooij commented 4 years ago

Hi Jeremy

Thanks for your feedback. To be honest, I was hoping to see interesting strange things in this particular one, because it is a mutualistic symbiont. However, I of course want to make sure everything is done correctly before drawing conclusions.

This is Illumina MiSeq data, paired-end 2x300. Appr. 50-60X coverage, high QC.

I had a look at the RepeatScout results as you suggested. The output gives me 205 sequences; however, after removing the sequences shorter than 500 bp I end up with only 7.
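The 500 bp length filter mentioned above can be reproduced with a few lines of code. This is a minimal sketch using only the standard library; the function names are hypothetical and the threshold is the one quoted in this thread, not a value required by any of the tools.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the last record, which has no following header to trigger it.
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=500):
    """Keep only the records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]
```

Because multi-line records are joined before measuring, the filter works on the full sequence length rather than on individual 60 bp lines.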

Anything else I can look at to confirm this?

Cheers Pepijn

PS: Have people already tried running your pipeline with Nanopore data? I asked a while ago, and at that time you discouraged me from doing so. But if you think it is possible, I would like to give that a shot as well.

JBerthelier commented 4 years ago

Dear Pepijn,

Because you are using Illumina short-read sequencing, the assembly can miss some TE copies, which makes it difficult for TEdenovo to identify repeated TE families.

However, it is still surprising to me that TEdenovo doesn't find any cluster of TEs. RepeatScout also looks for repeated sequences, so this also indicates that you have very few highly repeated sequences in the genome assembly.

Most of the time, Class I TEs generate a lot of copies in the genome and, even in a fragmented genome assembly, they are detectable with TEdenovo. Maybe this species lacks Class I TEs, or has only a few old TE copies that are not highly similar?

You can check whether the other software (not based on repetitiveness: RepeatMasker, LTRharvest, ...) identified Class I TEs.

Also, Class II TEs can generate copies (sometimes a lot, such as MITEs).

You should check whether some of the candidate TEs that you found are young TEs (complete ORF and intact LTR or TIR terminal sequences). Maybe this species has only old TEs that can no longer be active and cannot create new copies, which could explain the lack of highly repeated sequences? (Of course, this is only speculation.)
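As a rough illustration of the "intact TIR" check, the sketch below tests whether the start of a candidate sequence matches the reverse complement of its end. This is a deliberately simple stand-in, not the method used by PiRATE or any of the tools named here: the function names, the 10 bp window and the single allowed mismatch are all arbitrary assumptions, and real TIR lengths vary between TE families. It does not cover the ORF or LTR checks.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def has_terminal_inverted_repeat(seq, length=10, max_mismatches=1):
    """True if the first `length` bases mirror the last `length` bases.

    `length` and `max_mismatches` are illustrative defaults only.
    """
    if len(seq) < 2 * length:
        return False  # too short for two non-overlapping terminal windows
    left = seq[:length].upper()
    right = reverse_complement(seq[-length:]).upper()
    mismatches = sum(1 for a, b in zip(left, right) if a != b)
    return mismatches <= max_mismatches
```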

Best

Jeremy

pwkooij commented 4 years ago

Thanks Jeremy, I think this can be an interesting story unfolding...

On your advice I had a look at the results of some of the other software packages:

RepeatMasker: 2731 sequences, but only 71 sequences after removing everything below 500bp
LTRharvest: 9 sequences

MITE Hunter gives me 11 but only 1 after correction.

It seems that all packages produce low numbers. I'll continue the analyses and see where it will end, and hopefully, I can make a nice comparison with my other genome.

Quick other question, do you think it is possible to analyse fastq data obtained with Nanopore?

Cheers Pepijn
