Closed baligpanossian closed 5 months ago
In this case, TEstrainer has failed to produce a strained version of the input TE library. This could be because no non-satellite repeats were found in the RepeatModeler run, but this needs to be verified for the runs that failed. What version of Earl Grey are you using? Which OS? What flags did you use to run Earl Grey? Providing the whole log file should help to understand where the process has failed.
Verify that the -families.fa files contain the sequences that you expect. Are any families missing or unexpected?
If you are happy to skip the filtering steps, you can rename *-families.fa to /workspace/fastas/515_WP/515_WP_EarlGrey/515_WP_strainer/TS_515_WP-families.fa_7185/515_WP-families.fa.strained and rerun Earl Grey with the same command to skip to the masking step. I strongly recommend against this, as it runs the risk of the TE annotations being wrong and the consensus sequences not being refined.
Thank you for the prompt response. It is possible that these samples had no non-satellite repeats, but I'd want to check before ruling that out. Also, I'm now only trying to get the raw data with which the pie charts are generated.
I'm using Earl Grey 4.2.3, installed with conda on Linux. My command for this sample was:
EarlGrey -t 8 -g 515_WP.fasta -s 515 -o 515
The full log file is attached
The script that failed was the simple repeat filter trimming step in TEstrainer... @jamesdgalbraith any ideas on this one?
It looks like all previous steps of TEstrainer completed successfully, so it is something in the post-processing that has caused an issue
Regarding the data used for the plots, this is in the _summaryFiles directory after a successful run. All the plots are generated using SAMPLE.filteredrepeats.gff
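For anyone wanting raw percentages rather than the plots, here is a minimal Python sketch of how coverage could be tallied from a GFF file such as SAMPLE.filteredrepeats.gff. It assumes the standard tab-separated GFF layout (sequence ID in column 1, 1-based inclusive start and end in columns 4 and 5) and assumes that column 3 holds the TE classification; check your own file before relying on that. The function names are hypothetical and not part of Earl Grey, and the grouping may not match the categories used in the Earl Grey pie charts.

```python
from collections import defaultdict


def genome_length(fasta_lines):
    """Count bases in a FASTA given as an iterable of lines (headers skipped)."""
    return sum(len(line.strip()) for line in fasta_lines if not line.startswith(">"))


def te_coverage(gff_lines):
    """Return {classification: bases covered}.

    Assumption: column 3 of the GFF is the TE classification.
    Overlapping or adjacent hits of the same class on the same
    sequence are merged so that no base is counted twice.
    """
    intervals = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a feature line
        seqid, _source, te_class, start, end = cols[:5]
        intervals[(te_class, seqid)].append((int(start), int(end)))

    covered = defaultdict(int)
    for (te_class, _seqid), spans in intervals.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:  # overlaps or touches the current span
                cur_end = max(cur_end, end)
            else:
                covered[te_class] += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered[te_class] += cur_end - cur_start + 1
    return dict(covered)


def coverage_percentages(covered, genome_size):
    """Convert base counts into percentages of the genome."""
    return {te_class: 100 * bases / genome_size for te_class, bases in covered.items()}
```

A possible use would be to open the GFF and the genome FASTA, pass the file handles to `te_coverage` and `genome_length`, and then call `coverage_percentages` on the results. Note that hits from different classifications that overlap each other are still counted once per class in this sketch.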
Thank you, this is perfect for the samples that ran successfully. As for those with an unsuccessful run that didn't generate anything in the _summaryFiles directory, can I manually calculate the data from another file upstream of the summary?
You will need to run the final RepeatMasker step, followed by the post-filtering. The easiest way to do this is to rename the families.fa file so that it takes the place of the missing .strained file, delete everything inside mergedRepeats/ and RepeatMasker_Against_Custom_Library/ (but not the directories themselves), and then run exactly the same command you did before. Earl Grey will continue as if the TEstrainer step completed successfully.
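The steps above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, every path is passed in explicitly because the exact directory names depend on your run, and it copies the families file instead of renaming it so that the original is preserved. The earlier caution still applies, since this skips the TEstrainer filtering.

```python
import shutil
from pathlib import Path


def empty_directory(directory):
    """Delete everything inside `directory` but keep the directory itself.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for entry in Path(directory).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and its contents
        else:
            entry.unlink()  # remove a file or symlink
        removed.append(entry.name)
    return sorted(removed)


def stage_strained_library(families_fa, strained_path, dirs_to_empty):
    """Put the unfiltered library where the .strained file is expected,
    then empty the downstream directories so the run can resume.

    Assumption: `strained_path` is the exact .strained filename that
    Earl Grey looks for in your output directory.
    """
    shutil.copyfile(families_fa, strained_path)
    return {str(d): empty_directory(d) for d in dirs_to_empty}
```

After calling `stage_strained_library` with your own paths, rerun the original Earl Grey command unchanged.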
I would still recommend trying to work out why the TEstrainer step failed, hopefully James could give us some more insight!
Sorry for the delay, I think I've identified the problem.
Is there a FASTA file that ends with the extension .nonsatellite in the /workspace/fastas/515_WP/515_WP_EarlGrey/515_WP_strainer/TS_515_WP-families.fa_7185/TRF/ folder? If not, I think that's the error and I'll need to patch this.
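When many runs have failed, a quick way to perform this check on each of them is a small Python helper like the one below. The function name is hypothetical, and the sketch only looks for files whose names end in .nonsatellite and reports their sizes; it does not validate their contents.

```python
from pathlib import Path


def find_nonsatellite(trf_dir):
    """Return {filename: size in bytes} for files ending in .nonsatellite.

    An empty dict means no such file exists in `trf_dir`.
    """
    return {
        path.name: path.stat().st_size
        for path in sorted(Path(trf_dir).glob("*.nonsatellite"))
        if path.is_file()
    }
```

Calling `find_nonsatellite` on the TRF/ folder of each failed run and getting an empty result would be consistent with the missing file suspected above.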
Hello again, thank you for following up on this to try to find a fix. After attempting the workaround you suggested previously in this thread, I came across an error indicating that the .nonsatellite file you mention here is not found. I tried to fix this by adding a copy of the -families.fa file and renaming it to -families.fa.nonsatellite in the same directory (/TRF/) but it still didn't produce the summary files. Hope this helps narrow it down
This is planned to be patched in the next release, which should be going live shortly. Thanks for bringing this up!
Hello, I ran EarlGrey on 21 assemblies but it only produced summary files for 5/21. All runs produced SAMPLE-families.fa files showing representative sequences for each family, but the log files show the following:
This doesn't seem to be an assembly issue, because there were both well-assembled and poorly assembled sequences among the successful runs and among the failed runs.
I would appreciate any guidance on how/where to find the identified TEs in the temporary files in both successful and failed runs to calculate raw values for percentage of TEs covering the genomes, similar to what the pie charts show in the completed runs.