TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

data for pie chart proportions #105

Closed baligpanossian closed 5 months ago

baligpanossian commented 6 months ago

Hello, I ran EarlGrey on 21 assemblies but it only produced summary files for 5/21. All runs produced SAMPLE-families.fa files showing representative sequences for each family, but the log files show the following:

Trimming and sorting based on mreps, TRF, SA-SSR
Error: object 'trimmed_seq' not found
Execution halted
Removing temporary files
Reclassifying repeats
cp: cannot stat 'TS_515_WP-families.fa_7185/trf/515_WP-families.fa.nonsatellite': No such file or directory
/workspace/fastas/515_WP/515_WP_EarlGrey/515_WP_strainer
Compiling library
WARNING: TEstrainer failed to produce a strain file, please check the log file for more information. If you have run an intial mask with known repeats, this could be due RepeatModeler2 failing to identify any new repeats. Please check if this is expected.

              )  (
         (   ) )
         ) ( (
       _______)_
    .-'---------|  
       ( C|/\/\/\/\/|
    '-./\/\/\/\/|
     '_________'
      '-------'
    <<< Identifying Repeats Using Species-Specific Library >>>
RepeatMasker version 4.1.5
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
RepeatMasker::setspecies: Could not find user specified library /workspace/fastas/515_WP/515_WP_EarlGrey/515_WP_strainer/TS_515_WP-families.fa_7185/515_WP-families.fa.strained, or the file is empty.
ERROR: RepeatMasker failed, please check logs

This doesn't seem to be an assembly issue, because there were good and poor assembled sequences in both the successful runs and the failed runs.
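With 21 runs to triage, it may help to pull the failure lines out of each log automatically. The following is a minimal sketch, assuming the logs are plain text and that failures start with the same prefixes seen in the excerpt above (`Error:`, `ERROR:`, `WARNING:`, `cp: cannot stat`). The name `failure_lines` is hypothetical and not part of Earl Grey.

```python
import re
from typing import Iterable, List, Tuple

# Prefixes taken from the log excerpt in this thread; other failure
# messages may exist, so treat this list as an assumption.
FAILURE_PATTERN = re.compile(r"^(Error:|ERROR:|WARNING:|cp: cannot stat)")


def failure_lines(log_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that looks like a failure."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if FAILURE_PATTERN.match(line)
    ]
```

Passing an open log file handle to `failure_lines` for each run gives a quick view of whether all failed runs stop at the same step.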

I would appreciate any guidance on how/where to find the identified TEs in the temporary files in both successful and failed runs to calculate raw values for percentage of TEs covering the genomes, similar to what the pie charts show in the completed runs.

TobyBaril commented 6 months ago

In this case, TEstrainer has failed to produce a strained version of the input TE library. This could be because no non-satellite repeats were found in the RepeatModeler run, but this needs to be verified for the runs that failed. What version of Earl Grey are you using? Which OS? What flags did you use to run Earl Grey? Providing the whole log file should help to understand where the process has failed.

Verify that the -families.fa files contain sequences that you expect. Are any families missing or unexpected?
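One way to do this check is to tally the library by classification. This is a minimal sketch, assuming the headers follow the usual RepeatModeler style of `>name#Class/Superfamily`; headers without a `#` are counted as `Unclassified`. The function name `tally_families` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def tally_families(fasta_lines: Iterable[str]) -> Counter:
    """Count sequences per classification in a FASTA repeat library.

    Assumption: the classification is the text after '#' in the first
    whitespace-delimited token of each header line.
    """
    counts: Counter = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        header = tokens[0] if tokens else ""
        _, separator, classification = header.partition("#")
        if separator and classification:
            counts[classification] += 1
        else:
            counts["Unclassified"] += 1
    return counts
```

Calling `tally_families(open("SAMPLE-families.fa"))` and printing `most_common()` shows at a glance whether the library is dominated by simple repeats, satellites, or unknown families.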

If you are happy to skip the filtering steps, you can rename *-families.fa to /workspace/fastas/515_WP/515_WP_EarlGrey/515_WP_strainer/TS_515_WP-families.fa_7185/515_WP-families.fa.strained and rerun Earl Grey with the same command to skip to the masking step. I strongly recommend against this, as it risks incorrect TE annotations and leaves the consensus sequences unrefined.

baligpanossian commented 6 months ago

Thank you for the prompt response. It is possible that these samples had no non-satellite repeats, but I'd want to check before ruling that out. Also, I'm now only trying to get the raw data from which the pie charts are generated. I used EarlGrey 4.2.3 with a conda installation on Linux. My command for this sample was: EarlGrey -t 8 -g 515_WP.fasta -s 515 -o 515

The full log file is attached

515_WPEarlGrey.log

TobyBaril commented 6 months ago

The script that failed was the simple repeat filter trimming step in TEstrainer... @jamesdgalbraith any ideas on this one?

It looks like all previous steps of TEstrainer completed successfully, so something in the post-processing has caused the issue.

TobyBaril commented 6 months ago

Regarding the data used for the plots, this is in the _summaryFiles directory after a successful run. All the plots are generated using SAMPLE.filteredrepeats.gff
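For the raw percentages, the following is a minimal sketch, assuming SAMPLE.filteredrepeats.gff is a standard tab-separated GFF with the TE classification in column 3 and with non-overlapping features (overlaps would be double-counted). `genome_size` is the total assembly length in bp, which you supply yourself. The function name `coverage_by_class` is hypothetical and is not how Earl Grey itself builds the plots.

```python
from collections import defaultdict
from typing import Dict, Iterable


def coverage_by_class(gff_lines: Iterable[str], genome_size: int) -> Dict[str, float]:
    """Return the percentage of the genome covered by each feature type.

    Assumptions: column 3 holds the TE classification, columns 4 and 5
    hold 1-based inclusive start/end, and features do not overlap.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    bases: Dict[str, int] = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed line, ignore in this sketch
        start, end = int(fields[3]), int(fields[4])
        bases[fields[2]] += end - start + 1
    return {name: 100.0 * count / genome_size for name, count in bases.items()}
```

Summing the returned values gives the total repeat percentage under the same assumptions; the remainder up to 100 is the non-repeat fraction.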

baligpanossian commented 6 months ago

> Regarding the data used for the plots, this is in the _summaryFiles directory after a successful run - All the plots are generated using SAMPLE.filteredrepeats.gff

Thank you, this is perfect for the samples that ran successfully. As for those with an unsuccessful run that didn't generate anything in _summaryFiles, can I manually calculate the data from another file upstream of the summary?

TobyBaril commented 6 months ago

You will need to run the final RepeatMasker step, followed by the post-filtering. The easiest way to do this is to rename the families.fa file to the .strained file name, delete everything in mergedRepeats/ and RepeatMasker_Against_Custom_Library/ (but not the directories themselves), then run exactly the same command you did before. Earl Grey will continue as if the TEstrainer step had completed successfully.
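The preparation steps can be sketched as below. This is a minimal sketch under stated assumptions: all paths are passed in by you, since the exact locations of mergedRepeats/ and RepeatMasker_Against_Custom_Library/ depend on your output directory, and the library is copied rather than renamed so the original -families.fa is preserved (a choice of this sketch, not part of the advice above). The function names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Iterable


def clear_directory_contents(directory: Path) -> int:
    """Delete everything inside `directory` but keep the directory itself."""
    removed = 0
    for entry in directory.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def prepare_rerun(families_fa: Path, strained_target: Path,
                  dirs_to_empty: Iterable[Path]) -> None:
    """Put the unfiltered library where the .strained file is expected,
    then empty the given output directories (missing ones are skipped)."""
    strained_target.parent.mkdir(parents=True, exist_ok=True)
    # Copy instead of rename: keeps the original library (sketch's choice).
    shutil.copyfile(families_fa, strained_target)
    for directory in dirs_to_empty:
        if directory.is_dir():
            clear_directory_contents(directory)
```

After calling `prepare_rerun` with the paths for a failed sample, rerun the original EarlGrey command unchanged.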

I would still recommend trying to work out why the TEstrainer step failed, hopefully James could give us some more insight!

jamesdgalbraith commented 6 months ago

Sorry for the delay, I think I've identified the problem.

In the /workspace/fastas/515_WP/515_WP_EarlGrey/515_WP_strainer/TS_515_WP-families.fa_7185/TRF/ folder, is there a FASTA file that ends with the extension .nonsatellite? If not, I think that's the error and I'll need to patch this.
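A quick way to check this across many runs is sketched below. It assumes the TEstrainer run directory (the `TS_..._7185` folder) is passed in, and it accepts either `trf/` or `TRF/` as the subfolder name, since the log above shows the lowercase form. The function name `find_nonsatellite_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_nonsatellite_files(strainer_run_dir: Path) -> List[Path]:
    """Return non-empty files ending in '.nonsatellite' found in a
    'trf' subfolder (matched case-insensitively) of the run directory."""
    hits: List[Path] = []
    for sub in sorted(strainer_run_dir.iterdir()):
        if sub.is_dir() and sub.name.lower() == "trf":
            hits.extend(
                path
                for path in sorted(sub.iterdir())
                if path.is_file()
                and path.name.endswith(".nonsatellite")
                and path.stat().st_size > 0
            )
    return hits
```

An empty list for a failed run would be consistent with the `cp: cannot stat` message in the log.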

baligpanossian commented 6 months ago

Hello again, thank you for following up on this to try to find a fix. After attempting the workaround you suggested previously in this thread, I came across an error indicating that the .nonsatellite file you mention here was not found. I tried to fix this by adding a copy of the -families.fa file and renaming it to -families.fa.nonsatellite in the same directory (/TRF/), but it still didn't produce the summary files. Hope this helps narrow it down.

TobyBaril commented 5 months ago

This is planned to be patched in the next release, which should be going live shortly. Thanks for bringing this up!