Closed jaehakson closed 5 months ago
Hi, in this case it looks like RepeatModeler failed with "eleredef failed. Exit code 11". The -recoverDir option only checks whether an intact run can be restarted, so it won't recover a failed run in this instance. It is difficult to determine why RECON failed here. It could just be a bad seed (in which case a fresh run might work), but there also seems to be a permission issue with rmod.log.
Is this being run on a queuing system? Where is RepeatModeler installed (conda environment or manual install)? In this case, it looks like RepeatModeler2 is trying to write a log to the root directory (/), which is definitely going to cause some permission issues for the run, likely causing it to fail.
Yes, it is run on a queuing system (Slurm). I installed miniconda in my home directory and then installed earlgrey with that conda, so the earlgrey environment is located in the miniconda envs directory in my home directory.
RepeatModeler is also in the earlgrey environment.
Jae
Exit code 11 usually indicates a segmentation fault on Unix systems. On a Slurm system, potential causes include the job using too much memory or not being given enough cores. Generally, repeat annotation on larger genomes requires a high-memory node to avoid being killed by the queuing system.
I would recommend trying a fresh run. Alternatively, the Docker container may work better, depending on the architecture of your HPC and queuing system.
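As a quick way to check the reading of "11" as a segmentation fault, the sketch below maps a status number to a signal name using Python's standard `signal` module. The helper name `describe_status` is hypothetical, and it assumes the number is either a raw signal number (as some wrappers report it) or a shell-style `128 + N` exit code. A program can also exit with a plain status of 11 for unrelated reasons, so treat the result as a hint only.

```python
import signal


def describe_status(code: int) -> str:
    """Name the signal a status number may correspond to (hypothetical helper)."""
    # Shells report "killed by signal N" as exit code 128 + N.
    signum = code - 128 if code > 128 else code
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number does not match any signal known on this platform.
        return "not a signal number"


print(describe_status(11))
print(describe_status(139))
```

On Linux, both calls name SIGSEGV, which is consistent with a memory-related crash rather than a configuration error.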
Thanks for the comment. Maybe I should try containers, either Docker or Singularity.
In addition, when I ran earlgrey with a small genome (about 200 Mbp), not all of the final output files were created in the *_summaryFiles directory; only three files were created. I ran it several times and got only three files in the directory every time.
I attached the log file here (I cut out some parts because of the size limit).
The error has occurred in the post-filtering step:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 940, saw 10
Have you got spaces or strange characters in your FASTA header names in the input file? If so, this will cause some methods to fail.
I recommend checking line 940 in ${species}_EarlGrey/${species}_mergedRepeats/looseMerge/*.rmerge.gff.filtered to see if there is something strange about this line that could help with debugging.
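Checking a single line by eye can miss the problem, so it may help to scan the whole file for records whose field count differs from the 9 columns pandas expects. The following is a minimal sketch using only the standard library; the function name `find_bad_lines` is hypothetical, and it assumes the file is tab-delimited with `#` marking comment lines.

```python
def find_bad_lines(gff_path, expected=9, sep="\t"):
    """Return (line_number, field_count, text) for records with an unexpected field count."""
    bad = []
    with open(gff_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            # Skip blank lines and comment lines, which carry no record.
            if not text or text.startswith("#"):
                continue
            count = len(text.split(sep))
            if count != expected:
                bad.append((number, count, text))
    return bad
```

Calling `find_bad_lines(path)` on the `.rmerge.gff.filtered` file, and then again with `sep=None` to split on any whitespace, shows whether the field count depends on the separator used.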
Hmm. I tried to figure out the errors but could not. First, I trimmed the headers of the input FASTA file as follows: ">JAEIHA010000001.1 Zaprionus indianus isolate RCR04 contig_1, whole genome shotgun sequence" -> ">JAEIHA010000001.1". Then I ran earlgrey in the conda environment, but I got the same error as before.
pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 954, saw 10
Then I checked line 954 in ${species}_EarlGrey/${species}_mergedRepeats/looseMerge/*.rmerge.gff.filtered and nothing weird showed up. In lines 953, 954, and 955, I found only 9 fields, not 10 (see lines 953-955 below).
ctg_1016 RepeatMasker LTR/Gypsy 439913 443130 23901 - NA Tstart=1582;Tend=4584;ID=RND-1_FAMILY-240;shortTE=F;LTRgroup=ctg_1016_g6;TEgroup=ctg_1016|RND-1_FAMILY-240|4
ctg_1016 RepeatMasker LTR/Pao 443132 444130 8932 - NA Tstart=3496;Tend=4512;ID=RND-1_FAMILY-189;shortTE=F;LTRgroup=ctg_1016_g6,ctg_1016_g7
ctg_1016 RepeatMasker LTR/Gypsy 444131 444302 1270 - NA Tstart=5232;Tend=5406;ID=RND-4_FAMILY-1454;shortTE=F;LTRgroup=ctg_1016_g7
Below is the part of the log file
########################################################
<<< Resolving Overlapping Repeats >>>
Warning messages:
1: package ‘GenomicRanges’ was built under R version 4.3.3
2: package ‘BiocGenerics’ was built under R version 4.3.2
3: package ‘S4Vectors’ was built under R version 4.3.3
4: package ‘IRanges’ was built under R version 4.3.3
5: package ‘GenomeInfoDb’ was built under R version 4.3.2
Warning message:
package ‘ape’ was built under R version 4.3.3
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//filteringOverlappingRepeats.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.sorted"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
Warning message:
package ‘data.table’ was built under R version 4.3.3
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//mergeRepeats.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.mergedRepeats.bed"
[8] "197260855"
[9] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.mergedRepeats.revisedTable"
[10] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.bed"
[11] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.summary"
[12] "no"
Traceback (most recent call last):
File "/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//backSwapGFF.py", line 14, in
mv: cannot stat ‘/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered.2’: No such file or directory
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//makeGff.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.bed"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
[8] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.gff"
Error in `$<-.data.frame`(`*tmp*`, V8, value = ".") :
replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
Execution halted
) (
( ) )
) ( (
_______)_
.-'---------|
( C|/\/\/\/\/|
'-./\/\/\/\/|
'_________'
'-------'
<<< Done! >>>
I figured out the issue and solved it. In line 14 of the file "miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts/backSwapGFF.py", I changed the separator from "\s+" to "\t", and then I got the entire earlgrey output.
Could this be a typo in the code?
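To illustrate why the two separators can disagree, here is a small standard-library sketch. The record is made up for illustration (it is not taken from the real file), and it assumes that some column contains a space, which is one plausible way to get "Expected 9 fields, saw 10" when splitting on whitespace.

```python
import re

# Hypothetical 9-column record; the last column contains a space.
record = "\t".join([
    "ctg_1", "RepeatMasker", "LTR/Gypsy", "100", "200", "50", "-", "NA",
    "ID=FAMILY-1;note=two words",
])

tab_fields = record.split("\t")                # split on tabs only
whitespace_fields = re.split(r"\s+", record)   # split on any run of whitespace

print(len(tab_fields))         # 9
print(len(whitespace_fields))  # 10
```

Splitting on tabs keeps the last column whole, while splitting on whitespace breaks it in two. Whether this is what happened in the reported run is not confirmed by the thread.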
This is odd. I haven't been able to reproduce this bug on any of the machines here (multiple Linux and Mac systems). If this works for you, then I'm happy it is a good solution!
Hi Toby,
I am trying to run earlGrey (installed with conda) on two genomes. One genome is small (200 Mb) and the other is big (1.3 Gb). When running it on the big genome, I got errors. I suspect the error is caused by the RepeatModeler output.
My earlGrey run stopped at the RepeatModeler stage because RepeatModeler did not create the .classified file in the RepeatModeler directory and did not copy -families.fa, -families.stk, and -rmod.log into the Database directory.
When I re-ran RepeatModeler with the -recoverDir option, it reported that RepeatModeler ran successfully. However, it did not create or copy the files needed for the downstream steps, so I am stuck at this step with the big genome. With the small genome, there is no problem.
I think I can manually create the *.classified file using RepeatClassifier and then copy the appropriate files into the Database directory. Then I will use the same earlGrey command with the big genome.
I wonder whether this approach works without issues and creates the same earlGrey outputs.
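Before re-running after a manual copy, it may be worth confirming that the three files named above are actually in place. The sketch below is a hypothetical helper, assuming the files are named by appending each suffix to the database prefix (for example, the "Database =" path printed in the log); that naming is an assumption based on the suffixes listed in this post.

```python
from pathlib import Path

# Suffixes taken from the post above; the prefix + suffix naming is an assumption.
EXPECTED_SUFFIXES = ("-families.fa", "-families.stk", "-rmod.log")


def missing_outputs(database_prefix):
    """Return the expected RepeatModeler output paths that do not exist yet."""
    prefix = str(database_prefix)
    return [
        prefix + suffix
        for suffix in EXPECTED_SUFFIXES
        if not Path(prefix + suffix).is_file()
    ]
```

An empty list means all three files are present; any returned paths are the ones still missing.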
Below is the log file for the big genome.
Building database housefly_aabys:
Reading /scratch/js3054/housefly/ragtag_option/scaff_hifi_hic/3d-dna/post_review/base_HiC.fasta.prep...
Number of sequences (bp) added to database: 502 ( 1357786862 bp )
RepeatModeler Version 2.0.5
Using output directory = /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_RepeatModeler/RM_64325.SatMay112212482024
Search Engine = rmblast 2.14.1+
Threads = 32
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1715479967
Database = /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_Database/housefly_aabys .
Contig Histogram:
Size(bp) Count
230076028-246509918 | [ 2 ]
213642138-230076027 | [ ]
197208248-213642137 | [ 2 ]
180774358-197208247 | [ ]
164340469-180774358 | [ 1 ]
147906579-164340468 | [ ]
131472689-147906578 | [ ]
115038799-131472688 | [ ]
98604909-115038798 | [ 1 ]
82171020-98604909 | [ 1 ]
65737130-82171019 | [ ]
49303240-65737129 | [ 1 ]
32869350-49303239 | [ ]
16435460-32869349 | [ ]
1571-16435460 |***** [ 494 ]
Storage Throughput = excellent ( 1828.62 MB/s )
Ready to start the sampling process. INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly and the repetitive content of the sequences. It is not imperative that RepeatModeler completes all rounds in order to obtain useful results. At the completion of each round, the files ( consensi.fa, and families.stk ) found in: /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_RepeatModeler/RM_64325.SatMay112212482024/ will contain all results produced thus far. These files may be manually copied and run through RepeatClassifier should the program be terminated early.
RepeatModeler Round # 1 . . . Comparison Time: 06:42:39 (hh:mm:ss) Elapsed Time, 564088 HSPs Collected