jaehakson commented 6 months ago

Hi Toby,

I am trying to run EarlGrey (installed via conda) on two genomes. One genome is small (200 Mb) and the other is large (1.3 Gb). When running it on the large genome, I got errors. I suspect the error comes from the RepeatModeler output.

My EarlGrey run stopped at the RepeatModeler stage because RepeatModeler did not create the .classified file in the RepeatModeler directory and did not copy -families.fa, -families.stk, and -rmod.log into the Database directory.

When I re-ran RepeatModeler with the -recoverDir option, it reported that RepeatModeler ran successfully. However, it did not create or copy the files needed for the downstream steps, so I am stuck at this step with the large genome. With the small genome, there is no problem.

I think I can manually create the *.classified file using RepeatClassifier and then copy the appropriate files into the Database directory. Then I would rerun the same EarlGrey command on the large genome.

I wonder whether this approach would work without issues and produce the same EarlGrey outputs.
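The manual route described above could be planned roughly as follows. This is only a sketch under assumptions: the function name `plan_recovery` is hypothetical, the RepeatClassifier options should be checked against its own help output, and the output names (`consensi.fa.classified`, `families-classified.stk`) and the `<database>-families.*` targets are assumed rather than confirmed in this thread. Nothing here establishes that EarlGrey will accept the copied files.

```python
from pathlib import Path


def plan_recovery(round_dir: str, db_dir: str, db_name: str):
    """Return the classifier command and the copy plan without running anything."""
    round_path = Path(round_dir)
    consensi = round_path / "consensi.fa"
    stockholm = round_path / "families.stk"

    # Assumed invocation; verify the option names with the tool's help text.
    command = [
        "RepeatClassifier",
        "-consensi", str(consensi),
        "-stockholm", str(stockholm),
    ]

    # Assumed naming of the classified outputs and of the database-side copies.
    copies = [
        (round_path / "consensi.fa.classified",
         Path(db_dir) / f"{db_name}-families.fa"),
        (round_path / "families-classified.stk",
         Path(db_dir) / f"{db_name}-families.stk"),
    ]
    return command, copies
```

The command could then be run with `subprocess.run(command, check=True)` and each pair copied with `shutil.copy`, after confirming that `consensi.fa` and `families.stk` actually exist in the round directory.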

Below is the log file for the big genome.

          )  (
     (   ) )
     ) ( (
   _______)_
.-'---------|  
   ( C|/\/\/\/\/|
'-./\/\/\/\/|
 '_________'
  '-------'
<<< Cleaning Genome >>>

          )  (
     (   ) )
     ) ( (
   _______)_
.-'---------|  
   ( C|/\/\/\/\/|
'-./\/\/\/\/|
 '_________'
  '-------'
<<< Detecting Novel Repeats >>>

Building database housefly_aabys:
  Reading /scratch/js3054/housefly/ragtag_option/scaff_hifi_hic/3d-dna/post_review/base_HiC.fasta.prep...
Number of sequences (bp) added to database: 502 ( 1357786862 bp )
RepeatModeler Version 2.0.5

Using output directory = /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_RepeatModeler/RM_64325.SatMay112212482024
Search Engine = rmblast 2.14.1+
Threads = 32
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1715479967
Database = /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_Database/housefly_aabys .

Sequences = 502
Bases = 1357786862
N50 = 231866926
Contig Histogram:
Size(bp)              Count
230076028-246509918 | [ 2 ]
213642138-230076027 | [ ]
197208248-213642137 | [ 2 ]
180774358-197208247 | [ ]
164340469-180774358 | [ 1 ]
147906579-164340468 | [ ]
131472689-147906578 | [ ]
115038799-131472688 | [ ]
98604909-115038798 | [ 1 ]
82171020-98604909 | [ 1 ]
65737130-82171019 | [ ]
49303240-65737129 | [ 1 ]
32869350-49303239 | [ ]
16435460-32869349 | [ ]
1571-16435460 |***** [ 494 ]

Storage Throughput = excellent ( 1828.62 MB/s )

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly and the repetitive content of the sequences. It is not imperative that RepeatModeler completes all rounds in order to obtain useful results. At the completion of each round, the files ( consensi.fa, and families.stk ) found in: /projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/housefly_aabys_EarlGrey/housefly_aabys_RepeatModeler/RM_64325.SatMay112212482024/ will contain all results produced thus far. These files may be manually copied and run through RepeatClassifier should the program be terminated early.

RepeatModeler Round # 1 . . . Comparison Time: 06:42:39 (hh:mm:ss) Elapsed Time, 564088 HSPs Collected

RECON: Running imagespread.. RECON Elapsed: 00:00:00 (hh:mm:ss) Elapsed Time
RECON: Running initial definition of elements ( eledef ).. RECON Elapsed: 00:00:38 (hh:mm:ss) Elapsed Time
RECON: Running re-definition of elements ( eleredef ).. eleredef failed. Exit code 11
ERROR: RepeatModeler Failed, Retrying with limit set as Round 5
Could not open up /rmod.log for writing!
ERROR: RepeatModeler Failed, Retrying with limit set as Round 4
Could not open up /rmod.log for writing!
ERROR: RepeatModeler Failed

TobyBaril commented 6 months ago

Hi, in this case it looks like RepeatModeler failed: "eleredef failed. Exit code 11". The -recoverDir option only checks whether an intact run can be restarted, so it won't recover a failed run in this instance. It is difficult to determine why RECON failed here. It could just be a bad seed (in which case a fresh run might work), but there also seems to be a permission issue with rmod.log.

Is this being run on a queuing system? Where is RepeatModeler installed (conda environment, or manual install)? In this case, it looks like RepeatModeler2 is trying to write a log to root /, which is definitely going to cause some permission issues for the run, likely causing it to fail.
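The message "Could not open up /rmod.log for writing!" is what a log path looks like when its directory part is empty. Whether that is what actually happens inside RepeatModeler is an assumption, not something shown in this thread. A minimal sketch of the effect, plus a pre-flight check that could be run before submitting a job (both helper names are hypothetical):

```python
import os


def log_path(output_dir: str, name: str = "rmod.log") -> str:
    """Build a log path by plain concatenation, as a simple script might."""
    return output_dir + "/" + name


def can_write_log(output_dir: str) -> bool:
    """True only if output_dir is non-empty, is a directory, and is writable."""
    return (
        bool(output_dir)
        and os.path.isdir(output_dir)
        and os.access(output_dir, os.W_OK)
    )
```

With an empty directory string, `log_path("")` gives `/rmod.log`, which an ordinary user cannot create.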

jaehakson commented 6 months ago

Yes, it is run on a queuing system (Slurm). I installed Miniconda in my home directory and then installed EarlGrey with that conda, so the earlgrey environment is located in the Miniconda envs directory in my home directory.

RepeatModeler is also in the earlgrey environment.

Jae

TobyBaril commented 6 months ago

Exit code 11 usually indicates a segmentation fault on Unix systems. Potential causes on a Slurm system include using too much memory or not being allocated enough cores. Generally, repeat annotation on larger genomes requires a high-memory node to avoid being killed by the queuing system.

I would recommend trying a fresh run. Alternatively, the Docker container may work better, depending on the architecture of your HPC and queuing system.
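As a rough illustration of how a status of 11 maps to a segmentation fault: if the number in the log is a raw wait status (an assumption about how the failure is reported), the low bits carry the signal number. The helper name below is hypothetical, and the sketch is Unix-only.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Interpret a raw wait status such as the one returned by os.wait()."""
    if os.WIFSIGNALED(status):
        number = os.WTERMSIG(status)
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {number} ({name})"
    if os.WIFEXITED(status):
        return f"exited normally with code {os.WEXITSTATUS(status)}"
    return f"unrecognised status {status}"


print(describe_wait_status(11))
```

On Linux, signal 11 is SIGSEGV. A segmentation fault by itself does not say whether memory limits, a bug, or unusual input was responsible.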

jaehakson commented 5 months ago

Thanks for the comment. Maybe I should try containers (Docker or Singularity).

In addition, when I ran EarlGrey on a small genome (about 200 Mbp), not all of the final outputs were created in the *_summaryFiles directory. Only three files were created. I ran it several times and got only three files in the directory every time.

TE annotations in GFF3 and BED format
de novo repeat library in FASTA format
Combined repeat library in FASTA format (OPTIONAL)

I attached the log file here (I cut out some of part because of size limit).

earlgrey.log

TobyBaril commented 5 months ago

The error has occurred in the post-filtering step:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 940, saw 10

Have you got spaces or strange characters in your FASTA header names in the input file? If so, this will cause some methods to fail.

I recommend checking line 940 in ${species}_EarlGrey/${species}_mergedRepeats/looseMerge/*.rmerge.gff.filtered to see if there is something strange about this line which could help to debug.
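Rather than inspecting a single line by eye, the whole file can be scanned for records that do not split into nine whitespace-separated fields, which approximates how a parser using a whitespace separator sees the file. This is a sketch, and the function name is hypothetical.

```python
def lines_with_unexpected_fields(path, expected=9):
    """Return (line_number, field_count, text) for records that do not
    split into `expected` whitespace-separated fields."""
    hits = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            # Skip blank lines and comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            count = len(line.split())
            if count != expected:
                hits.append((number, count, line.rstrip("\n")))
    return hits
```

Printing each hit with `repr()` makes stray spaces and tabs visible.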

jaehakson commented 5 months ago

Hmm. I tried to figure out the errors but could not. First, I trimmed the headers of the input FASTA file as follows: ">JAEIHA010000001.1 Zaprionus indianus isolate RCR04 contig_1, whole genome shotgun sequence" -> ">JAEIHA010000001.1". Then I ran EarlGrey in the conda environment, but I got the same error as before.
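The header trimming described above can be sketched as follows; the function name is hypothetical, and the rule (keep everything up to the first whitespace) is inferred from the single example given.

```python
def trim_fasta_headers(lines):
    """Yield FASTA lines with each header cut at the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            identifier = parts[0] if parts else ""
            yield ">" + identifier + "\n"
        else:
            yield line


example = [
    ">JAEIHA010000001.1 Zaprionus indianus isolate RCR04 contig_1, "
    "whole genome shotgun sequence\n",
    "ACGT\n",
]
print("".join(trim_fasta_headers(example)), end="")
```

Sequence lines pass through unchanged, so the function can be applied directly to an open file handle.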

pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 954, saw 10

Then I checked line 954 in ${species}_EarlGrey/${species}_mergedRepeats/looseMerge/*.rmerge.gff.filtered, but nothing odd showed up. Lines 953, 954, and 955 each have only 9 fields, not 10 (see below).

ctg_1016 RepeatMasker LTR/Gypsy 439913 443130 23901 - NA Tstart=1582;Tend=4584;ID=RND-1_FAMILY-240;shortTE=F;LTRgroup=ctg_1016_g6;TEgroup=ctg_1016|RND-1_FAMILY-240|4
ctg_1016 RepeatMasker LTR/Pao 443132 444130 8932 - NA Tstart=3496;Tend=4512;ID=RND-1_FAMILY-189;shortTE=F;LTRgroup=ctg_1016_g6,ctg_1016_g7
ctg_1016 RepeatMasker LTR/Gypsy 444131 444302 1270 - NA Tstart=5232;Tend=5406;ID=RND-4_FAMILY-1454;shortTE=F;LTRgroup=ctg_1016_g7

Below is part of the log file:
########################################################
<<< Resolving Overlapping Repeats >>>
Warning messages:
1: package ‘GenomicRanges’ was built under R version 4.3.3
2: package ‘BiocGenerics’ was built under R version 4.3.2
3: package ‘S4Vectors’ was built under R version 4.3.3
4: package ‘IRanges’ was built under R version 4.3.3
5: package ‘GenomeInfoDb’ was built under R version 4.3.2
Warning message:
package ‘ape’ was built under R version 4.3.3
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//filteringOverlappingRepeats.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.sorted"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
Warning message:
package ‘data.table’ was built under R version 4.3.3
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//mergeRepeats.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.mergedRepeats.bed"
[8] "197260855"
[9] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.mergedRepeats.revisedTable"
[10] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.bed"
[11] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.summary"
[12] "no"
Traceback (most recent call last):
  File "/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//backSwapGFF.py", line 14, in <module>
    table = pd.read_csv(input, names = ['scaf', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], sep='\s+', header = None)
  File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 617, in _read
    return parser.read(nrows)
  File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read( # type: ignore[attr-defined]
  File "/home/js3054/.local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 954, saw 10

mv: cannot stat ‘/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered.2’: No such file or directory
Warning messages:
1: package ‘ggplot2’ was built under R version 4.3.3
2: package ‘tidyr’ was built under R version 4.3.2
3: package ‘readr’ was built under R version 4.3.2
4: package ‘dplyr’ was built under R version 4.3.2
5: package ‘stringr’ was built under R version 4.3.2
[1] "/home/js3054/miniconda3/envs/earlgrey/lib/R/bin/exec/R"
[2] "--no-echo"
[3] "--no-restore"
[4] "--file=/home/js3054/miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//makeGff.R"
[5] "--args"
[6] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.bed"
[7] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.rmerge.gff.filtered"
[8] "/projectsp/f_cee53_1/ellison_lab/JaeHakSon/repeats/earlgrey/Z.indianus_4595_EarlGrey/Z.indianus_4595_mergedRepeats/looseMerge/Z.indianus_4595.filteredRepeats.gff"
Error in $<-.data.frame(*tmp*, V8, value = ".") : replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
Execution halted

          )  (
     (   ) )
     ) ( (
   _______)_
.-'---------|  
   ( C|/\/\/\/\/|
'-./\/\/\/\/|
 '_________'
  '-------'
<<< Done! >>>

jaehakson commented 5 months ago

Update on the previous comment.

I figured out the issue and solved it. On line 14 of the file "miniconda3/envs/earlgrey/share/earlgrey-4.2.4-0/scripts/backSwapGFF.py", I changed the separator from "\s+" to "\t", and then I got the complete EarlGrey output.

Could this be a typo in the code?
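To illustrate why the choice of separator can matter, the snippet below splits one tab-delimited record both ways. The record is hypothetical: the space inside its last column is invented for the example, and the thread does not establish which character in the real file produced the tenth field.

```python
import re

# Hypothetical nine-column record; the space after the comma in the last
# column is an invented illustration, not taken from the real file.
columns = [
    "ctg_1", "RepeatMasker", "LTR/Pao", "100", "200", "50", "-", "NA",
    "ID=fam1;LTRgroup=g6, g7",
]
record = "\t".join(columns)


def split_whitespace(line):
    """Split on any run of whitespace, as a '\\s+' separator does."""
    return re.split(r"\s+", line.strip())


def split_tabs(line):
    """Split on tab characters only."""
    return line.rstrip("\n").split("\t")


print(len(split_whitespace(record)))  # 10: the space splits the last column
print(len(split_tabs(record)))        # 9: the last column stays whole
```

A tab separator is only safe if every record in the file is genuinely tab-delimited.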

TobyBaril commented 5 months ago

This is odd - I haven't been able to reproduce this bug on any of the machines here (multiple linux and mac systems). If this works for you, then happy it is a good solution!
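One detail visible in the traceback earlier in the thread is that pandas was imported from /home/js3054/.local/lib/python3.9/site-packages rather than from the earlgrey conda environment. Whether that explains the difference between machines is not established here. A small sketch for checking where a module is loaded from (the helper names are hypothetical):

```python
import importlib.util
import site
from pathlib import Path


def module_origin(name):
    """Return the file a module would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def is_under(path, root):
    """True if `path` lies inside the directory `root`."""
    try:
        Path(path).resolve().relative_to(Path(root).resolve())
        return True
    except ValueError:
        return False


origin = module_origin("pandas")
if origin is not None:
    print(origin, is_under(origin, site.getusersitepackages()))
```

If the second printed value is True, the module comes from the per-user site directory rather than from the active environment.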

TobyBaril / EarlGrey

RepeatModeler run successfully, but did not create *.classified file and *-families.fa, and so stopped the earlGrey. #108
