Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

sort failed for images/spread1 #228

Open tanpham15 opened 7 months ago

tanpham15 commented 7 months ago

Hi,

On my first run I used the command below with a 100-hour resource allocation, and it could not finish:

RepeatModeler -database polished_dna_B_hypochlora \
    -engine ncbi

For the second run I used this command:

RepeatModeler -database polished_dna_B_hypochlora \
    -engine ncbi -srand 81000000

Here is the error report after 5 rounds. Please see the detailed report in the attached file log_dna_repeat_modeler.sh.txt.

Could you please let me know how I can correct the command? As I understand it, if I use "-srand", the second run will start from this value. Is that correct? If not, can we re-run a previous task?

Thank you very much for your time. Best regards

RepeatModeler Round # 1
========================
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40020567 bp ( 40020567 non ambiguous )
   - Num Contigs Represented = 36
................................................
   - RepeatScout: Running RepeatScout..
Program duration is 690.0 sec = 11.5 min = 0.2 hr
NOTE: RepeatScout did not return any models.

RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 3000000 bp
 -- Running TRFMask on the sequence...
 -- Sample Stats:
       Sample Size 3031569 bp
       Num Contigs Represented = 18
................................................
Number of families returned by RECON: 1059
Processing families with greater than 15 elements

RepeatModeler Round # 3
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 9000000 bp
 -- Running TRFMask on the sequence...
 -- Sample Stats:
       Sample Size 9005018 bp
................................................
Number of families returned by RECON: 3712
Processing families with greater than 15 elements

RepeatModeler Round # 4
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 27000000 bp
 -- Running TRFMask on the sequence...
 -- Sample Stats:
       Sample Size 27023928 bp
       Num Contigs Represented = 31
................................................
Number of families returned by RECON: 12893
Processing families with greater than 15 elements

RepeatModeler Round # 5
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 81000000 bp
 -- Running TRFMask on the sequence...
 -- Sample Stats:
       Sample Size 81011127 bp
......................................................
  - RECON: Running imagespread..
RECON Elapsed: 00:00:26 (hh:mm:ss) Elapsed Time
sh: line 1: 256578 Killed                  sort -k 3,3 -k 4n,4n -k 5nr,5nr images/spread1 >> images/images_sorted
sort failed for images/spread1.

rmhubley commented 7 months ago

The prefix "sh: line 1: 256578 Killed" is not a bug but a memory issue on your system. I suspect you didn't have enough memory to perform this Unix sort command. It's a bit strange, though, since it looks like the genome you are providing is really quite low in repeats (RepeatScout didn't return any families in Round 1). In any case, you are correct: you can restart this from Round 4 using the -recoverDir option. You must provide it with the name of the temporary directory that was created in your first run. It is usually printed at the start of the log output and looks similar to "RM_3887321.WedOct41005232023".
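
As a rough sketch of the restart described above, the helper below pulls the first RM_* directory name out of a saved log and prints (without running) a command that passes it to -recoverDir. The function name recover_cmd is hypothetical, and the pattern it searches for is an assumption based only on the example name "RM_3887321.WedOct41005232023"; check the printed directory against what is actually on disk before using it.

```shell
# Print (do not run) a restart command for the RM_* directory named in a log.
# ASSUMPTION: the directory name looks like RM_<digits>.<letters and digits>,
# as in the example above. recover_cmd is an illustrative helper, not part of
# RepeatModeler.
recover_cmd() {
    log="$1"        # path to the saved RepeatModeler log
    database="$2"   # database name used in the original run

    # Take the first token in the log that matches the assumed pattern.
    dir=$(grep -oE 'RM_[0-9]+\.[A-Za-z0-9]+' "$log" | head -n 1)

    if [ -z "$dir" ]; then
        echo "no RM_* directory name found in $log" >&2
        return 1
    fi

    echo "RepeatModeler -database $database -engine ncbi -recoverDir $dir"
}
```

For the run in this thread it could be called as: recover_cmd log_dna_repeat_modeler.sh.txt polished_dna_B_hypochlora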

tanpham15 commented 7 months ago

Thank you very much Hubley for your explanation.

Do you mean RAM? I requested only 2 GB for the last run because it had used only 1.7 GB.

As you can see, Round 4 already finished. If I provide the temporary directory, can I resume from this command: sort -k 3,3 -k 4n,4n -k 5nr,5nr images/spread1 >> images/images_sorted?

Round 4 took more than 80 hours, so how can I avoid repeating it?

Best regards,
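
For reference, the sort step from the error message can be sketched with an explicit memory cap. The wrapper below uses the same keys as the failing command and adds the GNU sort options -S (buffer size) and -T (directory for temporary spill files). The function name sort_spread and the example values are hypothetical. Nothing in this thread confirms that running the sort by hand lets RepeatModeler continue, so treat it only as a way to test whether the sort fits within a given memory limit.

```shell
# Sort a whitespace-separated file with the same keys as the failing command,
# but with an explicit buffer size and temp directory (GNU sort options).
# ASSUMPTION: sort_spread is an illustrative wrapper; RepeatModeler itself
# calls sort without these options.
sort_spread() {
    input="$1"     # e.g. images/spread1
    output="$2"    # e.g. images/images_sorted (appended to, as in the original)
    buffer="$3"    # e.g. 1G; pick a value below the job's memory limit
    tmpdir="$4"    # directory with enough free disk space for spill files

    # Keys: field 3 as text, then field 4 numeric ascending,
    # then field 5 numeric descending.
    sort -S "$buffer" -T "$tmpdir" -k 3,3 -k 4n,4n -k 5nr,5nr "$input" >> "$output"
}
```

Example call with made-up values: sort_spread images/spread1 images/images_sorted 1G /tmp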