Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Running round-5 for two weeks #158

Open carolhsb opened 2 years ago

carolhsb commented 2 years ago

I am working with a 1.3 Gb fish genome, using RepeatModeler 2 on Docker.

The run finished after two weeks, but the families.fa and families.stk files weren't generated. When I looked at the log file, round 5 was incomplete (run.out.txt).

I've deleted all round-5 files and started a new run with the -recoverDir flag. Now round 5 alone has been running for more than two weeks (run2.out.txt) >.< I would like to know if there might be an error, because it is taking so long. I know RepeatModeler takes a long time to run, but it seems that something is wrong.

run.out.txt
run2.out.txt
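In case it helps, the recovery run looked roughly like this (a sketch only: the container image, mount path, and database name below are illustrative placeholders, not my exact command):

    # Hypothetical recovery invocation; adjust the image, volume mount,
    # database name, and the RM_* directory to match your own setup.
    docker run -v /data:/data dfam/tetools \
        RepeatModeler -database /data/fish_genome \
                      -recoverDir /data/RM_19.ThuOct210110202021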

Thanks in advance,

Carol

jebrosen commented 2 years ago

Hi Carol,

These timings do look a little unusual to me. I am surprised by how long several of the steps took - for example, in round 2 it took a full 8 hours to refine only 16 models. This could be due to the computing environment - what CPU speed and how much RAM does the machine have, and are you working from a local disk or a network directory? The problem could also be specific to this genome - maybe it has some very long repetitive sequences?

One thing you could try is to compare your runtimes to another genome with published timings, for example, those posted at https://github.com/Dfam-consortium/RepeatModeler/#repeatmodeler-statistics - some of them were measured at only a few hours. If your timings are dramatically slower on other genomes as well, a hardware or software issue or limitation seems more likely.

If the genome assembly you used is publicly available or can be shared with us, we could also try running the first few rounds of RepeatModeler on our computers to see if there is a problem or quirk with the particular assembly.

I hope this information is helpful! Let us know if we can help to troubleshoot this problem further.

carolhsb commented 2 years ago

Hi Jeb,

Thank you for your answer!

I'm running on our private Ubuntu server, which has 32 processors and 276 GB of RAM.

I can share my genome with you. How could I send it?

Cheers,

Carol

carolhsb commented 2 years ago

I sent you an email with the assembly.

Thanks!!!

jebrosen commented 2 years ago

Thanks, Carol! I have started a run on this assembly and I hope to have something to compare within a day or two.

jebrosen commented 2 years ago

Hi again, here is a portion of my log so far compared to yours. Some, but not all, of the steps took much longer to run on your system than on the one I tested. I ran RepeatModeler with -pa 4 on a mostly-unloaded system with 56 2.4GHz CPUs and 128GB RAM.

-Using output directory = /home/ale_assembly/RM_19.ThuOct210110202021
-Storage Throughput = good ( 761.08 MB/s )
+Using output directory = /work/repro1/RM_20.WedDec11820052021
+Storage Throughput = excellent ( 1269.36 MB/s )

@@ -32,13 +29,13 @@
    - Gathering up to 40000000 bp
    - Final Sample Size = 40083536 bp ( 40035409 non ambiguous )
    - Num Contigs Represented = 27
-   - Sequence extraction : 00:03:57 (hh:mm:ss) Elapsed Time
+   - Sequence extraction : 00:02:01 (hh:mm:ss) Elapsed Time
  -- Running RepeatScout on the sequences...
    - RepeatScout: Running build_lmer_table ( l = 14 )..
    - RepeatScout: Running RepeatScout.. : 1438 raw families identified
    - RepeatScout: Running filtering stage.. 1088 families remaining
-   - RepeatScout: 00:12:44 (hh:mm:ss) Elapsed Time
-   - Large Satellite Filtering.. : 3 found in 00:04:08 (hh:mm:ss) Elapsed Time
+   - RepeatScout: 00:12:29 (hh:mm:ss) Elapsed Time
+   - Large Satellite Filtering.. : 3 found in 00:00:05 (hh:mm:ss) Elapsed Time
    - Collecting repeat instances...
   - Removing sequence duplicate artefact ( chr23:30039137-30039231 )
   - Removing sequence duplicate artefact ( chr23:6778773-6778822 )
@@ -100,7 +97,7 @@
   - Removing sequence duplicate artefact ( chr21:10151431-10151507 )
   - Removing sequence duplicate artefact ( chr21:9027455-9027526 )
   - Removing sequence duplicate artefact ( chr8:19681224-19681335 )
-Family Refinement: 303:40:12 (hh:mm:ss) Elapsed Time
+Family Refinement: 01:00:47 (hh:mm:ss) Elapsed Time

 RepeatModeler Round # 2
@@ -108,15 +105,15 @@
 Searching for Repeats
  -- Sampling from the database...
    - Gathering up to 3000000 bp
-   - Sequence extraction : 00:02:13 (hh:mm:ss) Elapsed Time
+   - Sequence extraction : 00:00:09 (hh:mm:ss) Elapsed Time
  -- Running TRFMask on the sequence...
        1609 Tandem Repeats Masked
-   - TRFMask time 00:00:59 (hh:mm:ss) Elapsed Time
+   - TRFMask time 00:01:02 (hh:mm:ss) Elapsed Time
  -- Masking repeats from the previous rounds...
      - Masking 1 - 5 of 75
      - Masking 16 - 30 of 75
      - Masking 41 - 65 of 75
-   - TE Masking time 00:16:15 (hh:mm:ss) Elapsed Time
+   - TE Masking time 00:00:10 (hh:mm:ss) Elapsed Time
  -- Sample Stats:
        Sample Size 3000287 bp
        Num Contigs Represented = 25
@@ -125,19 +122,19 @@
              After Masking: 2130542 bp
              Masked: 28.99 %
  -- Input Database Coverage: 3000287 bp out of 1276434479 bp ( 0.24 % )
-Sampling Time: 00:23:28 (hh:mm:ss) Elapsed Time
+Sampling Time: 00:01:21 (hh:mm:ss) Elapsed Time
 Running all-by-other comparisons...
-Comparison Time: 00:18:39 (hh:mm:ss) Elapsed Time, 6329 HSPs Collected
+Comparison Time: 00:00:22 (hh:mm:ss) Elapsed Time, 6329 HSPs Collected
   - RECON: Running imagespread..
 RECON Elapsed: 00:00:00 (hh:mm:ss) Elapsed Time
   - RECON: Running initial definition of elements ( eledef )..
 RECON Elapsed: 00:00:01 (hh:mm:ss) Elapsed Time
   - RECON: Running re-definition of elements ( eleredef )..
-RECON Elapsed: 00:00:06 (hh:mm:ss) Elapsed Time
+RECON Elapsed: 00:00:01 (hh:mm:ss) Elapsed Time
   - RECON: Running re-definition of edges ( edgeredef )..
-RECON Elapsed: 00:00:14 (hh:mm:ss) Elapsed Time
+RECON Elapsed: 00:00:02 (hh:mm:ss) Elapsed Time
   - RECON: Running family definition ( famdef )..
-RECON Elapsed: 00:00:01 (hh:mm:ss) Elapsed Time
+RECON Elapsed: 00:00:00 (hh:mm:ss) Elapsed Time
   - Obtaining element sequences
 Number of families returned by RECON: 552
 Processing families with greater than 15 elements

Some steps took similar time, some ten times as long, and "Family Refinement" took three hundred times as long. I usually expect the most relevant factors here to be CPU and disk speeds, but the difference in storage throughput does not seem nearly enough to explain the difference in runtime. Do you happen to know if the system is under heavy load, or if any other programs are much slower than expected?

For completeness' sake, could you also share your CPU speed and the type of filesystem you are running on? These can be found with head /proc/cpuinfo (look for "cpu MHz:") and findmnt -T /home/ale_assembly/RM_19.ThuOct210110202021 (the FSTYPE column).
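Spelled out as commands (using the RM_* output directory from your earlier log):

    # CPU clock speed as currently reported by the kernel
    head /proc/cpuinfo

    # Filesystem type backing the RepeatModeler output directory
    findmnt -T /home/ale_assembly/RM_19.ThuOct210110202021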

carolhsb commented 2 years ago

Hi Jeb,

Thank you for the feedback!

That's a significant difference between the Family Refinement runtimes o.O

The system is not under heavy load. Only the people from our lab use this server, and only RepeatModeler 2 is running...

cpu MHz : 1199.438
FSTYPE: btrfs

Cheers,

Carol

jebrosen commented 2 years ago

Thanks for the follow-up! Despite the different CPU speeds and filesystem, I see no particular reason (yet) for the dramatic speed difference. Past reports of slow runs have often turned out to involve a network disk, or badly misconfigured systems -- to the tune of "every program on this computer runs at 10% of normal speed".

I looked again at the log, and it seems like many or all of the steps that ran more slowly involve the search engine (RMBlast). Another possibility I just thought of is that BLAST+ 2.11.0 introduced opt-out usage reporting (http://www.repeatmasker.org/RMBlast.html, "RMBlast Usage Statistics Reporting"). If you are interested in trying it, this can be disabled by setting the environment variable BLAST_USAGE_REPORT=false inside the container - e.g. with docker run -e BLAST_USAGE_REPORT=false ..., or with export BLAST_USAGE_REPORT=false if you are running an interactive container. Given the original log, it would hopefully be apparent within just a few hours whether that improves things.
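For reference, both ways of setting the variable (the trailing "..." stands for whatever image and arguments you normally pass to docker run):

    # When starting the container:
    docker run -e BLAST_USAGE_REPORT=false ...

    # Or inside an already-running interactive container, before
    # launching RepeatModeler:
    export BLAST_USAGE_REPORT=false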

Otherwise, I can make and share a few test files and scripts to help try and narrow down which particular program(s) are performing poorly.

carolhsb commented 2 years ago

Hi Jeb,

Do you think it's worthwhile to start the run again with export BLAST_USAGE_REPORT=false set?

I would be very grateful if you could make and share with me some scripts to try to solve this :)

jebrosen commented 2 years ago

Hi Carol,

I do not think it could hurt to try. According to the logs, round 1 alone took over 12 days, compared to an hour or two on our server.

I started working on a test script to time particular programs more quickly than a whole RepeatModeler run, but I have a few more files left to add to it before it will be ready.

carolhsb commented 2 years ago

Hi Jeb!

Thank you so much for the script!

These were my times (in seconds):

    Step            With BLAST_USAGE_REPORT=   With BLAST_USAGE_REPORT=false
    build_lmer_t                29                          28
    RepeatScout                582                         585
    LargeSat_mak               120                         121
    LargeSat_run               123                         123
    Refiner                   1773                        1772
    TRFMask                     37                          36

There isn't a big difference in runtime between the two settings, but my Refiner runtime is far longer than yours. What could cause that?

Many thanks!

Carol

jebrosen commented 2 years ago

For anyone else interested: I corrected a mistake in my test program. The fixed test showed that BLAST_USAGE_REPORT=false does make a significant difference to the RMBlast runtime. In particular, LargeSat_mak and LargeSat_run each became 120 seconds faster, and Refiner became significantly faster as well.

The exact times suggest a 2-minute timeout in the usage reporting before it gives up; this feature is not essential, so the timeout period should be much shorter if that is indeed the problem. We will continue to investigate the details, but in the meantime BLAST_USAGE_REPORT=false appears to be an adequate workaround for this specific problem.
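If anyone wants to check for the same symptom on their own system, one quick sanity check is to time a single short RMBlast search with and without the variable (a sketch only: sample.fa and genome_db are placeholder names, and your rmblastn arguments will differ). On an affected system the gap should be roughly two minutes:

    # Time one small search with usage reporting enabled (the default)...
    time rmblastn -query sample.fa -db genome_db > /dev/null

    # ...and again with reporting disabled; a ~120 s difference points
    # at the usage-reporting timeout.
    time BLAST_USAGE_REPORT=false rmblastn -query sample.fa -db genome_db > /dev/null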