Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Round2 batch fasta does not exist #100

Open niccw opened 3 years ago

niccw commented 3 years ago

This is the same issue reported in #82. We ran RepeatModeler2 on a 1.1 Gb genome four times and got the same error every time. We also ran it using the Docker image and still hit the same error.

The stdout:

RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 3000000 bp
   - Sequence extraction : 00:00:06 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
       1085 Tandem Repeats Masked
   - TRFMask time 00:00:43 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
  - Masking 1 - 5 of 81
  - Masking 16 - 30 of 81
  - Masking 41 - 65 of 81
  - Masking 76 - 81 of 81
   - TE Masking time 00:02:12 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 3033836 bp
       Num Contigs Represented = 73
       Non ambiguous bp:
             Initial: 3033836 bp
             After Masking: 2260649 bp
             Masked: 25.49 %
 -- Input Database Coverage: 3033836 bp out of 1044241066 bp ( 0.29 % )
Sampling Time: 00:03:13 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
        2% completed,  337:58:30 (hh:mm:ss) est. time remaining.

The stderr:

NCBIBlastSearchEngine::search: Error...query (/tmp/slurm-7480606/RM_167323.WedSep91109272020/round-2/batch-16.fa) does not exist!
 at /opt/RepeatModeler/RepeatModeler line 1392.

In the round-2 folder, there is only batch-16-gilist.txt; the corresponding fasta file (batch-16.fa) is missing. The blastdbcmd.log is empty. We checked that the input fasta file is valid, and also tried filtering out short sequences, but still got the same error.

Update: We tried RepeatModeler/1.0.10 on that genome and it finished successfully.
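
For anyone who wants to check whether other batches are in the same state (a gilist file present but no fasta file), here is a minimal Python sketch. It assumes the file naming seen in this thread (batch-N-gilist.txt and batch-N.fa inside a round-N directory); the function name `find_missing_batches` is hypothetical and not part of RepeatModeler.

```python
import re
from pathlib import Path


def find_missing_batches(round_dir):
    """Return sorted batch numbers that have a gilist file but no fasta file.

    Assumes the naming seen in this thread: batch-N-gilist.txt / batch-N.fa.
    """
    round_path = Path(round_dir)
    missing = []
    for gilist in round_path.glob("batch-*-gilist.txt"):
        match = re.fullmatch(r"batch-(\d+)-gilist\.txt", gilist.name)
        if match is None:
            continue  # skip anything that does not follow the assumed pattern
        batch_no = int(match.group(1))
        if not (round_path / f"batch-{batch_no}.fa").is_file():
            missing.append(batch_no)
    return sorted(missing)


# Example (the path is a placeholder for your own RM_* working directory):
# print(find_missing_batches("RM_xxxx/round-2"))
```

This only reports which batches are affected; it does not explain why the fasta file was never written.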

jebrosen commented 3 years ago

Is the genome publicly available? Since the error happened several times for you, hopefully it will be easy to run into the error again and troubleshoot it.

The code around that error message has had barely any changes since 1.0.10, so it is a bit surprising to me that that worked but 2.0 didn't.

AlenaKizenko commented 3 years ago

> Is the genome publicly available? Since the error happened several times for you, hopefully it will be easy to run into the error again and troubleshoot it.
>
> The code around that error message has had barely any changes since 1.0.10, so it is a bit surprising to me that that worked but 2.0 didn't.

This genome is not, but RepeatModeler v2 also failed with the same error on this publicly available genome: ftp://parrot.genomics.cn/gigadb/pub/10.5524/100001_101000/100503/Ominor.genome.fasta.gz

webbchen commented 3 years ago

I've run into the same error, but at the start of round five, with all earlier rounds finishing fine. RepeatModeler and RepeatMasker are installed in a Singularity container and ran successfully in September, but have now failed twice in succession at the same point.

stderr:

FastaDB::compact - Error could not locate file /tmp/annew/Fc_repmodeller_1/RM_143684.FriNov60954152020/round-5/sampleDB-5.fa!
 at /opt/RepeatModeler/RepeatModeler line 837.

stdout:

RepeatModeler Round  5   
=============    
Searching for Repeats    
-- Sampling from the database...   
- Gathering up to 82943514 bp  

It was run on the UK99 Fusarium culmorum genome. It worked perfectly fine on two other small genomes a few weeks ago.

rmhubley commented 3 years ago

> This genome is not, but RepeatModeler v2 also failed on this ftp://parrot.genomics.cn/gigadb/pub/10.5524/100001_101000/100503/Ominor.genome.fasta.gz publicly available genome with the same error.

Do you happen to have the log file generated from this run still? I would like to know the random seed that was used.

rmhubley commented 3 years ago

> I've trodden on the same error but at the start of round five, with all others finishing fine. Repeatmodeler and Repeatmasker are installed in a singularity container and have been run successfully in September but now failed twice in succession at the same point: stderr:
>
> FastaDB::compact - Error could not locate file /tmp/annew/Fc_repmodeller_1/RM_143684.FriNov60954152020/round-5/sampleDB-5.fa!
>  at /opt/RepeatModeler/RepeatModeler line 837.
>
> stdout:
>
> RepeatModeler Round  5
> =============
> Searching for Repeats
> -- Sampling from the database...
> - Gathering up to 82943514 bp
>
> It is run on the UK99 Fusarium culmorum genome. It did work on two other small genomes a few weeks ago perfectly fine.

If you provide a link to the sequence file and the random seed that was used by the program (printed at the top of the log output), I should be able to reproduce the issue.

goodgodric28 commented 3 years ago

I am actually running into exactly this problem with a different genome, also in round 2, batch 16, on a 1.5 Gb genome with ~1100 scaffolds. I reran twice with the same outcome. Is there anything new on this issue?

RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 3000000 bp
   - Sequence extraction : 00:00:11 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
       2717 Tandem Repeats Masked
   - TRFMask time 00:00:57 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
  - Masking 1 - 5 of 76
  - Masking 16 - 30 of 76
  - Masking 41 - 65 of 76
  - Masking 76 - 76 of 76
   - TE Masking time 00:01:06 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 3007447 bp
       Num Contigs Represented = 26
       Non ambiguous bp:
             Initial: 3005162 bp
             After Masking: 1784605 bp
             Masked: 40.62 % 
 -- Input Database Coverage: 3007447 bp out of 1515217215 bp ( 0.20 % )
Sampling Time: 00:02:17 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
        2% completed,  00:18:17 (hh:mm:ss) est. time remaining.
NCBIBlastSearchEngine::search: Error...query (/sluglife/berghia_genome/dovetail_hic_Aug2020/Berghia_Aug2020_purgedups/repeatmodeler/RM_10308.WedJan270944222021/round-2/batch-16.fa) does not exist!
 at /home/sluglife/programs/RepeatModeler/RepeatModeler line 1392.

jebrosen commented 3 years ago

@goodgodric28 this is the latest status:

> If you provide a link to the sequence file and the random seed ( printed at the top of the log output ) that was used by the program I should be able to reproduce the issue.

Is it possible for you to share that run's random seed number and a link to your sequence file for troubleshooting? (if it is public or can be shared with us)

goodgodric28 commented 3 years ago

@jebrosen Sure, but could you provide an email address to which I might send the link? I don't know that my PI will feel comfortable with me sharing in a public space.

jebrosen commented 3 years ago

@goodgodric28 Yes, you can send them to help@dfam.org. Thank you!

jebrosen commented 3 years ago

I have so far been unable to reproduce this error with the genome file and seed from @goodgodric28. We will continue trying with other genomes/seeds as they are reported. For anyone with this issue, the following information may also be helpful to us in trying to track the problem down:


@webbchen I think your error is different. Your genome size and error message are the same as in #118 - can you try my suggestion in https://github.com/Dfam-consortium/RepeatModeler/issues/118#issuecomment-773442974 regarding -genomeSampleSizeMax 27000000 to stop after round 4?

Dichopsis commented 3 years ago

I have the same issue. I ran RepeatModeler2 on a 1 Gb genome (Portunus trituberculatus), which is publicly available: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/017/591/435/GCA_017591435.1_ASM1759143v1/GCA_017591435.1_ASM1759143v1_genomic.fna.gz

NCBIBlastSearchEngine::search: Error...query (/data/scratch/testnico/Decapoda/portunus_trituberculatus/RM_58157.TueApr62325052021/round-2/batch-51.fa) does not exist! at /biolo/RepeatModeler/2.0.1/RepeatModeler line 1392.

The file round-2/blastdbcmd.log is empty. The files round-2/batch-51.fa, round-2/batch-51-gilist.txt, and round-2/batch-51-gilist don't exist.

Same issue with Chionoecetes opilio (2 Gb):

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/584/305/GCA_016584305.1_ASM1658430v1/GCA_016584305.1_ASM1658430v1_genomic.fna.gz

@jebrosen Have you found anything since the last issue?

jebrosen commented 3 years ago

@Dichopsis I have so far not been able to reproduce this or find a reason for it to fail in this way, but I should be able to try those files this week. If you still have it, what was the Random Seed Number for these runs? (located near the beginning of RepeatModeler's main program output)

Dichopsis commented 3 years ago

@jebrosen Random Seed Number:

Chionoecetes opilio: 1617114035
Portunus trituberculatus, run n°1: 1617702205
Portunus trituberculatus, run n°2: 1617717267
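
If you saved the program output to a file, the seed can be pulled out with a few lines of Python. This is only a sketch: the exact wording of the seed line is an assumption (the pattern below accepts "Random Seed Number" or "Random Number Seed" followed by ":" or "="), so adjust it to whatever your log actually prints. `find_random_seed` is a hypothetical helper, not part of RepeatModeler.

```python
import re

# Assumed label wording; check the top of your own log and adjust if needed.
SEED_PATTERN = re.compile(
    r"Random\s+(?:Seed\s+Number|Number\s+Seed)\s*[:=]\s*(\d+)",
    re.IGNORECASE,
)


def find_random_seed(log_text):
    """Return the first seed found in the log text as an int, or None."""
    match = SEED_PATTERN.search(log_text)
    return int(match.group(1)) if match else None


# Example (the file name is a placeholder for your own log):
# with open("repeatmodeler.log") as handle:
#     print(find_random_seed(handle.read()))
```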

jebrosen commented 3 years ago

I was able to reproduce the problem once with the Chionoecetes genome with that seed number, but it did not fail when I ran it again. A different program happened to run out of memory around the same time as the run that failed; I will look further into that and other possible causes of this issue.

percyfal commented 3 years ago

Hi @jebrosen, chiming in on this issue; I am experiencing similar problems, albeit on a very large genome (~20 Gbp). The sheer size unfortunately makes it difficult to debug. I'll rerun with a different seed and see if I can contribute any additional information.

Cheers,

Per

KatharinaHoff commented 2 years ago

I have the same problem with Ammotragus lervia; the genome is publicly available at NCBI (GCA_002201775.1_ALER1.0_genomic.fna). I removed all contigs shorter than 1000 nt from the genome.fa file.

       15% completed,  00:2:20 (hh:mm:ss) est. time remaining.
NCBIBlastSearchEngine::search: Error...query (RM_2069416.TueSep141207412021/round-2/batch-80.fa) does not exist!
 at RepeatModeler/RepeatModeler line 1430.

blastdbcmd.log is empty.

Only batch-80-gilist.txt exists, the other files for batch-80 are missing.

Any news on this, or do I have to continue randomly sampling the genome in order to escape the problem by chance? (Discarding the short sequences was a first step in that direction... it didn't help.)
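
For reference, the length filter mentioned above (dropping contigs shorter than 1000 nt) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the exact tool used in this thread: `filter_fasta_by_length` is a hypothetical helper, it reads an uncompressed fasta file, and it rewraps sequences at 60 characters per line.

```python
def filter_fasta_by_length(in_path, out_path, min_length=1000):
    """Copy records with at least min_length residues; return (kept, dropped)."""
    kept = dropped = 0
    header, chunks = None, []

    def flush(out):
        # Write the record collected so far if it is long enough.
        nonlocal kept, dropped
        if header is None:
            return
        seq = "".join(chunks)
        if len(seq) >= min_length:
            out.write(header + "\n")
            for i in range(0, len(seq), 60):
                out.write(seq[i:i + 60] + "\n")
            kept += 1
        else:
            dropped += 1

    with open(in_path) as src, open(out_path, "w") as out:
        for line in src:
            line = line.strip()
            if line.startswith(">"):
                flush(out)  # finish the previous record
                header, chunks = line, []
            elif line:
                chunks.append(line)
        flush(out)  # finish the last record
    return kept, dropped


# Example (file names are placeholders):
# print(filter_fasta_by_length("genome.fa", "genome.min1000.fa"))
```

As reported in this thread, filtering short sequences did not make the missing-batch error go away, so treat this as a convenience for reproducing the setup rather than a fix.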

KatharinaHoff commented 2 years ago

Btw, when I run RM2 on the same Ammotragus genome on a different HPC with fewer threads, it does not crash at this step. (It then writes cd-hit stdout forever, reaching terabyte size; see issue https://github.com/Dfam-consortium/RepeatModeler/issues/152). The difference between the two jobs is not in software versions; it is mainly the RAM and the number of threads. The missing batch-80 problem occurs on 72 threads, while the endless cd-hit stdout occurs on 8 threads.