Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

custom de novo library failing - batch failures and "Comparison failed. Retrying with larger minmatch (10)" #124

Closed kcastellano13 closed 1 month ago

kcastellano13 commented 3 years ago

Hi,

I am having trouble getting RepeatMasker to run with a custom de novo library built by combining repeats from multiple programs (RepeatModeler plus a few others). I keep getting batch failures and "WARNING: Comparison failed. Retrying with larger minmatch (10)" (see error logs below). It fails on different batches each time; I have pulled out some of them and nothing looks off to me.

I built the library with the same method for the genome of a sister species and it ran with no problem. This genome is slightly larger (900 Mb) but less fragmented than that of the sister species. I should mention that both have a high repeat content (both ~68% with the RepeatModeler de novo library; the sister species is ~80% when masked with my custom de novo library).

For troubleshooting so far I have:

1) cut the headers down to < 50 characters (I classified with RepeatClassifier, so all headers have a classification)
2) split the genome and run it on ~200 sequences
3) used the flag "-frag 1000" on the full genome and on one of the split files

None of these worked. I was able to run it successfully on one contig, and I was able to run RepeatMasker successfully on this genome with the de novo library from RepeatModeler only, but I need to mask with my custom library.

I attached the full error logs from my most recent run. Any help would be greatly appreciated!

Kate

combinedLib_split0_3234109.out.txt
combinedLib_split0_3234109.err.txt
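For readers combining libraries from several tools, the header checks described in the post can be sketched as a small shell function. This is only an illustration, not part of RepeatMasker: the function name `check_lib_headers` and the `LONG`/`DUP`/`NOCLASS` labels are made up here, the 50-character limit is simply the figure quoted in the post, and the `name#Class/Family` header layout is assumed from RepeatClassifier-style output.

```shell
# Hypothetical helper: report suspicious FASTA headers in a combined library.
#   $1  path to the library FASTA
#   $2  length limit for the sequence name (default 50, the figure from the post)
check_lib_headers() {
    awk -v max="${2:-50}" '
        /^>/ {
            name = substr($1, 2)                       # first token, without ">"
            if (length(name) >= max) print "LONG\t" name
            if (seen[name]++)        print "DUP\t" name      # same name seen before
            if (name !~ /#/)         print "NOCLASS\t" name  # no "#Class" suffix
        }
    ' "$1"
}

# Example call (file name is a placeholder):
#   check_lib_headers combinedLib.fa
```

An empty result means none of these three conditions was found. It does not rule out other problems with the library, and in this thread the headers turned out not to be the cause.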

jebrosen commented 3 years ago

Hi, sorry to hear you are having this issue.

To get some additional information about the error that happened, could you re-run one of those commands listed for the "engine parameters" in a failed run - and keep the full output in a file? For example,

.../cross_match -alignments -gap_init -30 -ins_gap_ext -6 -del_gap_ext -5 -minmatch 10 -minscore 225 -bandwidth 14 -masklevel 101 -matrix .../20p39g.matrix .../...batch-25.masked .../...library.fa.classified >cm_output.txt 2>&1

The cm_output.txt file should contain a more detailed error, which may solve the problem or at least point in the right direction.
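The `>cm_output.txt 2>&1` at the end of the command above sends both standard output and standard error to the same file. A minimal sketch of the same idea as a reusable wrapper follows; the name `run_and_capture` is hypothetical, and the wrapper just runs whatever command it is given, so the elided cross_match arguments would still have to be copied from the "engine parameters" in your own log.

```shell
# Hypothetical wrapper: run a command, keep stdout and stderr together in one
# file, and report the exit status so a failure is not silently lost.
#   $1      file that receives the combined output
#   $2...   the command to run, with its arguments
run_and_capture() {
    rc_out_file=$1
    shift
    rc_status=0
    "$@" >"$rc_out_file" 2>&1 || rc_status=$?
    echo "exit status: $rc_status (full output in $rc_out_file)"
    return "$rc_status"
}

# Example call (arguments are placeholders, copy the real ones from your log):
#   run_and_capture cm_output.txt cross_match <engine parameters from the log>
```

Recording the exit status alongside the output helps distinguish a crash from a run that finished but printed warnings.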

This problem might be specific to cross_match; another search engine may work on this particular file if that is an option for you.

kcastellano13 commented 3 years ago

Hi Jeb,

Thanks for responding so quickly! I attached the cm_output.txt file for you to see. It does look like an issue with cross_match, which is reporting a score discrepancy for some reason. I tried rmblast as the search engine on one of my split files, with and without the -frag 1000 flag, and both runs completed successfully, so I think that was the problem and I should be okay moving forward.

Thank you again!

cm_output.txt
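For anyone wanting to repeat this comparison, the sketch below prints two RepeatMasker command lines that differ only in the search engine. The file names `combinedLib.fa` and `split0.fa`, the output directory names, and the helper `rm_command` are placeholders invented for illustration. `-engine`, `-lib`, and `-dir` are standard RepeatMasker options, but check `RepeatMasker -help` for the engine names your installation accepts.

```shell
# Placeholder file names; substitute your own library and split FASTA.
LIB=combinedLib.fa
SEQ=split0.fa

# Hypothetical helper: print the command for a given engine, writing results
# to a separate directory per engine so the two runs can be compared.
rm_command() {
    printf 'RepeatMasker -engine %s -lib %s -dir out_%s %s\n' \
        "$1" "$LIB" "$1" "$SEQ"
}

rm_command crossmatch
rm_command rmblast
```

The function only prints the commands so they can be reviewed or pasted into a job script; add `-frag 1000` or other options by hand if you want to reproduce the runs described above.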

jebrosen commented 3 years ago

That is strange. I suspect either a bug in cross_match, or a misleading error message for input data it can't accept for some reason. Unfortunately, it looks like the issue might be pretty deep in one of the underlying algorithms. If you are willing and able to provide us with the batch-25.masked and library.fa.classified files (attached, or via email to help@repeatmasker.org), we and/or the cross_match maintainers may be able to find or troubleshoot the problem more specifically.

Either way, I am glad to hear that RMBlast worked!