I tried to put a cactus ancestor through Red RepeatMasking via the cactus preprocessor and it crashed right away -- something I haven't seen on any real or test data so far.
After some trial and error, it looks like Red will crash if the input contains a contig that is either:
- tiny. I'm not sure of the exact limit, but 20kb seems to work on tests (though on real data it can mask much smaller sequences, at least sometimes).
- really low-information. For example, a contig that is just N's will crash Red no matter how long it is.
This PR adds a prefilter to catch these cases (it would eventually be nice to get into Red's code to fix it properly). Contigs that are smaller than 20kb, or that consist of more than 98% a single base, are filtered out before running Red and added back in afterwards. In the second case, the giant monomer runs in the contig are softmasked before it is added back.
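
For reviewers, here is a rough sketch of the filtering rule described above. It is not the code in this PR: the function names, the `{name: sequence}` input, and the `MIN_RUN` cutoff for what counts as a "giant" monomer run are all made up for illustration, and "20kb" is taken to mean 20,000 bases. It also assumes short contigs are added back unchanged.

```python
from collections import Counter
from itertools import groupby

MIN_LENGTH = 20_000        # "20kb" from the description, assumed to mean 20,000 bases
MAX_BASE_FRACTION = 0.98   # "more than 98% a single base"
MIN_RUN = 100              # hypothetical cutoff for a "giant" monomer run


def is_low_information(seq):
    """True if a single base (case-insensitive, N included) exceeds MAX_BASE_FRACTION."""
    if not seq:
        return True
    top_count = Counter(seq.upper()).most_common(1)[0][1]
    return top_count / len(seq) > MAX_BASE_FRACTION


def softmask_monomer_runs(seq, min_run=MIN_RUN):
    """Lowercase every run of at least min_run identical bases."""
    out = []
    for _, group in groupby(seq, key=str.upper):
        run = "".join(group)
        out.append(run.lower() if len(run) >= min_run else run)
    return "".join(out)


def partition_contigs(contigs):
    """Split {name: sequence} into (contigs to send to Red, contigs held back)."""
    for_red, held_back = {}, {}
    for name, seq in contigs.items():
        if len(seq) < MIN_LENGTH:
            # Too short for Red: held back and added back unchanged (assumption).
            held_back[name] = seq
        elif is_low_information(seq):
            # Mostly one base: softmask the long runs before adding back.
            held_back[name] = softmask_monomer_runs(seq)
        else:
            for_red[name] = seq
    return for_red, held_back
```

After Red finishes, the held-back contigs would be merged back into its masked output so that no contig is lost from the final FASTA.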