bioconda / bioconda-recipes

Conda recipes for the bioconda channel.
https://bioconda.github.io
MIT License

MAKER and RepeatMasker don't work together because famdb.py requires python 3 #25559

Closed by shjenkins94 3 years ago

shjenkins94 commented 3 years ago

The newest version of RepeatMasker (4.1.1) uses h5py and requires Python 3, but this is not listed in its dependencies.

Since MAKER requires Python 2 and doesn't specify which version of RepeatMasker to install, it downloads the newest version. The main conflict seems to come from the famdb.py script that was added to RepeatMasker. It has the shebang #!/usr/bin/env python3, and running it with Python 2 fails with syntax errors.

nathanweeks commented 3 years ago

For MAKER, I suspect python 2.7.15 is a transitive dependency via augustus 3.3.3_pl526h0faeac2_5 -> biopython 1.76_py27h516909a_0, in which case bumping the maker version to get a more-recent augustus build (with a more-recent Python 3.x biopython build) should do the trick.

bernt-matthias commented 3 years ago

I guess

Species "anopheles" is not known to RepeatMasker.  There may
not be any TE families defined in the libraries for this
species/clade or there may be an error in the spelling.
Please check your entry against the NCBI Taxonomy database
and/or try using a broader clade or related species instead.
The full list of species/clades defined in the library may be
obtained using the famdb.py script.

is related?

shjenkins94 commented 3 years ago

I think the dependency that requires python 2 is mir-prefer, but it looks like you figured that out in the pull request.

shjenkins94 commented 3 years ago

@bernt-matthias Are you perhaps using RepBase? I was wondering since I'm getting a slightly different issue with DFAM.

I tried using the newer code but kept getting "ERROR: Could not determine if RepBase is installed", so I started poking around in MAKER's code. It looks like MAKER uses a Perl module, GI.pm, that gets installed in $CONDA_PREFIX/lib/GI.pm. In it I found:

#--make sure repbase is installed
   if($CTL_OPT{model_org} and !defined($ENV{'LIBDIR'})){
       my $exe = Cwd::abs_path($CTL_OPT{RepeatMasker});
       my ($lib) = $exe =~ /(.*\/)RepeatMasker$/;
       die "ERROR: Could not determine if RepBase is installed\n" if(! $lib);

       $lib .= "../share/RepeatMasker/Libraries/RepeatMaskerLib.embl";
       die "ERROR: Could not determine if RepBase is installed\n" if(! -f $lib);

       open(my $IN, "< $lib");
       my $rb_flag;
       for(my $i = 0; $i < 20; $i++){
           my $line = <$IN>;
           if($line =~ /RELEASE \d+(\-min)?\;/){
               $rb_flag = ($1 && $1 eq '-min') ? 0 : 1;
               last;
           }
       }
       close($IN);

       if(! $rb_flag){
           warn "WARNING: RepBase is not installed for RepeatMasker. This limits\n".
               "RepeatMasker's functionality and makes the model_org option in the\n".
               "control files virtually meaningless. MAKER will now reconfigure\n".
               "for simple repeat masking only.\n";
           $CTL_OPT{model_org} = 'simple';
       }
   }

So it seems like MAKER is failing because RepeatMasker doesn't create RepeatMaskerLib.embl anymore and this part kills MAKER if it doesn't exist.

       $lib .= "../share/RepeatMasker/Libraries/RepeatMaskerLib.embl";
       die "ERROR: Could not determine if RepBase is installed\n" if(! -f $lib);
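
For readers less familiar with Perl, the header scan in the block quoted above can be restated as a short Python sketch. This is only an illustration of the quoted logic, not code from MAKER; the function name `repbase_flag` and the sample header lines are made up for the example.

```python
import re

# Same pattern as the Perl check: "RELEASE <digits>" with an optional "-min" suffix.
RELEASE_RE = re.compile(r"RELEASE \d+(-min)?;")


def repbase_flag(lines, max_lines=20):
    """Return True for a full release, False for a "-min" library,
    and None if no RELEASE line appears in the first max_lines lines."""
    for line in lines[:max_lines]:
        match = RELEASE_RE.search(line)
        if match:
            return match.group(1) != "-min"
    return None


if __name__ == "__main__":
    # Hypothetical header lines, only to exercise the function.
    print(repbase_flag(["CC   RELEASE 20181026;"]))      # full library
    print(repbase_flag(["CC   RELEASE 20181026-min;"]))  # minimal library
    print(repbase_flag(["no release line here"]))        # nothing found
```

In the Perl original, both the "-min" case and the not-found case leave $rb_flag false, so MAKER falls back to model_org = simple. A missing file never reaches this scan because the die fires first.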

bernt-matthias commented 3 years ago

Thanks for digging into this. This seems to be the case: https://github.com/galaxyproject/tools-iuc/blob/db75a8489a1f61ea30abe9b91f6febac8b34204f/tools/maker/maker.xml#L394

nathanweeks commented 3 years ago

The repeatmasker==4.1.1 bioconda package generates a RepeatMaskerLib.h5 symlink:

$ singularity exec quay.io_biocontainers_maker_2.31.11--pl526h61907ee_0-2020-12-02-c14814e811b3.sif ls -l /usr/local/share/RepeatMasker/Libraries/
total 2166530
-rwxrwxr-x 1 root root      25283 Nov 23 14:26 Artefacts.embl
-rw-rw-r-- 1 root root 2011886880 Nov 23 14:26 Dfam.h5
-rw-rw-r-- 1 root root        214 Nov 23 14:26 README.meta
-rwxrwxr-x 1 root root   22475384 Nov 23 14:26 RepeatAnnotationData.pm
-rw-rw-r-- 1 root root   10955446 Nov 23 14:27 RepeatMasker.lib
lrwxrwxrwx 1 root root          7 Dec  2 17:31 RepeatMaskerLib.h5 -> Dfam.h5
-rw-rw-r-- 1 root root     674815 Nov 23 14:27 RepeatMasker.lib.nhr
-rw-rw-r-- 1 root root      83808 Dec  2 16:52 RepeatMasker.lib.nin
-rw-rw-r-- 1 root root    3095721 Nov 23 14:27 RepeatMasker.lib.nsq
-rw-rw-r-- 1 root root   17979984 Nov 23 14:26 RepeatPeps.lib
-rw-rw-r-- 1 root root    2931407 Nov 23 14:27 RepeatPeps.lib.phr
-rw-rw-r-- 1 root root     144448 Dec  2 16:52 RepeatPeps.lib.pin
-rw-rw-r-- 1 root root   16168295 Nov 23 14:27 RepeatPeps.lib.psq
-rw-rw-r-- 1 root root       5550 Nov 23 14:26 RepeatPeps.readme
-rw-rw-r-- 1 root root   18752245 Nov 23 14:26 RMRBMeta.embl
-rw-rw-r-- 1 root root  113343436 Nov 23 14:26 taxonomy.dat

MAKER is hard-coded to check for RepeatMaskerLib.embl unless the LIBDIR environment variable is set (this was previously REPEATMASKER_LIB_DIR in both the bioconda maker & repeatmasker <= 4.1; it was changed to LIBDIR in this commit to align with upstream RepeatMasker 4.1.x). So currently, the LIBDIR environment variable has to be set, although I guess an alternative would be to update recipes/maker/repeatmasker_check.patch to simply remove the code block.
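
To make the condition concrete, here is a small Python sketch of when the check is triggered, based on the `if` condition in the GI.pm block quoted earlier in the thread. The function names are hypothetical, and the LIBDIR value assumes the Libraries directory sits under `<prefix>/share/RepeatMasker/`, mirroring the container listing above; adjust it for your install.

```python
import os


def repbase_check_runs(model_org, env):
    """Mirror of the Perl guard: the RepeatMaskerLib.embl check only runs
    when model_org is set and LIBDIR is absent from the environment."""
    return bool(model_org) and "LIBDIR" not in env


def env_with_libdir(prefix, base_env=None):
    """Return a copy of the environment with LIBDIR pointing at the
    RepeatMasker Libraries directory (the path layout is an assumption)."""
    env = dict(os.environ if base_env is None else base_env)
    env["LIBDIR"] = os.path.join(prefix, "share", "RepeatMasker", "Libraries")
    return env


if __name__ == "__main__":
    env = env_with_libdir("/usr/local", base_env={})
    print(env["LIBDIR"])
    print(repbase_check_runs("all", {}))   # check would run
    print(repbase_check_runs("all", env))  # check is skipped
```

The returned dictionary could be passed as `env=` to `subprocess.run` when launching MAKER, which keeps the setting local to that one process.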

shjenkins94 commented 3 years ago

I tried replacing

       $lib .= "../share/RepeatMasker/Libraries/RepeatMaskerLib.embl";
       die "ERROR: Could not determine if RepBase is installed\n" if(! -f $lib);

with

       $lib .= "Libraries/RepeatMaskerLib.h5";
       die "ERROR: Could not determine if RepBase is installed\n" if(! -f $lib);

This seems to work, but I'm not sure how robust the fix is.

I guess another problem is checking for RepBase in the first place. Since RepBase isn't free anymore, there are probably a lot of people like me who use Dfam instead.

nathanweeks commented 3 years ago

However, then MAKER would attempt to read the RepeatMaskerLib.h5 file as if it were a text file (rather than an HDF5 file) to get some version information:

       open(my $IN, "< $lib");
       my $rb_flag;
       for(my $i = 0; $i < 20; $i++){
           my $line = <$IN>;
           if($line =~ /RELEASE \d+(\-min)?\;/){
               $rb_flag = ($1 && $1 eq '-min') ? 0 : 1;
               last;
           }
       }

If it's undesirable to assume that LIBDIR is set, then I suppose the entire code block should be removed or deactivated.
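
If the check were kept rather than removed, it would at least need to tell an HDF5 library from a text one before reading lines from it. A minimal Python sketch of that test follows. It assumes the HDF5 signature sits at offset 0 (the format also allows it after a user block, which this ignores), and the function name is made up for the example.

```python
import os
import tempfile

# The 8-byte signature that starts an HDF5 file without a user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at path begins with the HDF5 signature."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


if __name__ == "__main__":
    # Two throwaway files stand in for Dfam.h5 and a text .embl library.
    with tempfile.TemporaryDirectory() as tmp:
        h5_path = os.path.join(tmp, "fake.h5")
        embl_path = os.path.join(tmp, "fake.embl")
        with open(h5_path, "wb") as out:
            out.write(HDF5_SIGNATURE + b"\x00" * 8)
        with open(embl_path, "wb") as out:
            out.write(b"CC   RELEASE 20181026;\n")
        print(looks_like_hdf5(h5_path), looks_like_hdf5(embl_path))
```

Checking the leading bytes avoids depending on h5py just to decide which kind of library is installed.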

bernt-matthias commented 3 years ago

Somehow I do not like this symlink. It makes the user assume that RepeatMaskerLib is used, when it is actually Dfam. It is also really bad from the viewpoint of reproducibility.

I guess for the Galaxy tool we should switch to Dfam... let's just drop non-free components (nobody needs them if there are suitable free alternatives)... what do you think @bgruening?

The Galaxy tool currently uses 2.31.10 and does not work at the moment. If I get it right, the container broke due to the repeatmasker update. I'm wondering if we could fix this first, e.g. by pinning the repeatmasker requirement. Then we would have a working 2.31.10 container again. Alternatively, we could create a folder for 2.31.10. I'm asking because I can imagine that we won't update to the most recent MAKER version anytime soon.

bgruening commented 3 years ago

> I guess for the Galaxy tool we should switch to Dfam... let's just drop non-free components (nobody needs them if there are suitable free alternatives)... what do you think @bgruening?

Yes, I think so as well. :( ping @abretaud

> The Galaxy tool currently uses 2.31.10 and does not work at the moment. If I get it right, the container broke due to the repeatmasker update. I'm wondering if we could fix this first, e.g. by pinning the repeatmasker requirement. Then we would have a working 2.31.10 container again. Alternatively, we could create a folder for 2.31.10. I'm asking because I can imagine that we won't update to the most recent MAKER version anytime soon.

I'm OK with either way, whatever works for you. But I guess we should add a test as well, if possible, so that the container fails immediately.

shjenkins94 commented 3 years ago

One possible workaround for the RepeatMasker problem is to construct a repeat library and specify rm_lib instead of model_org. If model_org isn't defined in the control options then MAKER doesn't check if RepBase is installed.

And looking at the code MAKER uses to run RepeatMasker,

my $command  = "cd $tmp; $RepeatMasker";

   if ($rmlib) {
      $command .= " $q_file -dir $dir -pa $cpus -lib $rmlib";
   }
   elsif($species eq 'simple'){
       my $lib = "$tmp/simple.lib";
       if(!-f $lib){
           (my $tFH, $t_file) = tempfile(DIR => $tmp);
           print $tFH ">(N)n#Dummy_repeat \@root  [S:25]\nnnnnnnnnnnnnnnnnnn\n";
           close($tFH);
           File::Copy::move($t_file, $lib);
       }
       $command .= " $q_file -dir $dir -pa $cpus -lib $lib";
   }
   else {
      $command .= " $q_file -species $species -dir $dir -pa $cpus";
   }
   $command .= " -nolow" if defined($no_low);

So the order of precedence is: run with a custom RepeatMasker library if one is given, otherwise run with a simple repeat library if model_org is "simple", and otherwise run with the species given by model_org.
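
The branch order in the Perl snippet above can be sketched in Python as follows. This is a hedged restatement for illustration, not MAKER's code: the function name and the `simple_lib` default are invented, and `no_low` is treated as a plain boolean where the Perl tests `defined($no_low)`.

```python
def repeatmasker_args(q_file, out_dir, cpus, species, rmlib=None,
                      simple_lib="simple.lib", no_low=False):
    """Build the RepeatMasker argument list using the same branch order
    as the Perl snippet: custom library, then 'simple', then species."""
    if rmlib:
        # A custom library wins regardless of the species setting.
        args = [q_file, "-dir", out_dir, "-pa", str(cpus), "-lib", rmlib]
    elif species == "simple":
        # MAKER writes a dummy one-entry library for simple masking.
        args = [q_file, "-dir", out_dir, "-pa", str(cpus), "-lib", simple_lib]
    else:
        args = [q_file, "-species", species, "-dir", out_dir, "-pa", str(cpus)]
    if no_low:
        args.append("-nolow")
    return args


if __name__ == "__main__":
    print(repeatmasker_args("q.fa", "out", 4, "all", rmlib="my_repeats.fa"))
    print(repeatmasker_args("q.fa", "out", 4, "simple"))
    print(repeatmasker_args("q.fa", "out", 4, "anopheles", no_low=True))
```

Only the last branch passes -species, which is the option that produces the "is not known to RepeatMasker" message when the name is missing from the installed library.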

It seems like if both model_org and rm_lib are specified then RepeatMasker will run twice for each contig.

mengguoqingup commented 2 years ago

Hi, I'm using Python 3.6.10, MAKER 3.01.03, and RepeatMasker 4.1.2-p1, but I have the same problem. Do you know the reason? Thank you!

Species "all" is not known to RepeatMasker.  There may
not be any TE families defined in the libraries for this
species/clade or there may be an error in the spelling.
Please check your entry against the NCBI Taxonomy database
and/or try using a broader clade or related species instead.
The full list of species/clades defined in the library may be
obtained using the famdb.py script.

ERROR: RepeatMasker failed
--> rank=NA, hostname=localhost.localdomain
ERROR: Failed while doing repeat masking
ERROR: Chunk failed at level:0, tier_type:1
FAILED CONTIG:tig00000001_pilon

ERROR: Chunk failed at level:2, tier_type:0
FAILED CONTIG:tig00000001_pilon

examining contents of the fasta file and run log