dib-lab / dammit

just annotate it, dammit!
http://dib-lab.github.io/dammit/

automating dammit over many assemblies, problem with dependencies #34

Closed: johnsolk closed this issue 7 years ago

johnsolk commented 8 years ago

I wrote a script to run dammit separately on many assemblies. The script writes and runs a dammitfile for each assembly. Example contents of a dammitfile:

dammit annotate /mnt/mmetsp/Micromonas_pusilla/SRR1300457/trinity/trinity_out/Trinity.fasta \
--busco-group eukaryota --database-dir /mnt/dammit_databases --n_threads 8

But when the automated script runs the command with subprocess.Popen("sudo bash " + dammitfile), dammit reports that some, but not all, of the dependencies (TransDecoder, LAST, BUSCO) are not installed (error output below; a sketch of the wrapper pattern follows the log). I can run the same command manually and it works fine. Is there something I can do so that the subprocess finds the dependencies?

File written: /mnt/mmetsp/Erythrolobus_madagascarensis/SRR1300444/dammit_dir/SRR1300444.dammit.sh

========================================

dammit! a tool for easy de novo transcriptome annotation

Camille Scott 2015

========================================

submodule: annotate


--- Checking PATH for dependencies

          [ ] TransDecoder

          [ ] LAST

          [x] HMMER

          [x] Infernal

          [x] crb-blast

          [x] BLAST+

          [ ] BUSCO

--- Dependency results

          TransDecoder, LAST, BUSCO missing

Install dependencies to continue; exiting[DependencyHandler:ERROR]
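For reference, a minimal sketch of the wrapper pattern described above. The assembly list, directory layout, and file names are hypothetical; the sudo invocation shown is the one that fails:

```python
import os
import subprocess

# Hypothetical list of assemblies; the real script presumably builds this
# from the MMETSP directory tree.
assemblies = [
    "/mnt/mmetsp/Micromonas_pusilla/SRR1300457/trinity/trinity_out/Trinity.fasta",
    # ...one entry per assembly
]

for fasta in assemblies:
    # Hypothetical layout: .../<sample>/<run>/trinity/trinity_out/Trinity.fasta
    run_dir = os.path.join(fasta.split("/trinity/")[0], "dammit_dir")
    if not os.path.isdir(run_dir):
        os.makedirs(run_dir)

    # Write one dammitfile per assembly, matching the example above.
    dammitfile = os.path.join(run_dir, "run.dammit.sh")
    with open(dammitfile, "w") as fp:
        fp.write("dammit annotate {0} "
                 "--busco-group eukaryota "
                 "--database-dir /mnt/dammit_databases "
                 "--n_threads 8\n".format(fasta))

    # The problematic call: sudo resets $PATH, so the manually installed
    # tools (TransDecoder, LAST, BUSCO) are no longer on it.
    subprocess.Popen("sudo bash " + dammitfile, shell=True).wait()
```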
camillescott commented 8 years ago

Two possibilities. First, you should probably omit the sudo -- dammit doesn't need administrator privileges to run, and sudo resets the $PATH variable. Second, try calling Popen with shell=True, which should carry over your environment variables. BUSCO, TransDecoder, and LAST were installed manually and the exports are in your .bashrc, so without that being sourced (i.e., without shell=True), they aren't being found.
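For example, a minimal sketch of the suggested fix, assuming dammitfile holds the path to one of the generated scripts (the path here is taken from the log above):

```python
import subprocess

# One of the generated scripts; substitute the real path.
dammitfile = "/mnt/mmetsp/Erythrolobus_madagascarensis/SRR1300444/dammit_dir/SRR1300444.dammit.sh"

# No sudo, and shell=True: the child shell inherits this process's
# environment, so the $PATH entries exported for TransDecoder, LAST,
# and BUSCO stay visible.
proc = subprocess.Popen("bash " + dammitfile, shell=True)
proc.wait()
```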

The relevant docs for popen are here: https://docs.python.org/2/library/subprocess.html#popen-constructor

Lemme know if that helps!

johnsolk commented 8 years ago

Removing sudo worked, running now. Thank you! I was originally using shell=True; I just forgot to include that in my question.

If I stop and then restart, will the pipeline pick up from where it left off? It's running, but at first there were a few miscellaneous errors about not finding BUSCO and tblastn results (below). Should I just wait for it to finish to see how it worked?

New DB title:  Trinity.fasta
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 16048 sequences in 0.548864 seconds.
BLAST Database error: CSeqDBAtlas::MapMmap: While mapping file [/mnt/mmetsp/Micromonas_pusilla/SRR1300455/dammit_dir/Trinity.fasta.dammit/Trinity.fasta.busco.results.nin] with 0 bytes allocated, caught exception:
NCBI C++ Exception:
    "/build/buildd/ncbi-blast+-2.2.28/c++/src/corelib/ncbifile.cpp", line 4703: Error: ncbi::CMemoryFileMap::CMemoryFileMap() - To be memory mapped the file must exist: /mnt/mmetsp/Micromonas_pusilla/SRR1300455/dammit_dir/Trinity.fasta.dammit/Trinity.fasta.busco.results.nin

eukaryota
*** Running tBlastN ***
*** Getting coordinates for candidate transcripts! ***
Traceback (most recent call last):
  File "/home/ubuntu/BUSCO_v1.1b1/BUSCO_v1.1b1.py", line 347, in <module>
    f=open('%s_tblastn' % args['abrev'])        #open input file
FileNotFoundError: [Errno 2] No such file or directory: 'Trinity.fasta.busco.results_tblastn'
          [ ] TransDecoder.LongOrfs:Trinity.fasta

CMD: /home/ubuntu/TransDecoder-2.0.1/util/compute_base_probs.pl Trinity.fasta 0 > Trinity.fasta.transdecoder_dir/base_freqs.dat
-first extracting base frequencies, we'll need them later.
CMD: touch Trinity.fasta.transdecoder_dir/base_freqs.dat.ok

- extracting ORFs from transcripts.
-total transcripts to examine: 16048
[16000/16048] = 99.70% done

#################################
### Done preparing long ORFs.  ###
##################################

        Use file: Trinity.fasta.transdecoder_dir/longest_orfs.pep  for Pfam and/or BlastP searches to enable homology-based coding region identification.

        Then, run TransDecoder.Predict for your final coding region predictions.

          [ ] hmmscan:longest_orfs.pep.x.Pfam-A.hmm
camillescott commented 8 years ago

It should resume without issues -- if it doesn't, please let me know :)
