epruesse / SINA

SINA - Reference based multiple sequence alignment
https://sina.readthedocs.io
GNU General Public License v3.0

Looping files through SINA? #22

Closed larusnz closed 6 years ago

larusnz commented 6 years ago

I encountered a problem running SINA via command line and was wondering if anyone might be able to suggest a solution?

I am able to run a single fasta file fine with no problems, for example:

sina -i file1.fasta -o file1.output.fasta \
    --meta-fmt csv \
    --ptdb SSURef_NR99_132_SILVA_13_12_17_opt.arb \
    --search --search-db SSURef_NR99_132_SILVA_13_12_17_opt.arb --lca-fields tax_slv

However, I have many files that I’d like to run, so I created a loop as follows:

for i in *.fasta
do
    sina -i $i -o $i.output.fasta \
        --meta-fmt csv \
        --ptdb SSURef_NR99_132_SILVA_13_12_17_opt.arb \
        --search --search-db SSURef_NR99_132_SILVA_13_12_17_opt.arb --lca-fields tax_slv
done

When I do this it seems to progress as expected up until alignment of the 16th sequence, at which point it aborts with the following error message:

Time for alignment phase: 41.081814s
Terminating PT server…

ARB_PT_SERVER: received shutdown message

I tried using --search-all within the loop and that worked fine, but was too slow. I’d like to run the loop with the PT server, so any suggestions would be much appreciated!

epruesse commented 6 years ago

> When I do this it seems to progress as expected up until alignment of the 16th sequence, at which point it aborts with the following error message:

I'm assuming you meant 16th file. The loop you posted should not affect SINA in any way at all. Does that file fail if run directly? Or really only in the loop?

> Time for alignment phase: 41.081814s Terminating PT server… ARB_PT_SERVER: received shutdown message

That's not an error; it should always be the last bit of output. You can in theory start an ARB PT server on your own and point SINA to it using "--pt-port" (and "--search-db-port") to save on startup time with small files. If you don't, SINA will start one itself and terminate it once SINA is finished. That's the output you are seeing: SINA saying "Terminating PT server" and the PT server then saying "received shutdown message".

If the file is empty, that may just mean that nothing in there was sufficiently similar to 16S to even have an alignment. Try without the classifier, that should get you more results.

> I tried using --search-all within the loop and that worked fine, but was too slow. I’d like to run the loop with the PT server, so any suggestions would be much appreciated!

Yes, that's more of a debug feature. SINA uses a k-mer heuristic to find the most similar sequences (top 1000 by default) and then uses the alignment to compute a score on those. With --search-all it checks each input sequence against each reference sequence, which with a big database takes forever. It doesn't gain you much either. You can test this by increasing the number of candidates from the heuristic and watching the results (not) change: --search-kmer-candidates 10000 shouldn't give you much beyond the default, and --search-kmer-candidates 100 should yield only a minor performance benefit.
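The comparison suggested above can be scripted. The following is only a sketch: it reuses the flags and database name from the original post, while the function name compare_candidates and the .kN.fasta output naming are made up for illustration.

```shell
# Database path as posted in this thread; override via the DB variable.
DB="${DB:-SSURef_NR99_132_SILVA_13_12_17_opt.arb}"

# compare_candidates INPUT.fasta N [N...]  (hypothetical helper)
# Runs SINA once per candidate count and writes each result to its own
# output file, e.g. file1.k100.fasta, file1.k1000.fasta, ...
compare_candidates() {
    local input="$1" n
    shift
    for n in "$@"; do
        sina -i "$input" -o "${input%.fasta}.k$n.fasta" \
            --meta-fmt csv \
            --ptdb "$DB" \
            --search --search-db "$DB" --lca-fields tax_slv \
            --search-kmer-candidates "$n" || return 1
    done
}
```

Calling `compare_candidates file1.fasta 100 1000 10000` and then comparing the resulting outputs (for instance with diff) should show whether the classification changes with the number of candidates.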

larusnz commented 6 years ago

No, I was meaning the 16th sequence in the first file (which is why it seemed very strange). I tried running the file directly and it worked fine, it only aborts early when in the loop.

I'll try a few other things and see if I can resolve what is going on - thanks

epruesse commented 6 years ago

Ok. Please close this if you figure out what went wrong. It does sound to me like SINA terminated normally after 16 sequences. Perhaps the command line wasn't exactly the same (forgotten \ at the end of a line in your script or something similar).

larusnz commented 6 years ago

Yes, I see now, you are correct: the loop ran file 10 (containing 16 sequences) before file 1 (containing 600 sequences). However, the PT Server terminated at the end of running the first file, so the loop failed. Is there any way to run multiple files without the Server terminating?
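The unexpected order presumably comes from how the shell expands *.fasta: names are sorted by collation order, not numerically, so file10 is listed before file2 (and, depending on locale and naming, possibly before file1). A minimal sketch for listing files in natural order, assuming a sort that supports -V (GNU coreutils and recent BSDs) and file names without newlines; the function name list_fasta_natural is hypothetical:

```shell
# list_fasta_natural DIR  (hypothetical helper)
# Prints the *.fasta files in DIR in natural order: file1, file2, ..., file10.
list_fasta_natural() {
    for f in "$1"/*.fasta; do
        # Skip the literal pattern if the glob matched nothing.
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -V
}
```

The output can drive a loop, for example `list_fasta_natural . | while read -r f; do ...; done`.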

epruesse commented 6 years ago

> However, the PT Server terminated at the end of running the first file, so the loop failed.

No. The PT server terminated, as did SINA, because they were finished. That was not an error. Put echo SINA exited with code $?; into your loop to have bash print the exit code, it should be 0.
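As a sketch of that suggestion, the loop from the original post could look like the following. The SINA flags and database name are the ones posted above; the function name align_all and the separate output directory are assumptions added here, the latter so that a second run does not pick up earlier *.output.fasta files as input.

```shell
# Database path as posted in this thread; override via the DB variable.
DB="${DB:-SSURef_NR99_132_SILVA_13_12_17_opt.arb}"

# align_all INPUT_DIR OUTPUT_DIR  (hypothetical helper)
# Runs SINA on every *.fasta file in INPUT_DIR and prints its exit code.
# Returns non-zero if any run failed.
align_all() {
    local dir="$1" outdir="$2" f base status failed=0
    mkdir -p "$outdir"
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue   # glob matched nothing
        base="$(basename "$f" .fasta)"
        sina -i "$f" -o "$outdir/$base.output.fasta" \
            --meta-fmt csv \
            --ptdb "$DB" \
            --search --search-db "$DB" --lca-fields tax_slv
        status=$?
        echo "SINA exited with code $status for $f"
        [ "$status" -eq 0 ] || failed=1
    done
    return "$failed"
}
```

Running `align_all . aligned` should print one exit-code line per input file; each SINA run starts and shuts down its own PT server, so the shutdown message appears once per file.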

> Is there any way to run multiple files without the Server terminating?

Quoting myself from above:

> That's not an error; it should always be the last bit of output. You can in theory start an ARB PT server on your own and point SINA to it using "--pt-port" (and "--search-db-port") to save on startup time with small files. If you don't, SINA will start one itself and terminate it once SINA is finished. That's the output you are seeing: SINA saying "Terminating PT server" and the PT server then saying "received shutdown message".

However, it's not really necessary.

Here's a script for running many instances of SINA in parallel on a single large file: https://github.com/epruesse/SINA/blob/master/src/psina

Works just fine for me. If you use it, watch out for memory: run no more than one thread per 16 GB if you use e.g. the SILVA DB, since the PT server is quite memory hungry.
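The memory rule of thumb can be turned into a small calculation. This is a sketch only: the 16 GB figure is the one mentioned above, the function names max_sina_jobs and total_mem_kb are hypothetical, and reading /proc/meminfo assumes Linux.

```shell
# max_sina_jobs MEM_KIB [GB_PER_JOB]  (hypothetical helper)
# Prints how many SINA instances fit into MEM_KIB of RAM at GB_PER_JOB
# (default 16) each. Never prints less than 1, so on a machine with less
# memory than one instance needs, a single job is still reported.
max_sina_jobs() {
    local mem_kb="$1" per_job_gb="${2:-16}" n
    n=$(( mem_kb / (per_job_gb * 1024 * 1024) ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# Linux only: total RAM in KiB as reported by the kernel.
total_mem_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}
```

On Linux, `max_sina_jobs "$(total_mem_kb)"` gives an upper bound for the number of parallel instances; how that number is passed to psina is not covered here.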

larusnz commented 6 years ago

Thanks!