biocore / qiime

Official QIIME 1 software repository. QIIME 2 (https://qiime2.org) has succeeded QIIME 1 as of January 2018.
GNU General Public License v2.0

RTAX assign taxonomy needs expandable memory #949

Open nbokulich opened 11 years ago

nbokulich commented 11 years ago

RTAX currently appears to hit a memory cap at ~4 GB, and when it does the QIIME wrapper runs indefinitely without printing an error message.

e.g., the following command runs indefinitely without error output:

assign_taxonomy.py -t /home/nbokulic/ref_seq_dbs/Silva_108/taxa_mapping/Silva_RDP_taxa_mapping_species.txt -i /home/nbokulic/short-read-tax-assignment/data/qiime-mock-community/L18S-1/rep_set.fna -o /home/nbokulic/laura/multiple_assign_taxonomy/L18S-1/rtax/ -m rtax -r /home/nbokulic/ref_seq_dbs/Silva_108/rep_set/Silva_108_rep_set.fna --read_1_seqs_fp /home/nbokulic/laura/sl/r3n0p75/seqs.fna --amplicon_id_regex '(\S+)\s(\S+?)\/' --header_id_regex '\S+\s+(\S+?)\/'

However, running the equivalent directly in rtax:

rtax -t /home/nbokulic/ref_seq_dbs/Silva_108/taxa_mapping/Silva_RDP_taxa_mapping_species.txt -a /home/nbokulic/short-read-tax-assignment/data/qiime-mock-community/L18S-1/rep_set.fna -o /home/nbokulic/laura/multiple_assign_taxonomy/L18S-1/rtax/ -r /home/nbokulic/ref_seq_dbs/Silva_108/rep_set/Silva_108_rep_set.fna -i '\S+\s+(\S+?)\/'

outputs the following:

/share/apps/qiime-1.6.0/bin/usearch --quiet --global --iddef 2 --query 2 --db /home/nbokulic/ref_seq_dbs/silva_18S_104/rep_set/silva_104_rep_set.fasta --uc /tmp/78459.1.all.q/7jr8NYguNC/a --id 0.99 --maxaccepts 1000 --maxrejects 128 --nowordcountreject

Out of memory mymalloc(140204), curr 4.15e+09 bytes

/share/apps/qiime-1.6.0/bin/usearch --quiet --global --iddef 2 --query 2 --db /home/nbokulic/ref_seq_dbs/silva_18S_104/rep_set/silva_104_rep_set.fasta --uc /tmp/78459.1.all.q/7jr8NYguNC/a --id 0.99 --maxaccepts 1000 --maxrejects 128 --nowordcountreject

---Fatal error--- Out of memory, mymalloc(140204), curr 4.15e+09 bytes

When I allot more memory to the job on my system, this same error message is output.
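As a rough sanity check on the figure in that message, the Python sketch below parses the two error lines shown above and compares the reported usage with 2**32 bytes, the size of a 32-bit address space. That a 32-bit limit is what is being hit is only an assumption here, and `parse_oom_line` and `OOM_PATTERN` are hypothetical names for illustration, not part of RTAX or usearch.

```python
import re

# 2**32 bytes: the most a 32-bit process can address. This is an upper bound;
# the amount actually usable by a process is typically somewhat lower.
ADDRESS_SPACE_32BIT = 2 ** 32

# Matches both variants seen above, with and without the comma after "memory":
#   "Out of memory mymalloc(140204), curr 4.15e+09 bytes"
#   "---Fatal error--- Out of memory, mymalloc(140204), curr 4.15e+09 bytes"
OOM_PATTERN = re.compile(
    r"Out of memory,? mymalloc\((\d+)\), curr ([0-9.]+(?:e[+-]?\d+)?) bytes",
    re.IGNORECASE,
)


def parse_oom_line(line):
    """Return (requested_bytes, current_bytes), or None if the line is not an OOM report."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


line = "---Fatal error--- Out of memory, mymalloc(140204), curr 4.15e+09 bytes"
requested, current = parse_oom_line(line)
fraction = current / ADDRESS_SPACE_32BIT
print(f"{current:.3g} bytes in use is {fraction:.1%} of a 32-bit address space")
```

If the reported usage sits just under that ceiling, giving the job a larger memory allocation would not be expected to change the outcome, which matches what I see.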

gregcaporaso commented 11 years ago

@davidsoergel, could you check into this when you're working on the other rtax-related issues next week?

davidsoergel commented 11 years ago

The 4 GB limit is intrinsic to the 32-bit version of usearch. The main thing driving memory usage is the size of the reference database, not of the query sets. The only alternatives I know of are:

a) pay big bucks for the 64-bit version;

b) somehow limit the size of the reference database (e.g., by using a 97%-clustered reference set instead of 99%);

c) split up the reference database into shards, classify against each one individually, and collate the results;

d) use a different version of usearch that may require less memory (for instance, with the default parameters, usearch 4.x requires less memory than usearch 5.x);

e) tune usearch 5 with command-line options to use less memory. This may be possible, but I haven't explored it in any detail. Tweaking the usearch command line may require hacking the rtax scripts, though, which means that QIIME developers can't reasonably provide support for your setup after that.
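To make option (c) a little more concrete, here is a minimal Python sketch of just the sharding step: it distributes the records of a reference FASTA file round-robin across a fixed number of smaller files. The names `read_fasta` and `shard_fasta` and the `shard_000.fna` file naming are hypothetical. The taxonomy mapping file would presumably need to cover each shard as well, and how per-shard classifications should be collated depends on how RTAX selects hits, so neither is addressed here.

```python
import os


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped sequence lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shard_fasta(path, out_dir, n_shards):
    """Write records round-robin into n_shards FASTA files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"shard_{i:03d}.fna") for i in range(n_shards)]
    handles = [open(p, "w") for p in paths]
    try:
        for index, (header, seq) in enumerate(read_fasta(path)):
            # Record k goes to shard k mod n_shards, so shard sizes differ by at most one.
            handles[index % n_shards].write(f">{header}\n{seq}\n")
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Round-robin assignment keeps the shards close to equal in record count, which matters because the reference database size is what drives memory use. Each shard would then be passed as the reference set in a separate run.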

antgonza commented 11 years ago

a) out of curiosity, how much?

rob-knight commented 11 years ago

OK so the message I'm getting is that we really have to find a free alternative to usearch for a range of reasons including this? Chris and Mihai, any comments on current progress on that front? Has anyone re-evaluated cd-hit or bowtie as alternatives recently?

rob-knight commented 11 years ago

Let's talk sometime next week if that works for you.

Mihai

cmhill-zz commented 11 years ago

It's been in submission limbo for the past few months: https://github.com/qiime/qiime/pull/706 @gregcaporaso

We should discuss what more is needed from DNACLUST to serve as a free replacement for USEARCH and what more is needed for submission.

gregcaporaso commented 11 years ago

Sorry about that, I lost track of this pull request but will review the code this week.

Greg

davidsoergel commented 11 years ago

Re 64-bit usearch paid licenses: I can't find the price list online anymore, but a copy I have stashed from a couple years ago indicates it's over $4k per CPU per year (!).

davidsoergel commented 11 years ago

It remains an open bug that the RTAX wrapper apparently does not terminate when the underlying process runs out of memory.
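One way a wrapper could surface this kind of failure is sketched below in Python 3. It assumes the out-of-memory message reaches the child process's stdout or stderr and that the child eventually exits or can be bounded by a timeout. Neither has been verified for RTAX, and the cause of the hang has not been diagnosed in this thread. `run_checked` and `ExternalToolError` are hypothetical names; this is not the existing QIIME wrapper code.

```python
import subprocess

# Substring of the usearch messages quoted earlier in this issue.
OOM_MARKER = "out of memory"


class ExternalToolError(RuntimeError):
    """Raised when the wrapped command fails, times out, or reports running out of memory."""


def run_checked(command, timeout_seconds=None):
    """Run command (a list of arguments) and return its stdout, raising on any failure."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_seconds,  # None means wait indefinitely
        )
    except subprocess.TimeoutExpired as exc:
        raise ExternalToolError(
            f"{command[0]} did not finish within {timeout_seconds} s"
        ) from exc

    # Check both streams, since it is not known which one carries the message.
    combined = result.stdout + result.stderr
    if OOM_MARKER in combined.lower():
        raise ExternalToolError(f"{command[0]} ran out of memory:\n{combined.strip()}")
    if result.returncode != 0:
        raise ExternalToolError(
            f"{command[0]} exited with status {result.returncode}:\n{result.stderr.strip()}"
        )
    return result.stdout
```

Scanning the output in addition to checking the exit status is deliberate: the rtax run shown above printed the fatal error twice, which suggests a failing usearch call may not by itself stop the calling script.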

davidsoergel commented 11 years ago

Also: while switching to a different search program like DNACLUST may be a good idea for Qiime in general, I'm afraid RTAX is pretty deeply tied to usearch. It should certainly be possible to rework it to function with a different search engine--in fact, depending on the available options, that could well improve performance or make the code cleaner etc. I won't be able to do this myself, but will be happy to consult with anyone who wants to take it on. I should note that there's no special reason why RTAX needs to be in Perl; in the Qiime context, it might make sense to just rewrite it in Python.

davidsoergel commented 11 years ago

To reduce memory usage in the meantime, I just found this suggestion from @jrvalverde:

"However, you may find that running RTax against the latest greengenes requires more memory than the 32bit version can handle (I did have that problem). If that is the case, you may want to try using VAMPS databases instead for the classification."

rob-knight commented 11 years ago

Yes that would be great and I am in town: Ulla could you coordinate?

Rob

On Aug 6, 2013, at 6:25 PM, mpop@umiacs.umd.edu wrote:

Let's talk sometime next week if that works for you.

Mihai

rob-knight commented 11 years ago

I think Greg just noticed so I'm glad this is maybe prompting some action. Thanks and apologies!

rob-knight commented 11 years ago

Thanks Greg!

rob-knight commented 11 years ago

Ouch.

On Aug 6, 2013, at 7:14 PM, davidsoergel wrote:

Re 64-bit usearch paid licenses: I can't find the price list online anymore, but a copy I have stashed from a couple years ago indicates it's over $4k per CPU per year (!).

rob-knight commented 11 years ago

Thanks for your willingness to consult. It will obviously promote adoption to rewrite using free tools: might you be able to spend at least a little time on that (with help)?

davidsoergel commented 11 years ago

Yes, I can spend some time discussing with interested parties how RTAX works, for the sake of easing any modifications or a rewrite. I'm sorry I won't have time to write any code, though, especially if you want to go the Python route.

jairideout commented 10 years ago

Moving to the 2.0 milestone. Probably useful to keep open so that RTAX can be updated to better handle out-of-memory failures.