geronimp / graftM

GraftM - Rapid community profiles from metagenomes
http://geronimp.github.io/graftM/
GNU General Public License v3.0
44 stars 16 forks source link

"graftM create" failed with SILVA due to unknown error #248

Closed jcmcnch closed 6 years ago

jcmcnch commented 6 years ago

Hello,

I'm trying to create a graftM package from the SILVA 16S database but am running into an unknown error. Here's an overview of what I've done:

  1. Modified the taxmap file from SILVA to remove duplicate identifiers, which otherwise cause graftM to complain.
  2. Ran graftM create as follows:

graftM create --alignment SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta --taxonomy taxmap_slv_ssu_ref_nr_132.txt.modified_for_graftM.uniq.tsv --sequences SILVA_132_SSURef_Nr99_tax_silva.fasta --verbosity 5 --tree tax_slv_ssu_132.tre

It took a really long time (> 1 week) and then I got this error:

  File "/usr/local/bin/graftM", line 345, in <module>
    Run(args).main()
  File "/usr/local/lib/python2.7/dist-packages/graftm/run.py", line 609, in main
    threads = self.args.threads
  File "/usr/local/lib/python2.7/dist-packages/graftm/create.py", line 616, in main
    self._create_search_hmm(sequences, taxonomy_definition, search_hmm, dereplication_level, threads)
  File "/usr/local/lib/python2.7/dist-packages/graftm/create.py", line 431, in _create_search_hmm
    self._get_hmm_from_alignment(temporary_alignment, search_hmm, temporary_alignment)
  File "/usr/local/lib/python2.7/dist-packages/graftm/create.py", line 144, in _get_hmm_from_alignment
    out = extern.run(cmd)
  File "/usr/local/lib/python2.7/dist-packages/extern/__init__.py", line 41, in run
    stdout)
extern.ExternCalledProcessError: Command hmmbuild -O /tmp/graftm9QjffA.sto '/tmp/graftmYEQiP3_search.hmm' '/tmp/graftmULQ0sN.aln.faa' returned non-zero exit status -6.
STDERR was: Fatal exception (source file esl_alphabet.c, line 928):
realloc for size 135182 failed
STDOUT was: # hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.1b2 (February 2015); http://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             /tmp/graftmULQ0sN.aln.faa
# output HMM file:                  /tmp/graftmYEQiP3_search.hmm
# processed alignment resaved to:   /tmp/graftm9QjffA.sto
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen     W eff_nseq re/pos description
#---- -------------------- ----- ----- ----- ----- -------- ------ -----------

Last I checked, it was on the mafft alignment step. Do you have any idea what might be causing this error? Running graftM on a cluster with ~1Tb of RAM so doubt memory is an issue and we have enough hard disk space available currently. Also, is there any way to restart the job from the temporary files generated?

Thanks very much for your help,

Jesse McNichol Fuhrman Lab, USC

wwood commented 6 years ago

Hi Jesse, 1TB is a lot of RAM, but unfortunately it does not seem to be enough. This line

realloc for size 135182 failed

suggests that memory ran out.

It seems to have failed on the hmmbuild step, not the mafft alignment step. As I see you could do 2 things:

  1. Make a HMM outside of graftm somehow, and provide it to graftm with --hmm.
  2. Use some dereplication so that fewer sequences are included in the multiple sequence alignment used to make the HMM with --dereplication_level. This may also have the side effect of increasing the sensitivity of the search procedure since the HMM will be made from a more taxonomically balanced set of sequences.

Good luck, and thanks for trying GraftM. ben

jcmcnch commented 6 years ago

Hi Ben,

Thanks for your quick reply! I will probably try making an HMM separately because it looks like the default in graftM is already to do dereplication at level 6, or genus level (if I've read that correctly). I'd prefer not to lose so much resolution so will see if I can get the HMM build step working outside of graftM.

An unrelated question - I saw some mention in the documentation for graftM that it should work for contigs as well as raw reads (though didn't see any mention of this use case in your recent paper). I tried feeding the program a fasta file of contigs using the --forward flag but it complained about not being able to detect the format even if I explicitly specified it. I saw there is a --bootstrap-contigs flag, which certainly would be very useful but I'm wondering if it's possible to directly feed contigs into the normal graftM graft workflow.

Thanks very much for your help,

Jesse

wwood commented 6 years ago

Hi, In older versions the dereplication level default was 6, but as of 0.11 it is 0. Also, I'm not sure what you mean by 'lose resolution' - the HMM is used for detecting sequences and converting them into an alignment. The taxonomic assignment step is unaffected by the dereplication. Perhaps this is unclear from the documentation?

Yes, it is possible to feed in contigs directly. There has been a few commits since 0.11 that implement some bug fixes so I would recommend using the bleeding edge version. If you are attempting to detect protein sequences however I would recommend feeding in protein sequences (e.g. those generated with prodigal) as the default ORF prediction parameters are really geared towards finding ORFs in short sequences.

HTH, ben

jcmcnch commented 6 years ago

Hi Ben,

OK that makes sense. I was under the mistaken impression that the HMM would do the taxonomic assignment. That's not a limitation of the documentation I think but rather my ignorance as to how HMMs work in the graftM pipeline.

Thanks also for the info on using graftM with contigs, I will do as you suggested with prodigal and update to the latest version and should be able to figure it out from there!

Thanks again for all your help,

Jesse

On Sat, Apr 7, 2018 at 1:58 PM, Ben J Woodcroft notifications@github.com wrote:

Hi, In older versions the dereplication level default was 6, but as of 0.11 it is 0. Also, I'm not sure what you mean by 'lose resolution' - the HMM is used for detecting sequences and converting them into an alignment. The taxonomic assignment step is unaffected by the dereplication. Perhaps this is unclear from the documentation?

Yes, it is possible to feed in contigs directly. There has been a few commits since 0.11 that implement some bug fixes so I would recommend using the bleeding edge version. If you are attempting to detect protein sequences however I would recommend feeding in protein sequences (e.g. those generated with prodigal) as the default ORF prediction parameters are really geared towards finding ORFs in short sequences.

HTH, ben

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379498793, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_KBLyX1R9_5m2hlWfyI0Zb0LUKgDks5tmSh-gaJpZM4TKOU- .

wwood commented 6 years ago

Alrighty, let us know if there is anything else.

jcmcnch commented 6 years ago

Hi Ben,

I am still having trouble running contigs through graftM. I have updated to the latest version with "git clone" and then ran "python setup.py install" to install manually. So I should be on the "bleeding edge" as you recommended. However, when I try to run it either on the contigs or on the proteins predicted from prodigal, I get the error that graftM can't find any reads. I'd be very grateful if you could take a quick look to see what's wrong.

Here is what I'm running:

'graftM graft --forward test.fa --input_sequence_type aminoacid --graftm_package 7.39.2013_08_greengenes_97_otus.better_tree.gpkg --output_directory prodigal_test --no_merge_reads --force'

My test.fa file looks like this (I tried with the extension variously as faa, fasta and fa):

'

SN_66772_1 # 3 # 101 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.232 NKYDFSDLTNKDKTELFLSGITQGQDFFKDKL SN_66772_2 # 74 # 649 # -1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.267 MRYFIHCDNSFRKQQLIASKLDFLNTDIDEEDLQLNTITFLYGYTYSKINHAFSKMNVRL NRRNPDEISNCLTYCDCVILFHNFIEYSNGMQSIIDTCLKEGIPLVIFSDHIKKGFLSNA TGDLAITQEFPKIATNNKIIKVQNFNFHPYKYVPNMSFKQVVELTRQNYAELNEEKAERR IKYHNLSLKKS SN_66772_3 # 720 # 1133 # 1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.326 MFENFRKYLTCENIINFILIVIIIILLVNIIQQKFFSKSNSEPLTNVSDSVKEKVKNGKL VVYKTNSCGYCKKFMKLLNDLGLETYVRVVDVGTQSGKIEYAALGEKGVPIIRCETTGEI HVGYNPDIEDLLHKLKM SN_66772_4 # 1144 # 1626 # 1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.277 MEKIEEKKKEIESILTNLQVIGSLNKYEKLIFKNTWELDVDKRYLQSVRRYFSGDSREAI LNFLSKLFTSIDTVVDFLLTTYKLSKSEALHDYVTNTLHDFYINTLNAKKGVTNLIETYS DDVAVKSKLELYFNMLDRKSLSLKQKLGIRDEAFTISNST '

And the output I get is:

' /home/anaconda/miniconda2/envs/graftM/lib/python2.7/site-packages/h5py/init.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters

                         GraftM 0.11.1

                            GRAFT

                   Joel Boyd, Ben Woodcroft

                                                     __/__
                                              ______|
      _- - _                         ________|      |_____/
       - -            -             |        |____/_
       - _     >>>>  -   >>>>   ____|
      - _-  -         -             |      ______
         - _                        |_____|
       -                                  |______

04/09/2018 04:48:43 PM INFO: Working on test 04/09/2018 04:48:43 PM INFO: 0 read(s) detected 04/09/2018 04:48:43 PM INFO: No reads found in test 04/09/2018 04:48:43 PM ERROR: No hits in any of the provided files. Cannot continue with no reads to assign taxonomy to. '

Would be grateful for any pointers you have.

Cheers,

Jesse

On Sat, Apr 7, 2018 at 7:06 PM, Ben J Woodcroft notifications@github.com wrote:

Alrighty, let us know if there is anything else.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379513239, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_Enu-PcLnmO8nBPLnILOc64waM07ks5tmXCxgaJpZM4TKOU- .

wwood commented 6 years ago

Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.

You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben

jcmcnch commented 6 years ago

Hi Ben,

Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :

https://data.ace.uq.edu.au/public/graftm/7/7.23.recombination_protein_recA.gpkg.tar.gz

I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?

Jesse

On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft notifications@github.com wrote:

Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.

You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379931789, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU-se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .

jcmcnch commented 6 years ago

Hi Ben,

I think I've figured it out. When it says 0 reads detected, I thought it meant it couldn't read the fasta file I provided. Now I understand that it actually means that graftM can't find any matches for, say, 16S in the sequences provided. Thanks for putting up with my naive questions!

Jesse

On Mon, Apr 9, 2018 at 5:17 PM, Jesse McNichol jcmcnichol@gmail.com wrote:

Hi Ben,

Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :

https://data.ace.uq.edu.au/public/graftm/7/7.23. recombination_protein_recA.gpkg.tar.gz

I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?

Jesse

On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft notifications@github.com wrote:

Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.

You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379931789, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU-se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .

jcmcnch commented 6 years ago

Hi Ben,

I do have another question, related to my initial post. As you suggested, I am trying to play with the dereplication level flag to reduce the number of sequences that are processed in the alignment step so it doesn't use so much memory. I've tried it with level 5, and level 4 and both times it has also crashed (apparently also due to memory issues). I am currently re-running it with level 3, but noticed something a bit puzzling in the log file (see below for log file output). Basically, it says that it's removing 97% of the sequences from the alignment which are redundant. However, it then says "After removing sequences from alignment, 695171 remain”. 695171 is the exact number of sequences that are in the alignment file I've provided, so it seems to be actually not removing anything. Do you have any idea why this might be? I've included what the alignment file I'm using looks like (taken from SILVA), as well as the command to run graftM for your reference. Would appreciate any ideas you might have for how to debug this!

Thanks once again for your help,

Jesse

log file output:

04/16/2018 05:58:41 PM DEBUG: Sequence AJ408982 redundant at 3 rank in the taxonomy file level: Bacteroidia 04/16/2018 05:58:41 PM DEBUG: Sequence JN637315 redundant at 3 rank in the taxonomy file level: Actinobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence HM267286 redundant at 3 rank in the taxonomy file level: Gammaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence JQ580408 redundant at 3 rank in the taxonomy file level: Deltaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence KC185848 redundant at 3 rank in the taxonomy file level: Holozoa 04/16/2018 05:58:41 PM INFO: Removing 675353 sequences from the search HMM that are redundant at the 3 rank in the taxonomy file 04/16/2018 05:59:44 PM DEBUG: After removing sequences from alignment, 695171 remain 04/16/2018 05:59:44 PM DEBUG: Aligning sequences using mafft 04/16/2018 06:00:05 PM DEBUG: Running extern cmd: mafft --anysymbol --thread 20 --auto /dev/stdin > /tmp/graftmocRCvZ.aln.faa

A snapshot of the alignment file I'm using:

GY203941.1.1493 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 7;unidentified ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ..........................................................................CU--C- AG-G-A-U--G-A--A-C-G-C--U-G-G-C--U-A--C-A-G-G----------C----U-U--AA-C--AC-A----U -G--C----A-A-G--U-C--GA-G-GG-----------G-AAA--C-G--G--------------C-A----------- ---UU-G-A-G-U-G-----------------------------------------------------------------


------------------CUU-G---------------------------------------------------------


---------------------------------------------C--A--C-U-G-A--AUG----------------G -A-C----G--UC--G-------AC-C-G-G-C-GC-A--C-------------G-G-----------------------


----------------------------G--U--G-A--G-U-A---AC-G-C-G--U-A-UC---CAA--C-CU-U--C -C-CGU--UA-C--------------------------------------------------------------------

The command I used to run graftM create:

graftM create --alignment SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta --taxonomy taxmap_slv_ssu_ref_nr_132.txt.modified_for_graftM.uniq.tsv --sequences SILVA_132_SSURef_Nr99_tax_silva.fasta --verbosity 5 --tree tax_slv_ssu_132.tre --dereplication_level 3 --threads 20 --log 180416_graftM_create_log.txt

On Mon, Apr 9, 2018 at 6:34 PM, Jesse McNichol jcmcnichol@gmail.com wrote:

Hi Ben,

I think I've figured it out. When it says 0 reads detected, I thought it meant it couldn't read the fasta file I provided. Now I understand that it actually means that graftM can't find any matches for, say, 16S in the sequences provided. Thanks for putting up with my naive questions!

Jesse

On Mon, Apr 9, 2018 at 5:17 PM, Jesse McNichol jcmcnichol@gmail.com wrote:

Hi Ben,

Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :

https://data.ace.uq.edu.au/public/graftm/7/7.23.recombinatio n_protein_recA.gpkg.tar.gz

I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?

Jesse

On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft <notifications@github.com

wrote:

Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.

You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379931789, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU-se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .

geronimp commented 6 years ago

Hi Jesse!

I just had a quick look, but I couldn't replicate this problem on a smaller data set. I'll have a look at using the SILVA database which you're using instead, but in the meantime, would you be able to send through the whole log file? Just attach it to an email and send it my address found here: http://ecogenomic.org/personnel/mr-joel-boyd

Thank you,

Joel

PhD student, Australian Centre for Ecogenomics (ACE)

On 18 April 2018 at 02:12, jcmcnch notifications@github.com wrote:

Hi Ben,

I do have another question, related to my initial post. As you suggested, I am trying to play with the dereplication level flag to reduce the number of sequences that are processed in the alignment step so it doesn't use so much memory. I've tried it with level 5, and level 4 and both times it has also crashed (apparently also due to memory issues). I am currently re-running it with level 3, but noticed something a bit puzzling in the log file (see below for log file output). Basically, it says that it's removing 97% of the sequences from the alignment which are redundant. However, it then says "After removing sequences from alignment, 695171 remain”. 695171 is the exact number of sequences that are in the alignment file I've provided, so it seems to be actually not removing anything. Do you have any idea why this might be? I've included what the alignment file I'm using looks like (taken from SILVA), as well as the command to run graftM for your reference. Would appreciate any ideas you might have for how to debug this!

Thanks once again for your help,

Jesse

log file output:

04/16/2018 05:58:41 PM DEBUG: Sequence AJ408982 redundant at 3 rank in the taxonomy file level: Bacteroidia 04/16/2018 05:58:41 PM DEBUG: Sequence JN637315 redundant at 3 rank in the taxonomy file level: Actinobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence HM267286 redundant at 3 rank in the taxonomy file level: Gammaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence JQ580408 redundant at 3 rank in the taxonomy file level: Deltaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence KC185848 redundant at 3 rank in the taxonomy file level: Holozoa 04/16/2018 05:58:41 PM INFO: Removing 675353 sequences from the search HMM that are redundant at the 3 rank in the taxonomy file 04/16/2018 05:59:44 PM DEBUG: After removing sequences from alignment, 695171 remain 04/16/2018 05:59:44 PM DEBUG: Aligning sequences using mafft 04/16/2018 06:00:05 PM DEBUG: Running extern cmd: mafft --anysymbol --thread 20 --auto /dev/stdin > /tmp/graftmocRCvZ.aln.faa

A snapshot of the alignment file I'm using:

GY203941.1.1493 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 7;unidentified ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ ..............CU--C- AG-G-A-U--G-A--A-C-G-C--U-G-G-C--U-A--C-A-G-G----------C---- U-U--AA-C--AC-A----U -G--C----A-A-G--U-C--GA-G-GG-----------G-AAA--C-G--G-------- ------C-A----------- ---UU-G-A-G-U-G---------------------------------------------





------------------CUU-G-------------------------------------





---------------------------------------------C--A--C-U-G-A-- AUG----------------G -A-C----G--UC--G-------AC-C-G-G-C-GC-A--C-------------G-G---





----------------------------G--U--G-A--G-U-A---AC-G-C-G--U- A-UC---CAA--C-CU-U--C -C-CGU--UA-C------------------------------------------------



The command I used to run graftM create:

graftM create --alignment SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta --taxonomy taxmap_slv_ssu_ref_nr_132.txt.modified_for_graftM.uniq.tsv --sequences SILVA_132_SSURef_Nr99_tax_silva.fasta --verbosity 5 --tree tax_slv_ssu_132.tre --dereplication_level 3 --threads 20 --log 180416_graftM_create_log.txt

On Mon, Apr 9, 2018 at 6:34 PM, Jesse McNichol jcmcnichol@gmail.com wrote:

Hi Ben,

I think I've figured it out. When it says 0 reads detected, I thought it meant it couldn't read the fasta file I provided. Now I understand that it actually means that graftM can't find any matches for, say, 16S in the sequences provided. Thanks for putting up with my naive questions!

Jesse

On Mon, Apr 9, 2018 at 5:17 PM, Jesse McNichol jcmcnichol@gmail.com wrote:

Hi Ben,

Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :

https://data.ace.uq.edu.au/public/graftm/7/7.23.recombinatio n_protein_recA.gpkg.tar.gz

I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?

Jesse

On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft < notifications@github.com

wrote:

Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.

You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <https://github.com/geronimp/graftM/issues/248#issuecomment-379931789 , or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU- se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-382051032, or mute the thread https://github.com/notifications/unsubscribe-auth/AHc1WsDA_3T9dFaqr9Qg0_XBnI7lFCP4ks5tphRhgaJpZM4TKOU- .

SCarrSuperstar commented 4 years ago

Hi Jesse, Were you ever able to create that Silva package? I also tried to make one, but was getting the same two errors (either I ran out of disk space or ram). I'm guessing that once you made the Silva package that it required a lot of ram just to use it. So I'm thinking I'll just go with the default green genes package, but I would love to know how your story ended.

jcmcnch commented 4 years ago

Hi @SCarrSuperstar

I actually gave up in the end trying to make my own package, and used the greengenes package. This worked well for 16S, but unfortunately the coverage (across the regions of the molecule) for 18S was extremely uneven (I think the greengenes + 18S HMM model misses some parts of the longer 18S).

In the end I moved towards using phyloFlash (https://github.com/HRGV/phyloFlash) which (in my hands, using open ocean samples) retrieves more than 2x the 16S and 18S sequences and gives much more even coverage across the 18S. Recommend trying phyloFlash if your goal is just to retrieve SSU rRNA from metagenomes using SILVA.

Cheers, Jesse