Closed jcmcnch closed 6 years ago
Hi Jesse, 1TB is a lot of RAM, but unfortunately it does not seem to be enough. This line
realloc for size 135182 failed
suggests that memory ran out.
It seems to have failed on the hmmbuild step, not the mafft alignment step. As I see you could do 2 things:
--hmm
.--dereplication_level
. This may also have the side effect of increasing the sensitivity of the search procedure since the HMM will be made from a more taxonomically balanced set of sequences.Good luck, and thanks for trying GraftM. ben
Hi Ben,
Thanks for your quick reply! I will probably try making an HMM separately because it looks like the default in graftM is already to do dereplication at level 6, or genus level (if I've read that correctly). I'd prefer not to lose so much resolution so will see if I can get the HMM build step working outside of graftM.
An unrelated question - I saw some mention in the documentation for graftM that it should work for contigs as well as raw reads (though didn't see any mention of this use case in your recent paper). I tried feeding the program a fasta file of contigs using the --forward flag but it complained about not being able to detect the format even if I explicitly specified it. I saw there is a --bootstrap-contigs flag, which certainly would be very useful but I'm wondering if it's possible to directly feed contigs into the normal graftM graft workflow.
Thanks very much for your help,
Jesse
Hi, In older versions the dereplication level default was 6, but as of 0.11 it is 0. Also, I'm not sure what you mean by 'lose resolution' - the HMM is used for detecting sequences and converting them into an alignment. The taxonomic assignment step is unaffected by the dereplication. Perhaps this is unclear from the documentation?
Yes, it is possible to feed in contigs directly. There has been a few commits since 0.11 that implement some bug fixes so I would recommend using the bleeding edge version. If you are attempting to detect protein sequences however I would recommend feeding in protein sequences (e.g. those generated with prodigal) as the default ORF prediction parameters are really geared towards finding ORFs in short sequences.
HTH, ben
Hi Ben,
OK that makes sense. I was under the mistaken impression that the HMM would do the taxonomic assignment. That's not a limitation of the documentation I think but rather my ignorance as to how HMMs work in the graftM pipeline.
Thanks also for the info on using graftM with contigs, I will do as you suggested with prodigal and update to the latest version and should be able to figure it out from there!
Thanks again for all your help,
Jesse
On Sat, Apr 7, 2018 at 1:58 PM, Ben J Woodcroft notifications@github.com wrote:
Hi, In older versions the dereplication level default was 6, but as of 0.11 it is 0. Also, I'm not sure what you mean by 'lose resolution' - the HMM is used for detecting sequences and converting them into an alignment. The taxonomic assignment step is unaffected by the dereplication. Perhaps this is unclear from the documentation?
Yes, it is possible to feed in contigs directly. There has been a few commits since 0.11 that implement some bug fixes so I would recommend using the bleeding edge version. If you are attempting to detect protein sequences however I would recommend feeding in protein sequences (e.g. those generated with prodigal) as the default ORF prediction parameters are really geared towards finding ORFs in short sequences.
HTH, ben
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379498793, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_KBLyX1R9_5m2hlWfyI0Zb0LUKgDks5tmSh-gaJpZM4TKOU- .
Alrighty, let us know if there is anything else.
Hi Ben,
I am still having trouble running contigs through graftM. I have updated to the latest version with "git clone" and then ran "python setup.py install" to install manually. So I should be on the "bleeding edge" as you recommended. However, when I try to run it either on the contigs or on the proteins predicted from prodigal, I get the error that graftM can't find any reads. I'd be very grateful if you could take a quick look to see what's wrong.
Here is what I'm running:
'graftM graft --forward test.fa --input_sequence_type aminoacid --graftm_package 7.39.2013_08_greengenes_97_otus.better_tree.gpkg --output_directory prodigal_test --no_merge_reads --force'
My test.fa file looks like this (I tried with the extension variously as faa, fasta and fa):
'
SN_66772_1 # 3 # 101 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.232 NKYDFSDLTNKDKTELFLSGITQGQDFFKDKL SN_66772_2 # 74 # 649 # -1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.267 MRYFIHCDNSFRKQQLIASKLDFLNTDIDEEDLQLNTITFLYGYTYSKINHAFSKMNVRL NRRNPDEISNCLTYCDCVILFHNFIEYSNGMQSIIDTCLKEGIPLVIFSDHIKKGFLSNA TGDLAITQEFPKIATNNKIIKVQNFNFHPYKYVPNMSFKQVVELTRQNYAELNEEKAERR IKYHNLSLKKS SN_66772_3 # 720 # 1133 # 1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.326 MFENFRKYLTCENIINFILIVIIIILLVNIIQQKFFSKSNSEPLTNVSDSVKEKVKNGKL VVYKTNSCGYCKKFMKLLNDLGLETYVRVVDVGTQSGKIEYAALGEKGVPIIRCETTGEI HVGYNPDIEDLLHKLKM SN_66772_4 # 1144 # 1626 # 1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.277 MEKIEEKKKEIESILTNLQVIGSLNKYEKLIFKNTWELDVDKRYLQSVRRYFSGDSREAI LNFLSKLFTSIDTVVDFLLTTYKLSKSEALHDYVTNTLHDFYINTLNAKKGVTNLIETYS DDVAVKSKLELYFNMLDRKSLSLKQKLGIRDEAFTISNST '
And the output I get is:
'
/home/anaconda/miniconda2/envs/graftM/lib/python2.7/site-packages/h5py/init.py:36:
FutureWarning: Conversion of the second argument of issubdtype from float
to np.floating
is deprecated. In future, it will be treated as
np.float64 == np.dtype(float).type
.
from ._conv import register_converters as _register_converters
GraftM 0.11.1
GRAFT
Joel Boyd, Ben Woodcroft
__/__
______|
_- - _ ________| |_____/
- - - | |____/_
- _ >>>> - >>>> ____|
- _- - - | ______
- _ |_____|
- |______
04/09/2018 04:48:43 PM INFO: Working on test 04/09/2018 04:48:43 PM INFO: 0 read(s) detected 04/09/2018 04:48:43 PM INFO: No reads found in test 04/09/2018 04:48:43 PM ERROR: No hits in any of the provided files. Cannot continue with no reads to assign taxonomy to. '
Would be grateful for any pointers you have.
Cheers,
Jesse
On Sat, Apr 7, 2018 at 7:06 PM, Ben J Woodcroft notifications@github.com wrote:
Alrighty, let us know if there is anything else.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379513239, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_Enu-PcLnmO8nBPLnILOc64waM07ks5tmXCxgaJpZM4TKOU- .
Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.
You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben
Hi Ben,
Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :
https://data.ace.uq.edu.au/public/graftm/7/7.23.recombination_protein_recA.gpkg.tar.gz
I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?
Jesse
On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft notifications@github.com wrote:
Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.
You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379931789, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU-se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .
Hi Ben,
I think I've figured it out. When it says 0 reads detected, I thought it meant it couldn't read the fasta file I provided. Now I understand that it actually means that graftM can't find any matches for, say, 16S in the sequences provided. Thanks for putting up with my naive questions!
Jesse
On Mon, Apr 9, 2018 at 5:17 PM, Jesse McNichol jcmcnichol@gmail.com wrote:
Hi Ben,
Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :
https://data.ace.uq.edu.au/public/graftm/7/7.23. recombination_protein_recA.gpkg.tar.gz
I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?
Jesse
On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft notifications@github.com wrote:
Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.
You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379931789, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU-se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .
Hi Ben,
I do have another question, related to my initial post. As you suggested, I am trying to play with the dereplication level flag to reduce the number of sequences that are processed in the alignment step so it doesn't use so much memory. I've tried it with level 5, and level 4 and both times it has also crashed (apparently also due to memory issues). I am currently re-running it with level 3, but noticed something a bit puzzling in the log file (see below for log file output). Basically, it says that it's removing 97% of the sequences from the alignment which are redundant. However, it then says "After removing sequences from alignment, 695171 remain”. 695171 is the exact number of sequences that are in the alignment file I've provided, so it seems to be actually not removing anything. Do you have any idea why this might be? I've included what the alignment file I'm using looks like (taken from SILVA), as well as the command to run graftM for your reference. Would appreciate any ideas you might have for how to debug this!
Thanks once again for your help,
Jesse
log file output:
04/16/2018 05:58:41 PM DEBUG: Sequence AJ408982 redundant at 3 rank in the taxonomy file level: Bacteroidia 04/16/2018 05:58:41 PM DEBUG: Sequence JN637315 redundant at 3 rank in the taxonomy file level: Actinobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence HM267286 redundant at 3 rank in the taxonomy file level: Gammaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence JQ580408 redundant at 3 rank in the taxonomy file level: Deltaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence KC185848 redundant at 3 rank in the taxonomy file level: Holozoa 04/16/2018 05:58:41 PM INFO: Removing 675353 sequences from the search HMM that are redundant at the 3 rank in the taxonomy file 04/16/2018 05:59:44 PM DEBUG: After removing sequences from alignment, 695171 remain 04/16/2018 05:59:44 PM DEBUG: Aligning sequences using mafft 04/16/2018 06:00:05 PM DEBUG: Running extern cmd: mafft --anysymbol --thread 20 --auto /dev/stdin > /tmp/graftmocRCvZ.aln.faa
A snapshot of the alignment file I'm using:
GY203941.1.1493 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 7;unidentified ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ................................................................................ ..........................................................................CU--C- AG-G-A-U--G-A--A-C-G-C--U-G-G-C--U-A--C-A-G-G----------C----U-U--AA-C--AC-A----U -G--C----A-A-G--U-C--GA-G-GG-----------G-AAA--C-G--G--------------C-A----------- ---UU-G-A-G-U-G-----------------------------------------------------------------
------------------CUU-G---------------------------------------------------------
---------------------------------------------C--A--C-U-G-A--AUG----------------G -A-C----G--UC--G-------AC-C-G-G-C-GC-A--C-------------G-G-----------------------
----------------------------G--U--G-A--G-U-A---AC-G-C-G--U-A-UC---CAA--C-CU-U--C -C-CGU--UA-C--------------------------------------------------------------------
The command I used to run graftM create:
graftM create --alignment SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta --taxonomy taxmap_slv_ssu_ref_nr_132.txt.modified_for_graftM.uniq.tsv --sequences SILVA_132_SSURef_Nr99_tax_silva.fasta --verbosity 5 --tree tax_slv_ssu_132.tre --dereplication_level 3 --threads 20 --log 180416_graftM_create_log.txt
On Mon, Apr 9, 2018 at 6:34 PM, Jesse McNichol jcmcnichol@gmail.com wrote:
Hi Ben,
I think I've figured it out. When it says 0 reads detected, I thought it meant it couldn't read the fasta file I provided. Now I understand that it actually means that graftM can't find any matches for, say, 16S in the sequences provided. Thanks for putting up with my naive questions!
Jesse
On Mon, Apr 9, 2018 at 5:17 PM, Jesse McNichol jcmcnichol@gmail.com wrote:
Hi Ben,
Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :
https://data.ace.uq.edu.au/public/graftm/7/7.23.recombinatio n_protein_recA.gpkg.tar.gz
I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?
Jesse
On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft <notifications@github.com
wrote:
Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.
You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-379931789, or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU-se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .
Hi Jesse!
I just had a quick look, but I couldn't replicate this problem on a smaller data set. I'll have a look at using the SILVA database which you're using instead, but in the meantime, would you be able to send through the whole log file? Just attach it to an email and send it my address found here: http://ecogenomic.org/personnel/mr-joel-boyd
Thank you,
Joel
PhD student, Australian Centre for Ecogenomics (ACE)
On 18 April 2018 at 02:12, jcmcnch notifications@github.com wrote:
Hi Ben,
I do have another question, related to my initial post. As you suggested, I am trying to play with the dereplication level flag to reduce the number of sequences that are processed in the alignment step so it doesn't use so much memory. I've tried it with level 5, and level 4 and both times it has also crashed (apparently also due to memory issues). I am currently re-running it with level 3, but noticed something a bit puzzling in the log file (see below for log file output). Basically, it says that it's removing 97% of the sequences from the alignment which are redundant. However, it then says "After removing sequences from alignment, 695171 remain”. 695171 is the exact number of sequences that are in the alignment file I've provided, so it seems to be actually not removing anything. Do you have any idea why this might be? I've included what the alignment file I'm using looks like (taken from SILVA), as well as the command to run graftM for your reference. Would appreciate any ideas you might have for how to debug this!
Thanks once again for your help,
Jesse
log file output:
04/16/2018 05:58:41 PM DEBUG: Sequence AJ408982 redundant at 3 rank in the taxonomy file level: Bacteroidia 04/16/2018 05:58:41 PM DEBUG: Sequence JN637315 redundant at 3 rank in the taxonomy file level: Actinobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence HM267286 redundant at 3 rank in the taxonomy file level: Gammaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence JQ580408 redundant at 3 rank in the taxonomy file level: Deltaproteobacteria 04/16/2018 05:58:41 PM DEBUG: Sequence KC185848 redundant at 3 rank in the taxonomy file level: Holozoa 04/16/2018 05:58:41 PM INFO: Removing 675353 sequences from the search HMM that are redundant at the 3 rank in the taxonomy file 04/16/2018 05:59:44 PM DEBUG: After removing sequences from alignment, 695171 remain 04/16/2018 05:59:44 PM DEBUG: Aligning sequences using mafft 04/16/2018 06:00:05 PM DEBUG: Running extern cmd: mafft --anysymbol --thread 20 --auto /dev/stdin > /tmp/graftmocRCvZ.aln.faa
A snapshot of the alignment file I'm using:
GY203941.1.1493 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 7;unidentified ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ .................... ............................................................ ..............CU--C- AG-G-A-U--G-A--A-C-G-C--U-G-G-C--U-A--C-A-G-G----------C---- U-U--AA-C--AC-A----U -G--C----A-A-G--U-C--GA-G-GG-----------G-AAA--C-G--G-------- ------C-A----------- ---UU-G-A-G-U-G---------------------------------------------
------------------CUU-G-------------------------------------
---------------------------------------------C--A--C-U-G-A-- AUG----------------G -A-C----G--UC--G-------AC-C-G-G-C-GC-A--C-------------G-G---
----------------------------G--U--G-A--G-U-A---AC-G-C-G--U- A-UC---CAA--C-CU-U--C -C-CGU--UA-C------------------------------------------------
The command I used to run graftM create:
graftM create --alignment SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta --taxonomy taxmap_slv_ssu_ref_nr_132.txt.modified_for_graftM.uniq.tsv --sequences SILVA_132_SSURef_Nr99_tax_silva.fasta --verbosity 5 --tree tax_slv_ssu_132.tre --dereplication_level 3 --threads 20 --log 180416_graftM_create_log.txt
On Mon, Apr 9, 2018 at 6:34 PM, Jesse McNichol jcmcnichol@gmail.com wrote:
Hi Ben,
I think I've figured it out. When it says 0 reads detected, I thought it meant it couldn't read the fasta file I provided. Now I understand that it actually means that graftM can't find any matches for, say, 16S in the sequences provided. Thanks for putting up with my naive questions!
Jesse
On Mon, Apr 9, 2018 at 5:17 PM, Jesse McNichol jcmcnichol@gmail.com wrote:
Hi Ben,
Ah yes, stupid mistake on my part (I do know that actually!). But I get the same error if I run it with a package that is obviously a protein, i.e. :
https://data.ace.uq.edu.au/public/graftm/7/7.23.recombinatio n_protein_recA.gpkg.tar.gz
I also get the same error if I run it with the contigs as DNA, but can run it fine on raw reads. Any ideas what I'm doing wrong?
Jesse
On Mon, Apr 9, 2018 at 5:08 PM, Ben J Woodcroft < notifications@github.com
wrote:
Hi, Perhaps I have mislead you. Looking for 16S in a set of protein sequences won't work, because 16S isn't a protein. In this case I would suggest simply feeding in the contigs as nucleotide sequences.
You'd want to feed in protein sequences when you are looking for protein coding genes e.g. McrA. HTH, ben
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <https://github.com/geronimp/graftM/issues/248#issuecomment-379931789 , or mute the thread https://github.com/notifications/unsubscribe-auth/AFrj_OnO0TvBYU- se95BjIpIYFSpbKMBks5tm_gZgaJpZM4TKOU- .
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/issues/248#issuecomment-382051032, or mute the thread https://github.com/notifications/unsubscribe-auth/AHc1WsDA_3T9dFaqr9Qg0_XBnI7lFCP4ks5tphRhgaJpZM4TKOU- .
Hi Jesse, Were you ever able to create that Silva package? I also tried to make one, but was getting the same two errors (either I ran out of disk space or ram). I'm guessing that once you made the Silva package that it required a lot of ram just to use it. So I'm thinking I'll just go with the default green genes package, but I would love to know how your story ended.
Hi @SCarrSuperstar
I actually gave up in the end trying to make my own package, and used the greengenes package. This worked well for 16S, but unfortunately the coverage (across the regions of the molecule) for 18S was extremely uneven (I think the greengenes + 18S HMM model misses some parts of the longer 18S).
In the end I moved towards using phyloFlash (https://github.com/HRGV/phyloFlash) which (in my hands, using open ocean samples) retrieves more than 2x the 16S and 18S sequences and gives much more even coverage across the 18S. Recommend trying phyloFlash if your goal is just to retrieve SSU rRNA from metagenomes using SILVA.
Cheers, Jesse
Hello,
I'm trying to create a graftM package from the SILVA 16S database but am running into an unknown error. Here's an overview of what I've done:
graftM create --alignment SILVA_132_SSURef_Nr99_tax_silva_full_align_trunc.fasta --taxonomy taxmap_slv_ssu_ref_nr_132.txt.modified_for_graftM.uniq.tsv --sequences SILVA_132_SSURef_Nr99_tax_silva.fasta --verbosity 5 --tree tax_slv_ssu_132.tre
It took a really long time (> 1 week) and then I got this error:
Last I checked, it was on the mafft alignment step. Do you have any idea what might be causing this error? Running graftM on a cluster with ~1Tb of RAM so doubt memory is an issue and we have enough hard disk space available currently. Also, is there any way to restart the job from the temporary files generated?
Thanks very much for your help,
Jesse McNichol Fuhrman Lab, USC