csmiller / EMIRGE

EMIRGE reconstructs full length ribosomal genes from short read sequencing data.
37 stars 29 forks source link

EMIRGE should exit more gracefully if 0 sequences are reconstructed #9

Open csmiller opened 11 years ago

csmiller commented 11 years ago

If EMIRGE cannot reconstruct any sequences due to low read coverage, it currently gets to the clustering stage where usearch throws a segfault and emirge exits without an informative error message.

cmorganl commented 8 years ago

Hi @csmiller, I think I'm encountering an identical, or very related, issue. If there are fewer than 50 reads aligning to the SSU database (I'm using the default downloaded by emirge) I get:

Clustering sequences for iteration 0 at Thu Apr  7 10:17:19 2016...
    cluster threshold = 0.970
[fai_load] build FASTA index.
[fai_load] fail to open FASTA index.
Traceback (most recent call last):
  File "/usr/local/bin/emirge.py", line 4, in <module>
    __import__('pkg_resources').run_script('EMIRGE==0.60.3', 'emirge.py')
  File "/Library/Python/2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Library/Python/2.7/site-packages/pkg_resources/__init__.py", line 1490, in run_script
    exec(script_code, namespace, namespace)
  File "/Library/Python/2.7/site-packages/EMIRGE-0.60.3-py2.7-macosx-10.8-x86_64.egg/EGG-INFO/scripts/emirge.py", line 1697, in <module>

  File "/Library/Python/2.7/site-packages/EMIRGE-0.60.3-py2.7-macosx-10.8-x86_64.egg/EGG-INFO/scripts/emirge.py", line 1688, in main

  File "/Library/Python/2.7/site-packages/EMIRGE-0.60.3-py2.7-macosx-10.8-x86_64.egg/EGG-INFO/scripts/emirge.py", line 1439, in do_iterations

  File "/Library/Python/2.7/site-packages/EMIRGE-0.60.3-py2.7-macosx-10.8-x86_64.egg/EGG-INFO/scripts/emirge.py", line 500, in do_iteration

  File "/Library/Python/2.7/site-packages/EMIRGE-0.60.3-py2.7-macosx-10.8-x86_64.egg/EGG-INFO/scripts/emirge.py", line 807, in cluster_sequences

  File "/Library/Python/2.7/site-packages/EMIRGE-0.60.3-py2.7-macosx-10.8-x86_64.egg/EGG-INFO/scripts/emirge.py", line 829, in cluster_sequences2

  File "/Library/Python/2.7/site-packages/EMIRGE-0.60.3-py2.7-macosx-10.8-x86_64.egg/EGG-INFO/scripts/emirge.py", line 769, in write_consensus_with_mask

  File "pysam/cfaidx.pyx", line 113, in pysam.cfaidx.FastaFile.__cinit__ (pysam/cfaidx.c:2025)
  File "pysam/cfaidx.pyx", line 149, in pysam.cfaidx.FastaFile._open (pysam/cfaidx.c:2845)
IOError: could not open file `/Users/cmorganlang/EMIRGE_16S_prediction/BioOHT_76.AD-318_B19_Frak/iter.00/iter.00.cons.fasta`

It looks like this stems from the pysam call within write_consensus_with_mask judging by the location in the run log at which the pipeline breaks.

I am trying to assemble the 16S genes from SAGs and EMIRGE is working successfully for about half the SAGs. I guess it can't fully reconstruct a single 16S from the 30-45 reads that do align. Would it be worth decreasing the MIN_DEPTH parameter to 1 for these samples or could this produce an erroneous 16S assembly?

csmiller commented 8 years ago

If you send or post the full output I can verify, but this looks like you are correct: no single 16S is reconstructed.

My first question is why you are using EMIRGE for a SAG? Does the 16S not pop out of the assembly?

Your approach to lower the MIN_DEPTH param. might work, though I would try lowering incrementally. With MIN_DEPTH=1, I worry the large number of erroneous mappings might make the task harder for EMIRGE. You could also try running emirge_amplicon.py if number of reads / SAG is not too high. Or preselect the 16S reads and try emirge_amplicon.py.

Could you post this on emirge google group? You might get other feedback, and I bet others would be curious to find out how these various options worked out for you.

Chris

cmorganl commented 8 years ago

Until recently, as far as I'm aware, several institutions experienced contamination issues during SAG processing where cells from one well would be carried over into other wells prior to amplification. I'm using EMIRGE to screen the SAGs and see if there are multiple 16S genes present, and it seems to be doing a good job so far. The 16S sequences would otherwise become convoluted during assembly normal DBG assembly. EMIRGE also provides the candidate taxa ID to make it even easier :)

Sure, I'll post my findings to the google group.