csmiller / EMIRGE

EMIRGE reconstructs full length ribosomal genes from short read sequencing data.

Unexpected crash at 17th iteration #30

Open ppericard opened 7 years ago

ppericard commented 7 years ago

Hi,

I encountered an unexpected crash in the 17th iteration of an Emirge run. This occurred with only one dataset, and I reproduced the bug on 2 different computers. There is no message in STDOUT telling me where the problem comes from. I have attached the run log.

Can you help me, please?

Thanks in advance

emirge.out.txt

epruesse commented 7 years ago

$ grep "reads with" emirge.out.txt | nl -v -1
    -1  # reads with at least one reported alignment: 32195 (0.10%)
     0  # reads with at least one reported alignment: 31287 (0.09%)
     1  # reads with at least one reported alignment: 31290 (0.09%)
     2  # reads with at least one reported alignment: 31281 (0.09%)
     3  # reads with at least one reported alignment: 31268 (0.09%)
     4  # reads with at least one reported alignment: 31401 (0.09%)
     5  # reads with at least one reported alignment: 31439 (0.09%)
     6  # reads with at least one reported alignment: 31542 (0.09%)
     7  # reads with at least one reported alignment: 31570 (0.09%)
     8  # reads with at least one reported alignment: 31600 (0.09%)
     9  # reads with at least one reported alignment: 31606 (0.09%)
    10  # reads with at least one reported alignment: 31623 (0.09%)
    11  # reads with at least one reported alignment: 31623 (0.09%)
    12  # reads with at least one reported alignment: 31636 (0.09%)
    13  # reads with at least one reported alignment: 31639 (0.09%)
    14  # reads with at least one reported alignment: 31652 (0.09%)
    15  # reads with at least one reported alignment: 31350 (0.09%)
    16  # reads with at least one reported alignment: 15 (0.00%)

About 31k of your reads mapped to the SSU and then to the list of candidates up until iteration 15. In iteration 15, 127 candidates were left, 24 of which were removed by clustering. The remaining 103 candidates had no reads mapping to them in iteration 16 and were culled. That led to an empty candidate file and the crash.
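The drop described above can also be located automatically. The following is a minimal sketch, assuming the log contains one "reads with at least one reported alignment" line per mapping round, as in the grep output above. The function names and the `min_fraction` threshold are hypothetical and not part of EMIRGE.

```python
import re

# Matches mapper summary lines such as:
#   "# reads with at least one reported alignment: 31350 (0.09%)"
ALIGNED_RE = re.compile(r"reads with at least one reported alignment:\s*(\d+)")


def aligned_counts(log_text):
    """Return the aligned-read count of each mapping round, in log order."""
    return [int(m.group(1)) for m in ALIGNED_RE.finditer(log_text)]


def first_collapse(counts, min_fraction=0.5):
    """Return the index of the first round whose count falls below
    min_fraction of the previous round's count, or None if there is none.

    min_fraction is an arbitrary illustrative threshold (assumption).
    """
    for i in range(1, len(counts)):
        if counts[i] < min_fraction * counts[i - 1]:
            return i
    return None
```

Calling `first_collapse(aligned_counts(open("emirge.out.txt").read()))` returns a zero-based position in the log. Subtract 1 to get the iteration number in the `nl -v -1` numbering used above, where the initial mapping is round -1.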

Do the sequences from the last iteration that worked make sense?

@csmiller: Do you have an idea as to why this sometimes happens? Should we at least catch this and print an appropriate error message?
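A guard of the kind suggested here might look roughly like the sketch below. It is not EMIRGE's actual code: the names `NoCandidatesError`, `count_fasta_records`, and `require_candidates` are hypothetical, and it assumes the candidates are written to a FASTA file that can be checked after each iteration.

```python
import os


class NoCandidatesError(RuntimeError):
    """Raised when an iteration leaves no candidate sequences (hypothetical)."""


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def require_candidates(fasta_path, iteration):
    """Return the number of candidates, or fail with a readable message
    if the candidate file is missing or contains no records."""
    n = count_fasta_records(fasta_path) if os.path.exists(fasta_path) else 0
    if n == 0:
        raise NoCandidatesError(
            f"iteration {iteration}: no candidate sequences left in "
            f"{fasta_path}; possibly all candidates were culled because "
            "no reads mapped to them"
        )
    return n
```

Checking right after the culling step would turn the silent crash into an error that names the iteration and the empty file.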

ppericard commented 7 years ago

Thank you for your explanation. I now understand why it stopped, technically.

However, the result makes no biological sense. This run used a sequencing dataset that had been quality-cleaned and adapter-trimmed. I also ran Emirge on the raw reads of the same dataset, and it gave me between 50 and 100 rRNA sequences, depending on the parameters. I know that this dataset contains species numbering in the low hundreds, so I would expect at least a few dozen scaffolds from an Emirge assembly of the cleaned reads. Getting none makes no sense. And if I understand correctly how Emirge works, adding more iterations should give better-resolved results, not a complete disappearance of the reference sequences.

Could you look into this behavior further? I'm certain this is not even close to the results I should have gotten from Emirge with this dataset. If it helps, I reran some other assembly runs and, for this dataset, the problem started to appear when I set the -j parameter to 1. I previously had no problems with such settings on other datasets.