AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

Error: no input sequences to analyze #20

Closed aelbehery closed 4 years ago

aelbehery commented 4 years ago

While running VIBRANT yesterday, I got the following error:

Traceback (most recent call last):
  File "./VIBRANT-master/scripts/VIBRANT_extract_nucleotide.py", line 23, in <module>
    db_dict.update({name:seq})
MemoryError
Traceback (most recent call last):
  File "./VIBRANT-master/scripts/VIBRANT_extract_nucleotide.py", line 23, in <module>
    db_dict.update({name:seq})
MemoryError
Traceback (most recent call last):
  File "./VIBRANT-master/scripts/VIBRANT_extract_nucleotide.py", line 23, in <module>
    db_dict.update({name:seq})
MemoryError

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

Error:  no input sequences to analyze.

But it did not abort; it continued normally and finished the analysis. I would like to know what this error means and whether it will have any influence on the final results.

I ran the command with the option -t 24 .

I repeated the analysis once more and got the same error. What worries me is that the number of detected phage contigs was not the same between the two runs (201316 contigs for the first run and 200968 for the second). Is this normal?

I guess this could be a bug related to parallelization, because when I repeated the run a third time without hyperthreading, I did not get this message. Of course, that run is taking ages and has not finished yet! What do you think?

KrisKieft commented 4 years ago

Hi,

I've not seen this issue before, but the top section of the error says MemoryError. Based on the number of phages you're finding (>200,000), you're working with a very large dataset, or at least a large virome dataset. VIBRANT reads a lot of data into memory, specifically in the part where you hit the error: at this step VIBRANT temporarily reads all the scaffolds into memory (a Python dictionary) in order to parallelize the runs. I think you may have run out of memory (as opposed to storage). Unfortunately, the only solution I have for you is to split your input file into multiple parts and run them separately so the memory burden isn't as high. If this assessment is correct, then you get a different number of phages because different parallel runs quit due to memory limits (Error: no input sequences to analyze), so whatever data was in the files that errored out is lost to further analysis. I would suggest running on CyVerse, but the support team is having difficulty getting v1.2.0 installed. Hope that helps.

Kris
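
For readers who want to try the splitting approach suggested above, here is a minimal sketch in Python. It is not part of VIBRANT: the function names `read_fasta` and `split_fasta` and the output naming scheme are hypothetical. It streams the input one record at a time, so the splitter itself never holds the whole file in memory, and it distributes records round-robin, which balances the number of scaffolds per part rather than the number of bases.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(in_path, n_parts, out_prefix):
    """Write records round-robin into n_parts files and return their paths.

    The output names (<out_prefix>.partN.fasta) are an assumption made for
    this example, not a convention required by VIBRANT.
    """
    paths = [f"{out_prefix}.part{i + 1}.fasta" for i in range(n_parts)]
    outs = [open(path, "w") for path in paths]
    try:
        with open(in_path) as handle:
            for i, (header, seq) in enumerate(read_fasta(handle)):
                # Each sequence is written on a single line.
                outs[i % n_parts].write(f">{header}\n{seq}\n")
    finally:
        for out in outs:
            out.close()
    return paths
```

A call such as `split_fasta("scaffolds.fasta", 10, "scaffolds")` (file names are placeholders) would produce ten files that can each be passed to VIBRANT as a separate run.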

aelbehery commented 4 years ago

Hi Kris,

Thanks for your quick response. It's really helpful and makes sense. So, how large do you recommend a single file should be after splitting? I have 28 cores with 2 hyperthreads per core and 64GB RAM.

Ali

KrisKieft commented 4 years ago

This error occurs very early in the script, so you may be able to start a run and then track memory usage. This will help you determine what file size is tractable on your machine. I have never had this problem, but I'm also working with 1TB of RAM, though 64GB is still a lot. You can use the Linux command free, which shows your memory usage as a snapshot in time (you'll have to keep running free to track it over time). free -h will give you the info in human-readable units. Your 64GB of RAM should appear in the leftmost column. "Swap" memory, to my knowledge, is just a small overflow pool.
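
As an alternative to re-running free by hand, the following is a rough Python sketch that samples memory usage at a fixed interval and reports the peak. It assumes a Linux system where `/proc/meminfo` exposes the `MemTotal` and `MemAvailable` fields (values in kB); the function names and the default interval are hypothetical choices for this example.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def used_gib(info):
    """Approximate memory in use as MemTotal - MemAvailable, converted from kB to GiB."""
    return (info["MemTotal"] - info["MemAvailable"]) / 1024 ** 2


def monitor(interval_s=5.0, samples=12, path="/proc/meminfo"):
    """Print memory usage every interval_s seconds and return the peak seen, in GiB."""
    peak = 0.0
    for _ in range(samples):
        with open(path) as handle:
            info = parse_meminfo(handle.read())
        used = used_gib(info)
        peak = max(peak, used)
        total = info["MemTotal"] / 1024 ** 2
        print(f"used {used:.1f} GiB of {total:.1f} GiB")
        time.sleep(interval_s)
    return peak
```

Running `monitor()` in a second terminal while the early steps of a run execute gives a rough idea of how close a given input size comes to the machine's limit.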

aelbehery commented 4 years ago

Thanks Kris; I split the input file into 10 parts and it went well with no errors.