SebastianMeyer1989 / UltraPlexer

The UltraPlexer is a k-mer-based tool that assigns non-barcoded long reads generated by Oxford Nanopore Technology to isolates by matching them against barcoded short reads generated by Illumina technology.
MIT License

Memory requirements #6

Closed arredondo23 closed 1 year ago

arredondo23 commented 3 years ago

Dear @AlexanderDilthey @SebastianMeyer1989 ,

Thanks for this fantastic tool and its brilliant publication https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01974-9

I would like to test the approach on a set of E. coli genomes that were multiplexed and barcoded into the same MinION flow cell and for which we have Illumina data. During the installation of the UltraPlexer, I ran into problems installing the Perl module Math::GSL, as mentioned in another closed issue, but I managed to run the Docker version and even created a Singularity container to run in an HPC environment.

I could successfully run the example prediction, so I think the installation is fine. However, with my dataset I am running into the following two problems:

    Fri Aug 13 10:25:53 2021 List of colours: /opt/UltraPlexer/cortex_temp/binariesList_pool_19 (contains one filelist per colour).
    Load data into consecutive colours starting at 0
    Command terminated by signal 9
    Command being timed: "/opt/cortex-1.0.5.21/bin/cortex_var_31_c20 --mem_height 20 --mem_width 100 --colour_list /opt/UltraPlexer/cortex_temp/binariesList_pool_19 --kmer_size 19 --align /opt/UltraPlexer/cortex_temp/readsList_pool_19,no --align_input_format LIST_OF_FASTA --max_read_len 222525"
    User time (seconds): 16.80
    System time (seconds): 4.46
    Percent of CPU this job got: 98%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:21.60
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 10291316
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 181
    Minor (reclaiming a frame) page faults: 5129264
    Voluntary context switches: 367
    Involuntary context switches: 2207
    Swaps: 0
    File system inputs: 1824080
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Do you have any ideas about how I can circumvent this? Should the algorithm work if I specify just one sample in --samples_file and run the UltraPlexer 96 independent times?

Many thanks for your help!

Sergio

SebastianMeyer1989 commented 3 years ago

Good morning Sergio, thank you for trying out the UltraPlexer. I hope I can help you with your query.

Since the example worked, I also believe that your installation should be fine.

  1. The number of samples: At first we only worked with up to 60 isolates, hence the limitation in the script. Later we also tried higher numbers, for which a new cortex binary had to be downloaded. You can either look for a binary file that suits your number of isolates, or use the file I just uploaded to the master branch of the UltraPlexer. It is called "cortex_var_63_c172" and works for up to 172 colours/isolates; we also used it for runs with 100 isolates and it worked fine. Copy it into your cortex binary folder. In either case, you need to change the UltraPlexer.pl script in lines 1447 - 1457 so that it recognizes the new binary. Just add

    elsif($ncolours <= 172) { return "${cortex_bin_dir}/cortex_var_63_c172"; }

after the 60-isolate case in your installation of the UltraPlexer script.

Using only one sample in the sample file and doing 96 runs will not really work, since each run predicts where the reads belong by comparing the identity of the reads across the different samples. But if we manage to solve the memory issue, this should not be a problem.
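The binary-selection step described above can be illustrated with a small shell sketch. The real logic is Perl, in UltraPlexer.pl around lines 1447 - 1457; only the two binary names mentioned in this thread are used here, the intermediate branches of the actual script are elided, and the path is taken from the log output above:

```shell
# Illustrative re-implementation of the cortex-binary dispatch
# (assumptions: paths and thresholds other than 20/172 are elided)
cortex_bin_dir=/opt/cortex-1.0.5.21/bin

pick_cortex_binary() {
  local ncolours=$1
  if [ "$ncolours" -le 20 ]; then
    # binary seen in the log above
    echo "$cortex_bin_dir/cortex_var_31_c20"
  elif [ "$ncolours" -le 172 ]; then
    # the newly uploaded binary, covered by the added elsif branch
    echo "$cortex_bin_dir/cortex_var_63_c172"
  else
    echo "no cortex binary for $ncolours colours" >&2
    return 1
  fi
}

pick_cortex_binary 96
```

With 96 isolates this selects cortex_var_63_c172, which is exactly what the added `elsif` branch achieves in the Perl script.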

  2. The memory issue: In the time command output I see "Maximum resident set size (kbytes): 10291316", which is about 10 GB of memory. So the command used ~10 GB before it was killed. How much memory is available for the UltraPlexer run? Only these 10 GB?

The provided example (which worked on your machine) is very small and needs less than 10 GB of memory, but a run with more isolates/data will likely need a lot more. In our publication, a set with 48 samples (and a lot of reads) used 70 CPU hours and 175 GB of memory for the demultiplexing step.
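On an HPC cluster, the memory available to a Singularity container is normally set by the job scheduler rather than by Singularity itself. A minimal job-script sketch, assuming a SLURM cluster (the image name, bind path, program arguments, and the 200 GB figure are assumptions; the memory number is simply headroom above the ~175 GB reported above):

```shell
#!/bin/bash
#SBATCH --job-name=ultraplexer
#SBATCH --mem=200G          # headroom above the ~175 GB seen for 48 samples
#SBATCH --time=96:00:00     # generous wall time for ~70 CPU hours of work
#SBATCH --cpus-per-task=1

# Hypothetical image and bind mount; adjust to your setup.
singularity exec --bind /data:/data ultraplexer.sif \
    perl /opt/UltraPlexer/UltraPlexer.pl <your usual arguments>
```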

So you probably want to try providing more memory first.

Another known issue is the one described on the GitHub page. If you still run into memory allocation problems after providing more memory, you could try this:

"Known issue: If the UltraPlexer run produces an error like could not allocate hash table of size 209715200 Error: Giving up - unable to allocate memory for the hash table and subsequently stops due to not finding a certain file, please try adding the option "--allSamples_cortex_height 20" to the first UltraPlexer call. This error occurs if the algorithm cannot allocate the memory it estimates it needs to store all calculated data. The aforementioned option reduces the memory allocated by the UltraPlexer, which in most cases solves the problem."

Please note that this will probably not work with only 10 GB of memory.
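To see roughly why lowering the height helps: the hash-table size in the quoted error, 209715200, equals 100 * 2^21, which suggests (as an inference, not something stated in this thread) that cortex sizes its table as mem_width * 2^mem_height slots. Under that assumption, dropping the height from 21 to 20 halves the table:

```shell
# Inferred relationship: slots = mem_width * 2^mem_height
# (209715200 in the quoted error matches mem_width=100, height=21)
echo "height 21: $((100 * 2**21)) slots"
echo "height 20: $((100 * 2**20)) slots"
```

This is only the slot count; the bytes per slot grow with k-mer size and the number of colours, so the absolute memory saving is larger than these numbers alone suggest.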

Please let me know whether any of this helped and if you have further issues.

Sebastian

arredondo23 commented 3 years ago

Thanks, Sebastian, for the detailed answer!

I will pull the repo and modify those lines in UltraPlexer.pl to allow the use of more than 60 samples. I will also check whether there is any possibility to increase the memory assigned to the Singularity job; thanks for pointing out the memory and CPU hours I should expect for the run.

Will get back to you if I have any other issues/questions.

Thanks!

Sergio