bcgsc / NanoSim

Nanopore sequence read simulator

How many threads and how much memory are required at the training stage? #236

Open yaoxkkkkk opened 1 month ago

yaoxkkkkk commented 1 month ago

Thank you for developing NanoSim. I am using NanoSim to simulate ONT data. I ran the training stage with 32 threads and 256 GB of memory, but it reported an out-of-memory error. The command is:

    read_analysis.py genome \
        -i ZJYY_ont_filter.fq.gz \
        -rg nd.asm.fasta \
        -o ${home_dir}/01-data/ONT/${species}_training \
        --fastq \
        -t 32

The statistics for the ZJYY_ont_filter.fq.gz dataset are:

    file                   format  type   num_seqs         sum_len  min_len   avg_len  max_len
    ZJYY_ont_filter.fq.gz  FASTQ   DNA   1,544,988  43,308,647,713    2,000  28,031.7  246,468

When I run the command without the --fastq parameter, the training step finishes successfully.

lcoombe commented 1 month ago

Hi @yaoxkkkkk,

The amount of memory required depends heavily on the dataset you are training on. On my end, training with --fastq on the HG002 ONT dataset used for the latest pre-trained models required around 263 GB of RAM, which is more than your 256 GB, so that could be why you are seeing those errors. If you want to use --fastq, other options are to use our pre-trained model or to train on a subset of your reads.

Thank you for your interest in NanoSim!
Lauren
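
For the "subset of your reads" option mentioned above, one possible approach is to draw a uniform random sample of reads from the gzipped FASTQ before training. The following is a minimal sketch using single-pass reservoir sampling with only the Python standard library. It is not part of NanoSim. The function names, the output filename, and the sample size in the usage note are hypothetical, and it assumes a standard four-line-per-record FASTQ.

```python
import gzip
import random
from itertools import islice


def read_fastq(handle):
    """Yield one 4-line FASTQ record (as a tuple of strings) at a time."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def subsample_fastq(in_path, out_path, k, seed=0):
    """Write k randomly chosen reads from a gzipped FASTQ to a new gzipped FASTQ."""
    with gzip.open(in_path, "rt") as src:
        chosen = reservoir_sample(read_fastq(src), k, seed)
    with gzip.open(out_path, "wt") as dst:
        for record in chosen:
            dst.writelines(record)
    return len(chosen)
```

A call such as `subsample_fastq("ZJYY_ont_filter.fq.gz", "ZJYY_ont_subset.fq.gz", k=200_000)` would write the subset to a new file (hypothetical name and size) that can then be passed to `read_analysis.py` with `-i`. Note that the sketch holds all `k` sampled reads in memory at once, so with long reads `k` needs to be chosen with that in mind. How large a subset fits within a given memory budget during training is something to determine empirically.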