Run time and disk space required in version 0.5

waltergallegog commented 1 year ago

Hello. In the past I was able to run nanomonsv version 0.4 successfully on my data (WGS, around 70 GB each for control and tumor).

I updated to version 0.5 and now I have 3 problems:

1. It is taking much longer to run (v0.4 run time was around 2 hours; v0.5 took 8 hours, but I suspect it exited early with an error).

2. At some point during the process there was an error caused by disk space:

   02/07/2023 21:14:58 - nanomonsv.run - INFO - Counting the number of supprting read for the tumor by realignment of SV candidate segments (0) sort: write failed: /tmp/sortRQo79k: No space left on device
   02/07/2023 22:47:22 - nanomonsv.run - INFO - Counting the number of supprting read for the control by realignment of SV candidate segments (0) sort: write failed: /tmp/sortrb50YC: No space left on device

3. The output file does not contain any SVs (I assume because of the disk space issue). Only the VCF header is present.

For comparison, I ran nanomonsv v0.4 and v0.5 again on a smaller dataset containing only data from chromosome 1. These are the results:

| Metric | v0.4 | v0.5 |
| --- | --- | --- |
| Time for parsing tumor | 3 minutes | 3 minutes |
| Time for parsing control | 3 minutes | 3 minutes |
| Time for get | 12 minutes | 92 minutes |
| Number of variants detected | 15 | 30 |

Is the increase in runtime expected? Is there any way to mitigate it?
Is the threads option safe? I'm currently rerunning with 28 threads using the option --threads 28, but there does not seem to be much improvement. The total CPU utilization of the process is around 140% according to htop.
Have you noticed any similar disk space issues with v0.5? Do you think the disk space problems are related to the number of files (inodes) or to the available disk size?
I would like to run v0.5 on my entire WGS data, as the new version detects more variants, but the run time is prohibitively long.

Thanks for your feedback.

friend1ws commented 1 year ago

Thank you very much for your interest in nanomonsv.

> Is the increase in runtime expected? Is there any way to mitigate it?

From v0.5, we lowered the minimum SV size threshold from 100 to 50. This increases the number of candidate SVs to investigate. You can explicitly set --min_indel_size to 100 to restore the previous threshold. Alternatively, I recommend adding a control panel, which removes SVs commonly observed in the population beforehand.

> Is the threads option safe? I'm currently rerunning with 28 threads using the option --threads 28, but there does not seem to be much improvement. The total CPU utilization of the process is around 140% according to htop.

Could you try --processes instead? Currently, we do not recommend using --threads.
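The two suggestions so far can be combined in a single invocation. The sketch below only assembles and prints the command so it can be reviewed first. The file names and prefixes mirror the `nanomonsv get` command quoted later in this thread, the process count of 8 is an arbitrary placeholder, and a control panel is left out because its option is not named in this thread.

```shell
# Assumption: 8 processes is a placeholder; pick a value that fits your machine.
PROCESSES=8
# 100 restores the minimum SV size that v0.4 used, per the answer above.
MIN_INDEL_SIZE=100

# File names and prefixes are taken from the command quoted later in this thread.
NANOMONSV_CMD="nanomonsv get tumor tumor.bam ref.fa --control_prefix ctrl --control_bam ctrl.bam --use_racon --min_indel_size ${MIN_INDEL_SIZE} --processes ${PROCESSES}"

# Print the command for review; paste it into the shell once it looks right.
echo "$NANOMONSV_CMD"
```

Printing the command first makes it easy to confirm the threshold and process count before committing to a run that takes hours.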

> Have you noticed any similar disk space issues with v0.5? Do you think the disk space problems are related to the number of files (inodes) or to the available disk size?

The problem is the partition that holds the /tmp directory, which ran out of space. You can explicitly set TMPDIR to a directory on a larger disk, for example somewhere under your home directory.

export TMPDIR={the appropriate directory of your home disk}
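A slightly fuller sketch of the same idea also checks whether /tmp is short on space or on inodes before redirecting temporary files. The directory name `nanomonsv_tmp` is a hypothetical choice, and the sketch relies on GNU `sort` writing its temporary files to `$TMPDIR` when that variable is set.

```shell
# Assumption: "$HOME/nanomonsv_tmp" is a hypothetical scratch location;
# any directory on a partition with enough free space would do.
SCRATCH_DIR="${HOME}/nanomonsv_tmp"
mkdir -p "$SCRATCH_DIR"

# Compare free space (-h) and free inodes (-i) on /tmp and the new location.
df -h /tmp "$SCRATCH_DIR"
df -i /tmp "$SCRATCH_DIR"

# GNU sort (the tool that failed in the log above) honors TMPDIR.
export TMPDIR="$SCRATCH_DIR"
echo "Temporary files will now go to: $TMPDIR"
```

The export has to happen in the same shell session (or job script) that later launches nanomonsv, otherwise the setting is not inherited.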

> I would like to run v0.5 on my entire WGS data, as the new version detects more variants, but the run time is prohibitively long.

I would appreciate it if you could try the items I pointed out above and let me know if the problem persists.

waltergallegog commented 1 year ago

Thank you for the quick feedback.

I have tried the --processes option with the small single-chromosome dataset, and now the CPU utilization is as expected. The only problem is that two of the variants detected with 1 process are not detected when using multiple processes (28 in this case). I have attached the CSV files with the variants. The missing ones are:
```
1   143254717   d_110   A   <DEL>   .   Too_low_VAF
1   143264704   i_388   C   <INS>   .   Too_low_VAF
```

tumor.nanomonsv.result_human_28process.csv tumor.nanomonsv.result_human_1process.csv

The command I used is:

nanomonsv get tumor tumor.bam ref.fa --control_prefix ctrl --control_bam ctrl.bam --use_racon --processes 28

Let me know if you need more information (such as intermediate files, or a rerun with a debug option enabled).

Thanks for your advice on the minimum length parameter and the use of a control panel. I will check the values for my use case.
Thanks also for the info on the tmp folder. I will use this variable to point to a partition with more space.

BR.

friend1ws commented 1 year ago

Thanks. I will also look at some of our data here to see whether the results differ depending on the --processes setting. For what it is worth, both missing variants carry the Too_low_VAF flag, so I think you can safely filter them out.
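For anyone wanting to repeat the comparison between a single-process and a multi-process run, a minimal sketch follows. It assumes the result files are whitespace- or tab-separated with chromosome, position, and variant ID in the first three columns, as in the excerpt above. The stand-in rows and file names are hypothetical and exist only to make the example self-contained.

```shell
# Hypothetical stand-in data modelled on the excerpt above;
# replace these two files with the real result files.
WORK_DIR=$(mktemp -d)
printf '1\t1000\td_1\n1\t143254717\td_110\n1\t143264704\ti_388\n' > "$WORK_DIR/one_process.txt"
printf '1\t1000\td_1\n' > "$WORK_DIR/multi_process.txt"

# Reduce each row to a sorted, de-duplicated chrom:pos:id key.
# Assumption: columns 1-3 hold chromosome, position, and variant ID.
keys() { awk 'NF >= 3 { print $1 ":" $2 ":" $3 }' "$1" | sort -u; }

keys "$WORK_DIR/one_process.txt" > "$WORK_DIR/one.keys"
keys "$WORK_DIR/multi_process.txt" > "$WORK_DIR/multi.keys"

# comm -23 keeps only the lines unique to the first (single-process) file.
MISSING=$(comm -23 "$WORK_DIR/one.keys" "$WORK_DIR/multi.keys")
echo "$MISSING"
```

A header line, if present, produces the same key in both files and therefore drops out of the comparison.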

friend1ws / nanomonsv

Run time and disk space required in version 0.5 #30