PROBIC / mSWEEP

mSWEEP High-resolution sweep metagenomics using fast probabilistic inference
MIT License

segmentation fault / persistent memory issues during build log likelihood array #20

Closed clb21565 closed 1 year ago

clb21565 commented 1 year ago

Hi Tommi, I'm trying to run mSWEEP and I keep getting a segmentation fault.

mSWEEP-v1.6.1 abundance estimation
Parsing arguments
Reading the input files
reading group indicators
read 5580 group indicators
reading pseudoalignments
read 2612421 unique alignments
Building log-likelihood array
/cm/local/apps/slurm/var/spool/job1140150/slurm_script: line 12: 215562 Segmentation fault mSWEEP -t 62 --themisto-1 230219_fullFq_fullRefset/klebs/ali_1.aln.gz --themisto-2 230219_fullFq_fullRefset/klebs/ali_2.aln.gz -o 230219_fullFq_fullRefset/klebs/msweep -i 230201_run/klebs/ref_clu.txt --write-probs --gzip-probs

I have tried reducing the number of threads and increasing the memory, but still no luck. Any suggestions?

tmaklin commented 1 year ago

Hi Connor, v1.6.1 has a bug that can cause a segfault for deeply sequenced samples with lots of diversity. Can you try with v1.6.3? I think I fixed it there.

If v1.6.3 does not work then I'll speed up the release of v2, which introduces several changes to handle very large inputs better with lower resource use, but I still need to rewrite the documentation for v2.

clb21565 commented 1 year ago

still not working : /

(base) seff 1141103
Job ID: 1141103
Cluster: tinkercliffs
User/Group: clb21565/clb21565
State: FAILED (exit code 139)
Nodes: 1
Cores per node: 128
CPU Utilized: 9-15:46:20
CPU Efficiency: 26.94% of 35-20:22:24 core-walltime
Job Wall-clock time: 06:43:18
Memory Utilized: 271.77 GB
Memory Efficiency: 27.40% of 992.00 GB

FYI

thanks for all the continued tech support!
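
A note on the exit code in the seff output above: shells report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11, which is SIGSEGV on Linux. A minimal sketch for decoding it, assuming bash (the variable names are only illustrative):

```shell
# Exit code reported by seff for the failed job.
exit_code=139

# Codes above 128 mean the process was terminated by signal (code - 128).
signal_number=$((exit_code - 128))

# kill -l translates a signal number into its name.
signal_name=$(kill -l "$signal_number")

echo "exit code $exit_code = 128 + $signal_number ($signal_name)"
```

So the v1.6.3 run most likely ended in a segmentation fault as well, rather than being killed for exceeding its memory allocation.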

tmaklin commented 1 year ago

Oh that's frustrating! Thanks for persisting with trying to run the methods despite all the problems. Would you be willing to give the v2 prerelease code a go? I have been using it internally and it's almost ready for release, so if that works I'll prioritise getting it out asap.

You can install v2 with

git clone https://github.com/PROBIC/mSWEEP -b v2.0.0-prerelease2 mSWEEP-v2.0.0

cd mSWEEP-v2.0.0
mkdir build
cd build
cmake .. && make -j

You will need to make some changes to your mSWEEP call to get the same output. The following should work

mSWEEP -t 62 --themisto-1 230219_fullFq_fullRefset/klebs/ali_1.aln.gz --themisto-2 230219_fullFq_fullRefset/klebs/ali_2.aln.gz -o 230219_fullFq_fullRefset/klebs/msweep -i 230201_run/klebs/ref_clu.txt --print-probs --verbose | tr '\t' ',' | gzip -c > 230219_fullFq_fullRefset/klebs/msweep"_probs.csv.gz"
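
One caveat with piping the output like this: by default the shell reports the exit status of the last command in a pipeline (gzip here), so a crash in mSWEEP could go unnoticed in a job script. A minimal sketch of the difference, assuming bash and using `false` as a stand-in for a failing mSWEEP call:

```shell
# Each pipeline runs in its own bash so the option does not leak between them.

# Default behaviour: the pipeline status is that of the last command (gzip).
bash -c 'false | tr "\t" "," | gzip -c > /dev/null'
status_default=$?

# With pipefail, the pipeline fails if any stage fails.
bash -c 'set -o pipefail; false | tr "\t" "," | gzip -c > /dev/null'
status_pipefail=$?

echo "default: $status_default, pipefail: $status_pipefail"
```

Adding `set -o pipefail` near the top of the Slurm script should make a failed mSWEEP run show up in the job state.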

v2 contains changes that should make running large datasets easier, including supporting reading alignment files that have been compressed with alignment-writer. These can be 10-100x smaller than the default themisto output and are faster to read in, so might be worth looking into if you have limited disk space available to you.

If you only need the bins from mSWEEP and the probs.csv.gz file is not interesting to you, v2 supports creating the bins directly from mSWEEP by adding --bin-reads and --min-abundance and/or --target-groups to choose which bins you want, similarly to mGEMS. You'll still need to run mGEMS extract to get the reads from the bins, but skipping the probs.csv.gz file saves both time and disk space, and avoids the slow re-reading of the pseudoalignment in the mGEMS bin call.

The command to extract all bins with minimum abundance 0.01 would look like this

mSWEEP -t 62 --themisto-1 230219_fullFq_fullRefset/klebs/ali_1.aln.gz --themisto-2 230219_fullFq_fullRefset/klebs/ali_2.aln.gz -o 230219_fullFq_fullRefset/klebs/msweep -i 230201_run/klebs/ref_clu.txt --bin-reads --min-abundance 0.01

Running mSWEEP --help gives up-to-date information about the v2 code, but the README for that version is outdated.

clb21565 commented 1 year ago

thanks for the detailed response! I tried to install v2, but ran into a cmake(?) issue-- pivoted away for a bit but will be back at this soon.

tmaklin commented 1 year ago

great, please send me the error/log from cmake if the issue persists and I'll have a look.

tmaklin commented 1 year ago

I noticed a bug in the mSWEEP v2.0.0 prerelease that I pointed you to. Depending on how mSWEEP was compiled, it could cause incorrect results or a segmentation fault for large alignments (number of reads x number of reference sequences > 2^32). This has been fixed in the v2.0.0-prerelease3 tag, so if you're trying that version you'd need to change the git clone call above to

git clone https://github.com/PROBIC/mSWEEP -b v2.0.0-prerelease3 mSWEEP-v2.0.0
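
To check whether an input falls in the range described above (number of reads x number of reference sequences > 2^32), the product can be computed with shell arithmetic. A rough sketch, assuming a shell with 64-bit arithmetic such as bash, and using the counts from the v1.6.1 log earlier in this thread (unique alignments and group indicators) as stand-ins for the two dimensions; the exact quantities mSWEEP multiplies internally may differ:

```shell
# Counts taken from the v1.6.1 log output earlier in this thread.
n_alignments=2612421   # "read 2612421 unique alignments"
n_groups=5580          # "read 5580 group indicators"

# Largest count that still fits in 32 bits, plus one.
limit=$((1 << 32))

# Size of an alignments-by-groups array, in cells.
cells=$((n_alignments * n_groups))

if [ "$cells" -gt "$limit" ]; then
  echo "$cells cells: above 2^32 ($limit), in the affected range"
else
  echo "$cells cells: below 2^32 ($limit)"
fi
```

With these numbers the product is well above 2^32, so an index held in a 32-bit integer would wrap around for an array of this size.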

clb21565 commented 1 year ago

hi there, running into an error trying to update/cmake :

-- The C compiler identification is GNU 11.2.0
-- The CXX compiler identification is GNU 11.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: /apps/easybuild/software/tinkercliffs-rome/GCCcore/11.2.0/bin/cc
-- Check for working C compiler: /apps/easybuild/software/tinkercliffs-rome/GCCcore/11.2.0/bin/cc - broken
CMake Error at /apps/easybuild/software/tinkercliffs-rome/CMake/3.21.1-GCCcore-11.2.0/share/cmake-3.21/Modules/CMakeTestCCompiler.cmake:69 (message):
The C compiler

"/apps/easybuild/software/tinkercliffs-rome/GCCcore/11.2.0/bin/cc"

is not able to compile a simple test program.

It fails with the following output:

Change Dir: /projects/ciwars/pathogen_annotation/pathogen_workdir/klebs_contd/mSWEEP-v2.0.0/build/CMakeFiles/CMakeTmp

Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_26c08/fast && /usr/bin/gmake  -f CMakeFiles/cmTC_26c08.dir/build.make CMakeFiles/cmTC_26c08.dir/build
gmake[1]: Entering directory `/projects/ciwars/pathogen_annotation/pathogen_workdir/klebs_contd/mSWEEP-v2.0.0/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_26c08.dir/testCCompiler.c.o
/apps/easybuild/software/tinkercliffs-rome/GCCcore/11.2.0/bin/cc    -o CMakeFiles/cmTC_26c08.dir/testCCompiler.c.o -c /projects/ciwars/pathogen_annotation/pathogen_workdir/klebs_contd/mSWEEP-v2.0.0/build/CMakeFiles/CMakeTmp/testCCompiler.c
Linking C executable cmTC_26c08
/apps/easybuild/software/tinkercliffs-rome/CMake/3.21.1-GCCcore-11.2.0/bin/cmake -E cmake_link_script CMakeFiles/cmTC_26c08.dir/link.txt --verbose=1
/apps/easybuild/software/tinkercliffs-rome/GCCcore/11.2.0/bin/cc -rdynamic CMakeFiles/cmTC_26c08.dir/testCCompiler.c.o -o cmTC_26c08 
/usr/bin/ld.gold: --push-state: unknown option
/usr/bin/ld.gold: use the --help option for usage information
collect2: error: ld returned 1 exit status
gmake[1]: *** [cmTC_26c08] Error 1
gmake[1]: Leaving directory `/projects/ciwars/pathogen_annotation/pathogen_workdir/klebs_contd/mSWEEP-v2.0.0/build/CMakeFiles/CMakeTmp'
gmake: *** [cmTC_26c08/fast] Error 2

FWIW -- our cluster has multiple versions of cmake/gcc, i've been using the most up to date:

CMake/3.21.1-GCCcore-11.2.0

this looks like some sort of dependency is missing?

tmaklin commented 1 year ago

this seems like an issue with your cluster environment, I think the relevant lines are

/usr/bin/ld.gold: --push-state: unknown option
/usr/bin/ld.gold: use the --help option for usage information
collect2: error: ld returned 1 exit status

which means your compiler toolchain is somehow incorrectly loaded. You could give the older versions of GCC a try -- GCC 7.x.x and newer have C++17 support and should be fine.

Anyway I've put a precompiled binary for Linux up as a draft release at https://github.com/PROBIC/mSWEEP/releases/tag/v2.0.0-prerelease3 which you can use.

I've also identified the bug in v1.6.3 now (same issues with the overflow that I mentioned earlier), I'll see if it's possible to make a minor release fixing this.

clb21565 commented 1 year ago

FYI - with the precompiled binary, it is working now! :-] Unfortunately, it failed without an error message when I tried to use it on our deeply sequenced data (but it worked on smaller data). Not to worry though, I have changed my strategy for analyzing these data to be a bit kinder to the HPC and mGEMS/mSWEEP, and I think it is a stronger way of doing it. Thanks for all the tech support! Closing for now.

tmaklin commented 1 year ago

good to hear, out of curiosity how many reference sequences and reads did you have in the case that did not work?

clb21565 commented 1 year ago

the sample that failed was ~800 million reads against ~5,500 references

clb21565 commented 1 year ago

Also, I noticed that the mSWEEP output now ends in a 'tsv' rather than a 'csv'

just wanted to confirm this isn't something I did on my end

tmaklin commented 1 year ago

thanks, I've never run the method on > 100 million reads so I'll have to create a test case for that and see what's causing the crash.

The csv -> tsv change is intentional in v2 so that the output files have consistent formatting. If you want to convert the tsv file into the old format you can do that by running cat msweep_output.tsv | tr '\t' ',' > msweep_output.csv