COMBINE-lab / simpleaf

A rust framework to make using alevin-fry even simpler
BSD 3-Clause "New" or "Revised" License

"Error: piscem mapping failed with exit status ExitStatus(unix_wait_status(139))" #99

Closed wmacnair closed 10 months ago

wmacnair commented 11 months ago

Hi simpleaf team

First off, a thank you: we've been using simpleaf to do lots of mapping for a single cell atlas project, and it has been making everything faster and smoother πŸš€ (especially being able to get S/U/A counts super fast). This is awesome πŸ˜„

However, there is a small proportion of samples where it doesn't work, and we can't really see any reason why. I've pasted some example output below.

This doesn't appear to be driven by, e.g., tiny file sizes due to the RNA-seq not working:

-rw-r----- 1 macnairw games 1.2G Jul 23 21:20 WM177_POOL4_S9_L001_I1_001.fastq.gz
-rw-r----- 1 macnairw games 3.8G Jul 23 21:21 WM177_POOL4_S9_L001_R1_001.fastq.gz
-rw-r----- 1 macnairw games 8.5G Jul 23 21:22 WM177_POOL4_S9_L001_R2_001.fastq.gz
-rw-r----- 1 macnairw games 1.2G Jul 23 21:22 WM177_POOL4_S9_L002_I1_001.fastq.gz
-rw-r----- 1 macnairw games 3.7G Jul 23 21:22 WM177_POOL4_S9_L002_R1_001.fastq.gz
-rw-r----- 1 macnairw games 8.4G Jul 23 21:23 WM177_POOL4_S9_L002_R2_001.fastq.gz

Right now it's difficult to diagnose what is going on... Most samples in this experiment work, a handful don't, and I don't really see anything I can work with from the error message. Any suggestions??

Thanks! Will

(Tagging @marusakod as she has had similar problems.)

simpleaf --version
# simpleaf 0.14.1

# simpleaf configuration
export ALEVIN_FRY_HOME="/projects/site/pred/neurogenomics/resources/scprocess_data/data/alevin_fry_home"
simpleaf set-paths

# mess about with input strings
R1_fs=$(echo ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R1_001.fastq.gz ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R1_001.fastq.gz | sed "s/ /,/g")
R2_fs=$(echo ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R2_001.fastq.gz ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R2_001.fastq.gz | sed "s/ /,/g")
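# (an alternative, possibly less error-prone way to build the same comma-separated
#  lists; just a sketch -- the glob assumes the two-lane naming used above)
# R1_fs=$(ls ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L00?_R1_001.fastq.gz | paste -sd, -)
# R2_fs=$(ls ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L00?_R2_001.fastq.gz | paste -sd, -)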

# simpleaf quantification
simpleaf quant \
  --reads1 $R1_fs \
  --reads2 $R2_fs \
  --threads 16 \
  --index $ALEVIN_FRY_HOME/human_2020-A_splici/index \
  --chemistry 10xv3 --resolution cr-like \
  --expected-ori fw \
  --t2g-map $ALEVIN_FRY_HOME/human_2020-A_splici/index/t2g_3col.tsv \
  --unfiltered-pl $CELLRANGER_DIR/3M-february-2018.txt --min-reads 1 \
  --output ./Macnair_2022/output/ms01_alevin_fry/af_WM177

# 2023-07-24T08:42:52.328031Z  INFO simpleaf::simpleaf_commands::quant: piscem map-sc cmd : /projects/site/pred/neurogenomics/resources/scprocess_data/conda/af/bin/piscem map-sc --index /projects/site/pred/neurogenomics/resources/scprocess_data/data/alevin_fry_home/human_2020-A_splici/index/piscem_idx --threads 8 -o ./Macnair_2022/output/ms01_alevin_fry/af_WM177/af_map -1 ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R1_001.fastq.gz,./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R1_001.fastq.gz -2 ./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R2_001.fastq.gz,./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R2_001.fastq.gz --geometry chromium_v3
# 2023-07-24T08:43:31.448195Z ERROR simpleaf::utils::prog_utils: command unsuccessful (signal: 11 (SIGSEGV) (core dumped)): "/projects/site/pred/neurogenomics/resources/scprocess_data/conda/af/bin/piscem" "map-sc" "--index" "/projects/site/pred/neurogenomics/resources/scprocess_data/data/alevin_fry_home/human_2020-A_splici/index/piscem_idx" "--threads" "8" "-o" "./Macnair_2022/output/ms01_alevin_fry/af_WM177/af_map" "-1" "./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R1_001.fastq.gz,./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R1_001.fastq.gz" "-2" "./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R2_001.fastq.gz,./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R2_001.fastq.gz" "--geometry" "chromium_v3"
# 2023-07-24T08:43:31.449914Z ERROR simpleaf::utils::prog_utils: stderr :
# ====
# 2023-07-24T08:42:52.373983Z  INFO piscem: cmd: ["sc_ref_mapper", "-i", "/projects/site/pred/neurogenomics/resources/scprocess_data/data/alevin_fry_home/human_2020-A_splici/index/piscem_idx", "-g", "chromium_v3", "-1", "./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R1_001.fastq.gz,./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R1_001.fastq.gz", "-2", "./Macnair_2022/data/fastqs/WM177_POOL4_S9_L001_R2_001.fastq.gz,./Macnair_2022/data/fastqs/WM177_POOL4_S9_L002_R2_001.fastq.gz", "-t", "8", "-o", "./Macnair_2022/output/ms01_alevin_fry/af_WM177/af_map"]
# [2023-07-24 10:42:52.376] [info] loading index from /projects/site/pred/neurogenomics/resources/scprocess_data/data/alevin_fry_home/human_2020-A_splici/index/piscem_idx
# [2023-07-24 10:42:53.777] [info] done loading index
# processed (13000000) reads; (11168505) had mappings.====
# Error: piscem mapping failed with exit status ExitStatus(unix_wait_status(139))
rob-p commented 11 months ago

Hi @wmacnair,

Thanks so much for the detailed report, and for your kind words regarding simpleaf. This is exactly the type of effect we hoped it would have :).

So, the error looks to arise from piscem-cpp (the mapper, and one of the few components we have not yet fully migrated over to Rust, though @TheJasonFan is working on rectifying that). What's interesting is that it appears to be a segfault, but it shows up after the mapping is finished (we get to see how many reads were processed, how many had mappings, etc.).

We'd be happy to take a look, figure out what is going on, and try to fix this. A couple of quick questions. Is it always the same samples that fail? If you run the sample again, does it fail? Would it be possible to share one of the failing samples privately, just for the purposes of reproducing and fixing this issue? If not, would it be possible to see if there is a small subset of the sample on which you can reproduce the issue and share that? Typically with these types of things, it's some particular aspect of the data that triggers a corner case in the code that "typical" data doesn't, leading to the segfault (likely a memory error).

Thanks! Rob

wmacnair commented 11 months ago

Hi @rob-p

Developing methods can sometimes be a thankless task, so I try to give some encouragement when I can :)

Re your questions:

These samples are not yet published, so sharing is a bit more tricky, but @marusakod has had the same error (again, repeatedly) on the BioSample SAMN13262712 in this repo: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA589018&o=acc_s%3Aa There should be 3 sets (R1 + R2) of fastq files. (Thanks @marusakod!)

Hopefully that is enough to start with. I'll also have a look at subsampling my files and see if I can reproduce the error.
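
One way I might do the subsampling (assuming seqtk is on the path; using the same seed keeps R1 and R2 in sync):

seqtk sample -s100 WM177_POOL4_S9_L001_R1_001.fastq.gz 1000000 | gzip > sub_R1_001.fastq.gz
seqtk sample -s100 WM177_POOL4_S9_L001_R2_001.fastq.gz 1000000 | gzip > sub_R2_001.fastq.gz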

Cheers! Will


wmacnair commented 11 months ago

Ah MaruΕ‘a suggests that sample 10X145-6 here might be easier to work with, as that already has R1 and R2 files labelled: https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X145-6/

rob-p commented 11 months ago

Thanks! I'll give this a look and report back when I have some idea what's up.

In the meantime, could you also please share the command and files you used to generate the index/reference?

Thanks, Rob

rob-p commented 11 months ago

@wmacnair & @marusakod,

I grabbed one of the samples linked above (10X145-6_S19_L003), and everything completed successfully on that sample. I noted, however, that the link:

https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X145-6/

has several samples. I will start looking at the others, but if you could point me at a particular problem sample, that would be extra useful!

Thanks, Rob

wmacnair commented 11 months ago

I used the 10x reference genome, here.

These are the commands I used:

#!/bin/bash
#BSUB -J simpleaf_build_index_human
#BSUB -W 24:00
#BSUB -n 24
#BSUB -R "rusage[mem=8192]"
#BSUB -R "span[hosts=1]"
#BSUB -q long
#BSUB -eo /home/macnairw/packages/scProcess/.log/simpleaf_build_index_human.err
#BSUB -oo /home/macnairw/packages/scProcess/.log/simpleaf_build_index_human.out

# how to call:
# bsub < /home/macnairw/packages/scProcess/scripts/simpleaf_build_index_human.sh

# set up environment
ml purge
ml Anaconda3/2021.05
conda activate af
# conda install numpy=1.23

ulimit -n 2048

# simpleaf configuration
export ALEVIN_FRY_HOME="/home/macnairw/packages/scProcess/data/alevin_fry_home"
simpleaf set-paths

# change working directory to somewhere not crazy
cd $ALEVIN_FRY_HOME

# set up this build
PROJ_DIR="/home/macnairw/packages/scprocess/data"
REF_DIR="$PROJ_DIR/reference_genome/refdata-gex-GRCh38-2020-A"
IDX_DIR="$PROJ_DIR/alevin_fry_home/human_2020-A_splici"

# simpleaf index
simpleaf index \
  --output $IDX_DIR \
  --fasta $REF_DIR/fasta/genome.fa \
  --gtf $REF_DIR/genes/genes.gtf \
  --rlen 91 \
  --threads 24 \
  --use-piscem

conda deactivate
wmacnair commented 11 months ago

Ok interesting... What if you try all of the samples there, in one run? I think that's what MaruΕ‘a did when it failed for her.


rob-p commented 11 months ago

I tried the smaller samples in HL377BGX9 and they processed as well. I am downloading the largest runs HHT7VDSXX now and will try them. If they all work individually, then I will try everything together in a single (giant) run. I should also note that I'm using my own local version of piscem (which should match the version on conda), but if I can't repro then I will switch to the conda executable directly.

--Rob

marusakod commented 11 months ago

Hi @rob-p

Will is right, for sample 10X145-6 I used all fastq files in https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X145-6/, which is 24 files in total. You could try with sample 10X269-2 (https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X269-2_S01_L003-001.fastq.tar), which only has 3 fastq files and gave the same error.

rob-p commented 11 months ago

Thanks @marusakod,

I was unable to reproduce this locally with my native build of piscem (on the original 10X145-6 data); I'm trying now with the conda install. I will also take a look at the 10X269-2 data you suggest above.

--Rob

rob-p commented 11 months ago

So I ran on everything in 10x145-6, and I did so with the version of piscem obtained from conda. Specifically, here is what I have for piscem:

✦ ❯ which piscem
/home/rob/mambaforge/envs/piscem/bin/piscem
✦ ❯ piscem --version
piscem 0.6.0

One thing worth noting is that I don't seem to have 24 files when I grabbed everything in the directory for 10X145-6. Rather, I have 16 files total (8 runs, with a pair of reads for each run). I parsed out the command generated by simpleaf to obtain this list of files:

>>> reads1
['10X145-6_S13_L003_R1_001.fastq.gz', '10X145-6_S13_L004_R1_001.fastq.gz', '10X145-6_S19_L003_R1_001.fastq.gz', '10X145-6_S19_L004_R1_001.fastq.gz', '10X145-6_S10_L001_R1_001.fastq.gz', '10X145-6_S10_L002_R1_001.fastq.gz', '10X145-6_S10_L003_R1_001.fastq.gz', '10X145-6_S10_L004_R1_001.fastq.gz']
>>> reads2
['10X145-6_S13_L003_R2_001.fastq.gz', '10X145-6_S13_L004_R2_001.fastq.gz', '10X145-6_S19_L003_R2_001.fastq.gz', '10X145-6_S19_L004_R2_001.fastq.gz', '10X145-6_S10_L001_R2_001.fastq.gz', '10X145-6_S10_L002_R2_001.fastq.gz', '10X145-6_S10_L003_R2_001.fastq.gz', '10X145-6_S10_L004_R2_001.fastq.gz']

Can you see anything obvious missing from here?
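
(If it helps to double-check on your side, something like this, run from wherever you unpacked the sample, should confirm the R1/R2 counts:)

ls 10X145-6_*_R1_001.fastq.gz | wc -l
ls 10X145-6_*_R2_001.fastq.gz | wc -l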

One other thing worth asking (though I doubt it's an issue) is that these runs where things seem to be failing look to be pretty big (i.e. when using all of the data from 10X145-6 or the files in 10X269-2). As such, there will be a non-trivial amount of output prior to the final count matrix. For example, when running on all of 10X145-6, I get the following:

[image: listing showing the size of the intermediate mapping output for the full 10X145-6 run]

Is there any possibility you're resource constrained in terms of the disk allocated to the output, and that the mapper may be failing because it can't properly write the complete output files?
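
(A quick way to rule that out, using the output path from your original command, is to check both the free space on that filesystem and the size of the intermediate mapping output:)

df -h ./Macnair_2022/output/ms01_alevin_fry/af_WM177
du -sh ./Macnair_2022/output/ms01_alevin_fry/af_WM177/af_map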

--Rob

marusakod commented 11 months ago

Hi @rob-p

sorry, my bad, I was also counting index files. 16 is the correct number of R1 + R2 files for sample 10X145-6. Disk resources shouldn't be a problem. In fact, I had no problems with most of the equally big samples in https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/. Also, the issue appeared with smaller runs (BioSample SAMN13262712 in https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA589018&o=acc_s%3Aa).

MaruΕ‘a

wmacnair commented 11 months ago

Hi @rob-p , working through your comments and suggestions.

piscem version check:

which piscem
# /projects/site/pred/neurogenomics/resources/scprocess_data/conda/af/bin/piscem
piscem --version
# piscem 0.6.0
rob-p commented 11 months ago

Thanks @wmacnair and @marusakod,

The first order of business on my side is really just to be able to reproduce the error.

@marusakod -- the samples here are much smaller, but there are many of them; ~is there a particular file or subset of files that you find triggers the issue?~ [edit: Sorry, I was just being dense; I see that you listed the actual sample ID SAMN13262712 already. I'll check these out!]

So far, I have not been able to trigger this with either my natively built piscem or the piscem from bioconda. My next plan is to try on a different machine to see if I can trigger the issue, though the bioconda build should really be accounting for most meaningful machine-to-machine variability. Once I'm able to find a sample that I can use to trigger the issue, I think diagnosis should not be too hard. The plan is to directly run the offending piscem command (i.e. not through simpleaf), possibly under gdb, to see where things are going awry.
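
(Concretely, the idea is roughly the following, with the index and read paths swapped in from the failing run -- piscem here is the same executable simpleaf invokes, and the flags mirror the logged command above:)

gdb --args piscem map-sc \
    --index /path/to/index/piscem_idx --threads 8 \
    -1 failing_sample_R1.fastq.gz -2 failing_sample_R2.fastq.gz \
    --geometry chromium_v3 -o /path/to/af_map
# then, inside gdb: run, and once it crashes, bt for a backtrace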

--Rob

wmacnair commented 11 months ago

I've checked, and I can share a file with you where it fails for me. What do you recommend for sharing massive files? πŸ˜…

Will

rob-p commented 11 months ago

I often use Box or Google Drive or some such. Whatever works ;P. Once you know where you want to put it, you can e-mail me at rob@cs.umd.edu with the relevant info!

Thanks

wmacnair commented 11 months ago

I'm working on this, but I'm having to configure some things for the first time so could be a little while...

rob-p commented 11 months ago

@marusakod -- I was again not able to reproduce this on SAMN13262712. Of course, I did have to run it twice because I didn't realize it was 10xV2 chemistry, and so the first time I got 0 mappings :).

I'll try again with these samples on another machine. They are easier to work with because they're pretty small. Here was my whole run; let me know if the command looks the same as yours (modulo paths):

> simpleaf quant --reads1 $R1 --reads2 $R2 --threads 24 --index ../piscem_index/index --chemistry 10xv2 --unfiltered-pl --resolution cr-like --expected-ori fw --min-reads 1 --output SAMN13262712_quant
2023-07-25T18:37:05.977749Z  INFO simpleaf::simpleaf_commands::quant: found local t2g file at ../piscem_index/index/t2g_3col.tsv, will attempt to use this since none was provided explicitly
2023-07-25T18:37:05.986416Z  INFO simpleaf::simpleaf_commands::quant: piscem map-sc cmd : /home/rob/mambaforge/envs/piscem/bin/piscem map-sc --index ../piscem_index/index/piscem_idx --threads 24 -o SAMN13262712_quant/af_map -1 SAMN13262712/SRR10433215_1.fastq.gz,SAMN13262712/SRR10433216_1.fastq.gz,SAMN13262712/SRR10433217_1.fastq.gz -2 SAMN13262712/SRR10433215_2.fastq.gz,SAMN13262712/SRR10433216_2.fastq.gz,SAMN13262712/SRR10433217_2.fastq.gz --geometry chromium_v2
2023-07-25T18:41:34.766218Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-07-25T18:41:34.766302Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry generate-permit-list cmd : /home/rob/alevin-fry/target/release/alevin-fry generate-permit-list -i SAMN13262712_quant/af_map -d fw --unfiltered-pl /home/rob/.alevin_fry_home/plist/10x_v2_permit.txt --min-reads 1 -o SAMN13262712_quant/af_quant
2023-07-25T18:41:44.254231Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-07-25T18:41:44.254307Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry collate cmd : /home/rob/alevin-fry/target/release/alevin-fry collate -i SAMN13262712_quant/af_quant -r SAMN13262712_quant/af_map -t 24
2023-07-25T18:41:50.039382Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-07-25T18:41:50.039433Z  INFO simpleaf::simpleaf_commands::quant: cmd : "/home/rob/alevin-fry/target/release/alevin-fry" "quant" "-i" "SAMN13262712_quant/af_quant" "-o" "SAMN13262712_quant/af_quant" "-t" "24" "-m" "../piscem_index/index/t2g_3col.tsv" "-r" "cr-like"
2023-07-25T18:42:04.433691Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
rob-p commented 11 months ago

Ok, on a different machine I have found success (in failing :P). I can reproduce the segfault on this sample.

The backtrace is, right now, unfortunately, not super useful. Interestingly, it seems to suggest an illegal instruction! I'm not sure if this is the same cause you are encountering or not.

Program received signal SIGILL, Illegal instruction.
0x00005555556d83ea in char const* fmt::v9::detail::parse_replacement_field<char, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&>(char const*, char const*, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) bt
#0  0x00005555556d83ea in char const* fmt::v9::detail::parse_replacement_field<char, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&>(char const*, char const*, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&) ()
#1  0x00005555556d8e6e in void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) ()
#2  0x00005555556e63d9 in mindex::reference_index::reference_index(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
#3  0x000055555575fb44 in run_pesc_sc ()
#4  0x00005555556264ff in piscem::main::hf73a101ed4e5a128 ()
#5  0x0000555555643523 in std::sys_common::backtrace::__rust_begin_short_backtrace::h0759eaefd4640ac1 ()
#6  0x000055555563fc7d in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h5773aadca783a79d ()
#7  0x0000555555666bdb in std::rt::lang_start_internal::h078acd489417d3c1 ()
#8  0x00005555556331a5 in main ()
(gdb)
wmacnair commented 11 months ago

Great! Nice to be celebrating a failure :D

I think we get "SIGSEGV". But at least we're getting something similar on the same sample...

rob-p commented 11 months ago

So I get a segfault when I run it normally; the illegal instruction only shows up when running under GDB.

Interestingly, it is machine-dependent, and the machine it fails on is a much older one. I wonder if it may actually be an illegal instruction. Are all of your jobs scheduled on the same node? Can you provide the output of the following?

cat /proc/cpuinfo

There is a minimal requirement, I believe, to have bit-manipulation instructions (and some other SSE instructions), given the way it's compiled.
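
(Something like this, run on the node where the job actually lands, would tell us quickly -- it prints bmi2 if the flag is advertised and nothing otherwise:)

grep -m1 -ow bmi2 /proc/cpuinfo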

rob-p commented 11 months ago

doesn't work

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping        : 5
microcode       : 0x1d
cpu MHz         : 2666.000
cache size      : 8192 KB
physical id     : 1
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 23
initial apicid  : 23
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ssbd rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida spec_ctrl intel_stibp flush_l1d
bogomips        : 5332.56
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

works

processor       : 31
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 1
model name      : AMD EPYC 7313 16-Core Processor
stepping        : 1
microcode       : 0xa0011ce
cpu MHz         : 2994.424
cache size      : 512 KB
physical id     : 1
siblings        : 16
core id         : 15
cpu cores       : 16
apicid          : 31
initial apicid  : 31
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 5976.06
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

works

processor       : 87
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
stepping        : 1
microcode       : 0xb000040
cpu MHz         : 1200.400
cache size      : 56320 KB
physical id     : 1
siblings        : 44
core id         : 28
cpu cores       : 22
apicid          : 121
initial apicid  : 121
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit mmio_stale_data
bogomips        : 4401.78
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

In particular, here is what I get if I take the intersection of the instruction sets of the machines where everything works, minus the instruction set of the machine where it doesn't:

{'pcid', '3dnowprefetch', 'cpuid', 'cqm_mbm_total', 'abm', 'invpcid', 'arat', 'bmi1', 'cqm_occup_llc', 'popcnt', 'aes', 'x2apic', 'invpcid_single', 'cqm_llc', 'fma', 'xsave', 'smep', 'cdp_l3', 'pdpe1gb', 'avx2', 'cat_l3', 'fsgsbase', 'smap', 'cqm_mbm_local', 'movbe', 'avx', 'f16c', 'bmi2', 'rdseed', 'pclmulqdq', 'cqm', 'rdt_a', 'adx'}

If I look across those, I am fairly certain that we don't enable AVX/AVX2 when we build on conda, but we do enable BMI2. Now, there exists a flag in the build to disable this, but it's a pretty widespread instruction set and important for bit-operation speed. Intel introduced BMI2 in its (now very old) Haswell line of machines, and AMD adopted it soon after. If the machine where you are seeing this behavior is missing it, that could be a cause. I'd prefer not to disable BMI2 in the conda build, so we'd have to see what other options exist. Of course, we'd need to see if this is, in fact, the issue first.
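
(For reference, roughly how I computed that set -- this assumes the relevant /proc/cpuinfo dumps are saved to local files, and the file names are just placeholders:)

flags() { grep -m1 '^flags' "$1" | tr ' ' '\n' | sort -u; }
comm -12 <(flags works_amd.txt) <(flags works_intel.txt) > works_common.txt  # flags common to the working machines
comm -23 works_common.txt <(flags fails.txt)                                 # ...minus the flags on the failing machine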

--Rob

wmacnair commented 11 months ago

Hi @rob-p, this is way too close to the metal for me to contribute anything helpful, so I have asked our internal HPC guys if they can comment. I'll let you know what they feed back :)

Will

wmacnair commented 11 months ago

I've also just shared a gDrive folder with your UoM email, called simpleaf_fastqs. This is an additional sample that didn't work for us. Hopefully you've got this - let me know if not :)

rob-p commented 11 months ago

I got the link. I will try this on both machines above and see if the issue is the same. Btw, what's the chemistry on this sample?

--Rob

wmacnair commented 11 months ago

Great :)

Chemistry is 10xV3. The call I used had all the same options as in my first comment here:

# simpleaf quantification
simpleaf quant \
  --reads1 $R1_fs \
  --reads2 $R2_fs \
  --threads 16 \
  --index $ALEVIN_FRY_HOME/human_2020-A_splici/index \
  --chemistry 10xv3 --resolution cr-like \
  --expected-ori fw \
  --t2g-map $ALEVIN_FRY_HOME/human_2020-A_splici/index/t2g_3col.tsv \
  --unfiltered-pl $CELLRANGER_DIR/3M-february-2018.txt --min-reads 1 \
  --output ./Macnair_2022/output/ms01_alevin_fry/af_WM177
wmacnair commented 11 months ago

Response from an HPC colleague:

Hi Will, all our CPU nodes should be Cascade Lake, which is newer than Haswell. And we do have the BMI2 flag in the interactive node CPUs, for example.

pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

But, Anaconda is different than miniconda - maybe that is the difference.

I will try to reproduce the error

Best, Erica

rob-p commented 11 months ago

Thanks, @wmacnair! By the way, in the spirit of full transparency and reproducibility, the build I had that worked on the BMI2 node but failed on the non-BMI2 node was installed using micromamba, because I've lost patience with both conda and miniconda ;P. However, the underlying channels and package versions were the same.

--Rob

rob-p commented 11 months ago

Ok, some more brief updates. Apart from the machine that exhibits the illegal instruction, I've been unable to trigger a segfault. I tried the original dataset you pointed me at, as well as the one @marusakod suggested SAMN13262712, and also the one you shared via Google Drive @wmacnair.

I tried on a machine from 2016 (Intel Xeon), one from 2023 (AMD Epyc), and two macOS machines (a 2018 Intel Mac and a 2020 Intel Mac). On none of these machines was I able to produce the segfault on these data using the simpleaf and piscem installed via micromamba.

It would be great if your HPC folks had any extra insight or luck reproducing the issue. Otherwise, the next thing I would ask is if you could provide some extra details about the specific OS distribution and version you are running on. My next step would be to try to more closely reproduce your environment by spinning up a Docker container with that precise OS and then installing these tools via conda to see if something about the specific library versions and setup of that distribution and version help trigger the issue.

--Rob

wmacnair commented 11 months ago

Thanks for this. Slightly frustrating that we have the error and you don't but I guess that's reassuring for simpleaf / alevin - hopefully this is a pretty rare case, and due to weird conda installation peculiarities.

I've passed on your most recent comment to Erica, and I'll let you know what I hear back.

Will

wmacnair commented 11 months ago

Erica also doesn't find any errors...

We are wondering whether it could be a problem with the index: I ended up generating the index locally, because piscem created 100ks of temp files when indexing on our cluster, so I had to kill it. So the index was generated on my laptop, then I copied it over to the cluster.

Does it sound like this could be an explanation / part of an explanation?

Will

rob-p commented 11 months ago

Hi @wmacnair,

Thanks for the updates. I also agree that it's frustrating, since it seems that you and @marusakod are able to reproduce the issue deterministically on some samples, so it doesn't seem to be, e.g., a stochastic thread-related issue.

Though the index should be broadly compatible between machines (I built one index on the old Intel machine and used it on all of those I listed above), it is possible that the index could be at fault. Are you sure the index completed successfully? Also, if there are different data layouts between the systems, that could perhaps be problematic.

What if Erica tries using your index? If you want to put it on Google drive, I could also try running with your index.

Best, Rob

wmacnair commented 11 months ago

We're currently trying out different combinations - I ran with her conda env and my index. I'm next going to try my conda env and her index.

And I thiiiink the index completed successfully, but I'm not completely sure. If this second test works ok, then it looks like it is the index that is at fault.

rob-p commented 11 months ago

Hi @wmacnair, please do keep me updated here! I'd be really interested to know if it was related to the index either being (a) incomplete or (b) built on a different machine. As I noted, the index should be transferable between most machines (barring things like big-endian to little-endian transfer). If you suspect (a), then I can take a look at the actual index and see if there is any evidence of that from the files themselves or the log. In the piscem build, we check the return codes of all sub-programs we run (the cuttlefish2 and piscem-cpp builds) and should bail if any of them fail. However, it would certainly be nice to have some sort of explicit "finished" marker that would prevent trying to use an incomplete index in downstream steps.
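
(As a stopgap, that kind of marker is easy to add on the user side -- e.g. only stamping a DONE file if simpleaf index exits cleanly, and refusing to quantify without it; the marker file name below is just an example:)

simpleaf index --output "$IDX_DIR" --fasta "$REF_DIR/fasta/genome.fa" --gtf "$REF_DIR/genes/genes.gtf" --rlen 91 --threads 24 --use-piscem && touch "$IDX_DIR/INDEX_DONE"
# later, before running simpleaf quant:
[ -f "$IDX_DIR/INDEX_DONE" ] || { echo "index looks incomplete; not quantifying" >&2; exit 1; }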

Best, Rob

wmacnair commented 11 months ago

Hi @rob-p, ok that's good to know.

Frustratingly for resolving this issue, right now I have other things that I need to work on, and Monday and Tuesday I'm on vacation.

But I have put this on my to-do list, and hopefully I'll be able to get to it end of next week :)

Cheers Will

rob-p commented 11 months ago

Hi @wmacnair,

I hope you had a nice vacation. I just wanted to freshen this up on your list so that I know if I should be investigating further or closing this issue. Just let me know when you have a chance to get around to it.

Best, Rob

wmacnair commented 11 months ago

Hi @rob-p

I'm currently having a look through where I had previously got to. There have been multiple different indices / conda envs, and I worry that I have got confused somewhere πŸ˜… So feel free to make concrete suggestions on tests that I can do.

When I last looked, some tests we looked at were:

This suggested that the problem was with my index. However today I rebuilt the index and ran using this, and got the same failure.

So could it be that indices created with my conda env are faulty??

These are the relevant lines from the conda env export yaml:

channels:
  - bioconda
  - conda-forge
  - main
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - alevin-fry=0.8.2=h4ac6f70_0
  - boost-cpp=1.78.0=h5adbc97_2
  - bzip2=1.0.8=h7f98852_4
  - icu=70.1=h27087fc_0
  - libgcc-ng=13.1.0=he5830b7_0
  - libgomp=13.1.0=he5830b7_0
  - libhwloc=2.9.1=hd6dc26d_0
  - libiconv=1.17=h166bdaf_0
  - libjemalloc=5.3.0=hcb278e6_0
  - libstdcxx-ng=13.1.0=hfd8a6a1_0
  - libxml2=2.10.3=hca2bb57_4
  - libzlib=1.2.13=hd590300_5
  - piscem=0.6.0=h09b9a2f_2
  - salmon=1.10.2=hecfa306_0
  - simpleaf=0.14.1=h4ac6f70_0
  - tbb=2021.9.0=hf52228f_0
  - xz=5.2.6=h166bdaf_0
  - zstd=1.5.2=hfc55251_7

Does anything there look strange?

I'll keep looking, so please keep this open for now. Hopefully we can resolve it by mid next week πŸ’ͺ

Will

rob-p commented 11 months ago

Hi @wmacnair,

I can check the detail of my conda environments as well and report back here. However, in the meantime, would it be possible for you to share your actual index? Inspecting that may actually help to point at what could be going wrong at the most fine-grained technical level.

Here is what I have in my micromamba environment for simpleaf:

List of packages in environment: "/fs/cbcb-lab/rob/rob/envs/simpleaf"

  Name           Version   Build        Channel
─────────────────────────────────────────────────────
  _libgcc_mutex  0.1       conda_forge  conda-forge
  _openmp_mutex  4.5       2_gnu        conda-forge
  alevin-fry     0.8.2     h4ac6f70_0   bioconda
  boost-cpp      1.78.0    h5adbc97_2   conda-forge
  bzip2          1.0.8     h7b6447c_0
  icu            70.1      h27087fc_0   conda-forge
  libgcc-ng      13.1.0    he5830b7_0   conda-forge
  libgomp        13.1.0    he5830b7_0   conda-forge
  libhwloc       2.9.1     hd6dc26d_0   conda-forge
  libiconv       1.17      h166bdaf_0   conda-forge
  libjemalloc    5.3.0     hcb278e6_0   conda-forge
  libstdcxx-ng   13.1.0    hfd8a6a1_0   conda-forge
  libxml2        2.10.3    hca2bb57_4   conda-forge
  libzlib        1.2.13    hd590300_5   conda-forge
  lz4-c          1.9.4     h6a678d5_0
  piscem         0.6.0     h09b9a2f_2   bioconda
  salmon         1.10.2    hecfa306_0   bioconda
  simpleaf       0.14.1    h4ac6f70_0   bioconda
  tbb            2021.9.0  hf52228f_0   conda-forge
  xz             5.4.2     h5eee18b_0
  zlib           1.2.13    hd590300_5   conda-forge
  zstd           1.5.5     hc292b87_0

Thanks! Rob

wmacnair commented 11 months ago

Thanks for this @rob-p. I've compared my and your packages, and there are a few discrepancies:

       user package version      build
1:     robp   bzip2   1.0.8 h7b6447c_0
2: wmacnair   bzip2   1.0.8 h7f98852_4
3:     robp   lz4-c   1.9.4 h6a678d5_0
4:     robp      xz   5.4.2 h5eee18b_0
5: wmacnair      xz   5.2.6 h166bdaf_0
6:     robp    zlib  1.2.13 hd590300_5
7:     robp    zstd   1.5.5 hc292b87_0
8: wmacnair    zstd   1.5.2 hfc55251_7

Do any of these stand out?

I should have shared a file called human_2020-A_splici_buggy.zip with you via gDrive - let me know if that hasn't worked.

And let me know if you have any other suggestions!

Cheers Will

rob-p commented 11 months ago

I got the file link. I'll take a look at that and report back here!

Thanks, Rob

rob-p commented 11 months ago

Hi @wmacnair,

So we have some real progress now! I can reproduce the segfault on a machine where I previously could not, when using your index. I haven't debugged further yet, but this can be investigated now. One thing I noticed when looking at your index folder (not the cause of the segfault, I believe) is that there seem to be files related to both a piscem index and a salmon index! Was there some attempt to build both of the indices in the same output folder?

In terms of the indices themselves, I'll note that your buggy index is ever so slightly smaller than my working index (and, I suspect, than your working index):

✦4 ❯ wc -c wills_index/human_2020-A_splici_buggy/index/piscem_idx.*
 957723916 wills_index/human_2020-A_splici_buggy/index/piscem_idx.ctab
 476412746 wills_index/human_2020-A_splici_buggy/index/piscem_idx.ectab
   6583967 wills_index/human_2020-A_splici_buggy/index/piscem_idx.refinfo
1888162004 wills_index/human_2020-A_splici_buggy/index/piscem_idx.sshash
3328882633 total

✦4 ❯ wc -c piscem_index/index/piscem_idx.*
 957740300 piscem_index/index/piscem_idx.ctab
 476405278 piscem_index/index/piscem_idx.ectab
   6617391 piscem_index/index/piscem_idx.refinfo
1888162004 piscem_index/index/piscem_idx.sshash
3328924973 total

This suggests to me that, for some reason, either the index construction was bugged, or the full index did not write to disk properly. It seems that the sshash component is exactly the same size, so the problem is to do with the contig table (ctab) and the reference information (refinfo).

Edit: It's not clear that the size difference in refinfo is meaningful. It's due to a slight difference in the naming of intronic transcripts between the version of simpleaf you have (which uses pyroe) and the latest version with which I built the reference (which uses roers). I believe the contig tables (ctab) should be the same size regardless, but I've not yet investigated that.

Edit again: Ok, so now I'm actually curious ;). Since you are using the latest simpleaf, the reference should have been generated with roers rather than with pyroe. We moved from pyroe (a Python package based on pyranges) to roers (a Rust tool of our lab's own making, based on pola.rs) because pyranges frequently introduced regressions and necessitated pinning many other Python packages. When using simpleaf 0.14.1 (and 0.14) the reference sequence should have been generated using roers rather than pyroe, yet yours seems to have been generated with the latter, despite you certainly having the latest version of simpleaf. This is even stranger as pyroe appears not to even be in your conda environment. I'm not at all sure that this is the cause of the problem since, as I said above, the reference sequences themselves seem equivalent, but it certainly does make me suspicious as to how this happened. Any ideas?

Thanks, Rob

wmacnair commented 11 months ago

Hi Rob

Well at least we've found a culprit! I just shared another index with you, that I freshly built last week. I think we still get errors with this - would you mind checking that too?

Thanks Will

rob-p commented 11 months ago

Hi Will,

Thanks! I got the new file. At first look, it's certainly cleaner (e.g. there's no evidence of a partial salmon index, just the piscem index). However, it still seems the reference was created with pyroe. You can see this in the simpleaf_index_log.json file, where there is a pyroe command rather than a roers command. Again, this isn't by itself a problem, but I don't understand why, with the latest simpleaf tool, pyroe is being used to create the reference rather than roers. Does this happen with a clean install of the latest version?

Update: @wmacnair: Regarding this specific index, I verified that I also get the segfault when using it. I also verified that the contig table (the .ctab file) is, like the other index you sent, smaller than the one in my (working) index. I don't know precisely why this is happening, but it's my current best guess as to the root of the problem. I believe that the contig tables should be the same size. I suspect there are a few missing entries in the contig table in your non-working indices, which leads to a problem if and only if reads map to where those contigs should appear. I still feel this may be related to the fact that, despite having the latest version of simpleaf, your construction still seems to be using pyroe for reference building; it suggests to me there may be some version mismatches or multiple versions of some tools present in the environment, so I'd still like to get to the bottom of that somehow.

Best, Rob

wmacnair commented 11 months ago

Hi @rob-p

Pretty sure this is a PB in terms of GitHub threads for me now πŸ˜„

I have tried a fresh simpleaf conda environment and rebuilt the indices with that (I also confirmed that the simpleaf_index_log.json file has roers calls but nothing referring to pyroe).
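
(In case it's useful to anyone else hitting this, I just grepped the log for which reference builder was invoked -- the path here assumes the log sits in the index output directory:)

grep -ow -e roers -e pyroe "$IDX_DIR/simpleaf_index_log.json" | sort | uniq -c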

Now the previous files that used to give segfaults do not, but in running it on other datasets we still find some that do.

I have shared a folder with you (called share_with_robp) which contains:

(please drop me a line if any of this is unclear)

I'm very curious to know whether you also get a segfault here...

Best Will

rob-p commented 11 months ago

Thanks @wmacnair!

I appreciate all of the help! I got the files you shared. I'll let you know if I get the segfault and if I can also reproduce it with my own index or not.

Thanks, Rob

rob-p commented 11 months ago

Ok, this new index looks almost identical to the one I built, so I suspect I will see the problems with my index as well. I am moving the files to the server to check and will let you know soon.

Edit: Indeed! Success in failure. Your latest reads fail with my index as well. Now we have a fully reproducible example!

rob-p commented 11 months ago

Hi @wmacnair,

Ok. Good news. I was able to track this down and find the offending read and logic. It resulted from (as expected) an incredibly rare occurrence in which it was possible to mistakenly try to extend a failed k-mer search to find the next k-mer. In this particular case that search succeeded, but because the predecessor failed, some critical information that the index needed was missing. Ultimately, this resulted in subtracting 1 from our sentinel value (std::numeric_limits<uint64_t>::max()) and then trying to use that to index into the index ... which obviously was out of bounds. I've fixed this upstream in piscem-cpp on the dev branch, and added a bit more defensive code around this logic to try to guard against such tricky edge cases in the future. I'm doing a bit of traveling tomorrow, so I'm not sure if I'll have a chance to cut and push the new piscem release, but I'll let you know when I do. The other good news is that the bug was only in piscem (so you'll only have to pull down the bumped version of that) and it was totally in the query logic and not in the index building (so you won't have to rebuild the index). I'll keep you posted, and I'm happy to answer any other questions you might have!

Best, Rob

wmacnair commented 11 months ago

That sounds great! I will also be unavailable next week, but I think @marusakod will be interested in any bug-fixed version.

A comment here - in our experience, I would say that this was rare, but not incredibly rare. Across data from multiple studies, failure rates for us ranged between 2 out of 150, and 5+ out of 34 samples. (As I'm unfamiliar with k-mers, I didn't really get the reason for the failure, so can't really speculate on why this might be.)

Thanks so much for all your efforts to look into this :) Will

rob-p commented 11 months ago

Hi @wmacnair,

Thanks for the heads up. I just meant incredibly rare in the space of k-mers. If we assume that each dataset comprises billions (or tens of billions) of k-mers, this corner case occurs very infrequently. Of course, the point is that it was a small but important missing case in the logic of how streaming query was handled, so it was great that you found it and a dataset to reliably reproduce it. I'll ping back here, and also tag @marusakod, when the new piscem is up on bioconda.

Best, Rob

rob-p commented 11 months ago

Hi @marusakod,

The bioconda process is en route, but it takes a while. In the meantime, if you would like to check with a pre-compiled piscem executable that the fix works on your end, I've created pre-compiled executables for both linux and aarch64 OSX on the piscem releases page. The latest release (v0.6.1) should have the fix.
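
(Once the v0.6.1 binary is on your PATH, or simpleaf is pointed at it, a quick sanity check is to re-run the mapping step that used to crash -- e.g. reusing the index and read variables from Will's original command earlier in this thread:)

piscem --version   # expect: piscem 0.6.1
piscem map-sc \
  --index $ALEVIN_FRY_HOME/human_2020-A_splici/index/piscem_idx \
  --threads 16 \
  -1 $R1_fs -2 $R2_fs \
  --geometry chromium_v3 \
  -o ./Macnair_2022/output/ms01_alevin_fry/af_WM177/af_map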

Best, Rob