Closed wmacnair closed 1 year ago
Hi @wmacnair,
Thanks so much for the detailed report, and for your kind words regarding simpleaf. This is exactly the type of effect we hoped it would have :).
So, the error looks to arise from piscem-cpp (the mapper, and one of the few components we have not yet fully migrated over to Rust, though @TheJasonFan is working on rectifying that). What's interesting is that it appears to be a segfault, but it shows up after the mapping is finished (we get to see how many reads were processed, how many had mappings, etc.).
We'd be happy to take a look, figure out what is going on, and try and fix this. A couple of quick questions. Is it always the same samples that fail? If you run the sample again, does it fail? Would it be possible to share one of the failing samples privately just for the purposes of reproducing and fixing this issue? If not, would it be possible to see if there is a small subset of the sample on which you can reproduce the issue and share that? Typically, with these types of things it's usually some particular aspect of the data that triggers a corner case in the code that "typical" data doesn't, leading to the segfault (likely memory error).
Thanks! Rob
Hi @rob-p
Developing methods can sometimes be a thankless task, so I try to give some encouragement when I can :)
Re your questions:
These samples are not yet published, so sharing is a bit more tricky, but @marusakod has had the same error (again, repeatedly) on the BioSample SAMN13262712 in this repo: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA589018&o=acc_s%3Aa There should be 3 sets (R1 + R2) of fastq files. (Thanks @marusakod!)
Hopefully that is enough to start with. I'll also have a look at subsampling my files and see if I can reproduce the error.
Cheers! Will
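For the subsampling idea, here is a minimal sketch (file names are invented placeholders) of taking the first N records from a gzipped FASTQ; running it on each R1/R2 pair with the same N keeps the mates in sync:

```python
import gzip
import itertools
import os
import tempfile

def head_fastq(src, dst, n_reads):
    """Copy the first n_reads FASTQ records (4 lines each) from src to dst."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in itertools.islice(fin, n_reads * 4):
            fout.write(line)

# Tiny demonstration on a synthetic file; for real data you would run this
# on each R1/R2 pair with the same n_reads so the pairing is preserved.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "toy_R1.fastq.gz")
with gzip.open(src, "wt") as f:
    for i in range(10):
        f.write(f"@read{i}\nACGT\n+\nFFFF\n")

dst = os.path.join(tmp, "subset_R1.fastq.gz")
head_fastq(src, dst, n_reads=3)
with gzip.open(dst, "rt") as f:
    print(sum(1 for _ in f))  # 12 lines = 3 records
```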
Ah, Maruša suggests that sample 10X145-6 here might be easier to work with, as that already has R1 and R2 files labelled: https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X145-6/
Thanks! I'll give this a look and report back when I have some idea of what's up.
In the meantime, could you also please share the command and files you used to generate the index/reference?
Thanks, Rob
@wmacnair & @marusakod,
I grabbed one of the samples linked above (10X145-6_S19_L003), and everything completed successfully on that sample. I noted, however, that the link has several samples. I will start looking at the others, but if you could point me at a particular problem sample, that would be extra useful!
Thanks, Rob
I used the 10x reference genome, here.
These are the commands I used:
#!/bin/bash
#BSUB -J simpleaf_build_index_human
#BSUB -W 24:00
#BSUB -n 24
#BSUB -R "rusage[mem=8192]"
#BSUB -R "span[hosts=1]"
#BSUB -q long
#BSUB -eo /home/macnairw/packages/scProcess/.log/simpleaf_build_index_human.err
#BSUB -oo /home/macnairw/packages/scProcess/.log/simpleaf_build_index_human.out
# how to call:
# bsub < /home/macnairw/packages/scProcess/scripts/simpleaf_build_index_human.sh
# set up environment
ml purge
ml Anaconda3/2021.05
conda activate af
# conda install numpy=1.23
ulimit -n 2048
# simpleaf configuration
export ALEVIN_FRY_HOME="/home/macnairw/packages/scProcess/data/alevin_fry_home"
simpleaf set-paths
# change working directory to somewhere not crazy
cd $ALEVIN_FRY_HOME
# set up this build
PROJ_DIR="/home/macnairw/packages/scprocess/data"
REF_DIR="$PROJ_DIR/reference_genome/refdata-gex-GRCh38-2020-A"
IDX_DIR="$PROJ_DIR/alevin_fry_home/human_2020-A_splici"
# simpleaf index
simpleaf index \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf \
--rlen 91 \
--threads 24 \
--use-piscem
conda deactivate
Ok interesting... What if you try all of the samples there, in one run? I think that's what Maruša did when it failed for her.
I tried the smaller samples in HL377BGX9 and they processed as well. I am downloading the largest runs HHT7VDSXX now and will try them. If they all work individually, then I will try everything together in a single (giant) run. I should also note that I'm using my own local version of piscem (which should match the version on conda), but if I can't repro then I will switch to the conda executable directly.
--Rob
Hi @rob-p
Will is right, for sample 10X145-6 I used all fastq files in https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X145-6/, which is 24 files in total. You could try sample 10X269-2 (https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/10X269-2_S01_L003-001.fastq.tar), which only has 3 fastq files and gave the same error.
Thanks @marusakod,
I was unable to reproduce this locally with my native build of piscem (on the original 10X145-6 data), so I'm trying now with the conda install. I will also take a look at the 10X269-2 data you suggest above.
--Rob
So I ran on everything in 10X145-6, and I did so with the version of piscem obtained from conda. Specifically, here is what I have for piscem:
$ which piscem
/home/rob/mambaforge/envs/piscem/bin/piscem
$ piscem --version
piscem 0.6.0
One thing worth noting is that I don't seem to have 24 files when I grabbed everything in the directory for 10X145-6. Rather, I have 16 files total (8 runs, with a pair of reads for each run). I parsed out the command generated by simpleaf to obtain this list of files:
>>> reads1
['10X145-6_S13_L003_R1_001.fastq.gz', '10X145-6_S13_L004_R1_001.fastq.gz', '10X145-6_S19_L003_R1_001.fastq.gz', '10X145-6_S19_L004_R1_001.fastq.gz', '10X145-6_S10_L001_R1_001.fastq.gz', '10X145-6_S10_L002_R1_001.fastq.gz', '10X145-6_S10_L003_R1_001.fastq.gz', '10X145-6_S10_L004_R1_001.fastq.gz']
>>> reads2
['10X145-6_S13_L003_R2_001.fastq.gz', '10X145-6_S13_L004_R2_001.fastq.gz', '10X145-6_S19_L003_R2_001.fastq.gz', '10X145-6_S19_L004_R2_001.fastq.gz', '10X145-6_S10_L001_R2_001.fastq.gz', '10X145-6_S10_L002_R2_001.fastq.gz', '10X145-6_S10_L003_R2_001.fastq.gz', '10X145-6_S10_L004_R2_001.fastq.gz']
Can you see anything obvious missing from here?
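One quick sanity check for this kind of question is to verify that every R1 file has a matching R2 mate. A sketch (the file list below is illustrative, with one mate deliberately omitted to show what a gap would look like):

```python
# Illustrative file list following the naming scheme above; the S10/L001
# R2 mate is deliberately left out to show what a gap would look like.
files = [
    "10X145-6_S13_L003_R1_001.fastq.gz", "10X145-6_S13_L003_R2_001.fastq.gz",
    "10X145-6_S13_L004_R1_001.fastq.gz", "10X145-6_S13_L004_R2_001.fastq.gz",
    "10X145-6_S10_L001_R1_001.fastq.gz",
]

r1 = {f for f in files if "_R1_" in f}
r2 = {f for f in files if "_R2_" in f}

# An R1 file's expected mate is the same name with _R1_ replaced by _R2_.
missing_mates = sorted(f for f in r1 if f.replace("_R1_", "_R2_") not in r2)
print(missing_mates)  # ['10X145-6_S10_L001_R1_001.fastq.gz']
```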
One other thing worth asking, though I doubt it's an issue: these runs where things seem to be failing look to be pretty big (i.e. when using all of the data from 10X145-6, or the files in 10X269-2). As such, there will be a non-trivial amount of output prior to the final count matrix. For example, when running on all of 10X145-6, I get the following:
Is there any possibility you're resource constrained in terms of the disk allocated to the output, and that the mapper may be failing because it can't properly write the complete output files?
--Rob
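One way to rule out the disk-space concern is to check the free space on the output volume before the run. A sketch (the path and the threshold are placeholders, not values from this thread):

```python
import shutil

# Free space on the volume holding the (placeholder) output directory.
usage = shutil.disk_usage("/tmp")
free_gib = usage.free / 2**30
print(f"{free_gib:.1f} GiB free")

# Intermediate mapping output for a large 10x run can be substantial, so a
# simple guard before mapping might look like this (threshold is a guess):
enough = usage.free > 100 * 2**30
```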
Hi @rob-p
sorry, my bad, I was also counting index files. 16 is the correct number of R1 + R2 files for sample 10X145-6. Disk resources shouldn't be a problem; in fact, I had no problems with most of the equally big samples in https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v3/human/raw/.
Also, the issue appeared with smaller runs (BioSample SAMN13262712 in https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA589018&o=acc_s%3Aa).
Maruša
Hi @rob-p , working through your comments and suggestions.
piscem version check:
which piscem
# /projects/site/pred/neurogenomics/resources/scprocess_data/conda/af/bin/piscem
piscem --version
# piscem 0.6.0
Thanks @wmacnair and @marusakod,
The first order of business on my side is really just to be able to reproduce the error.
@marusakod - the samples here are much smaller, but there are many of them; ~is there a particular file or subset of files that you find triggers the issue?~ [edit: Sorry, I was just being dense, I see that you listed the actual sample id SAMN13262712 already. I'll check these out!]
So far, I have not been able to trigger this with either my natively built piscem or the piscem from bioconda. My next plan is to try on a different machine to see if I can trigger the issue, though the bioconda build should really account for most meaningful machine-to-machine variability. Once I'm able to find a sample that I can use to trigger the issue, I think diagnosis should not be too hard. The plan is to directly run the offending piscem command (i.e. not through simpleaf), possibly under gdb, to see where things are going awry.
--Rob
I've checked, and I can share a file with you where it fails for me. What do you recommend for sharing massive files?
Will
I often use Box or Google Drive or some such. Whatever works ;P. Once you know where you want to put it, you can e-mail me at rob@cs.umd.edu with the relevant info!
Thanks
I'm working on this, but I'm having to configure some things for the first time so could be a little while...
@marusakod - I was again not able to reproduce this on SAMN13262712. Of course, I did have to run it twice, because I didn't realize it was 10xV2 chemistry and so the first time I got 0 mappings :).
I'll try again with these samples on another machine. They are easier to work with because they're pretty small. Here was my whole run; let me know if the command looks the same as yours (modulo paths):
> simpleaf quant --reads1 $R1 --reads2 $R2 --threads 24 --index ../piscem_index/index --chemistry 10xv2 --unfiltered-pl --resolution cr-like --expected-ori fw --min-reads 1 --output SAMN13262712_quant
2023-07-25T18:37:05.977749Z INFO simpleaf::simpleaf_commands::quant: found local t2g file at ../piscem_index/index/t2g_3col.tsv, will attempt to use this since none was provided explicitly
2023-07-25T18:37:05.986416Z INFO simpleaf::simpleaf_commands::quant: piscem map-sc cmd : /home/rob/mambaforge/envs/piscem/bin/piscem map-sc --index ../piscem_index/index/piscem_idx --threads 24 -o SAMN13262712_quant/af_map -1 SAMN13262712/SRR10433215_1.fastq.gz,SAMN13262712/SRR10433216_1.fastq.gz,SAMN13262712/SRR10433217_1.fastq.gz -2 SAMN13262712/SRR10433215_2.fastq.gz,SAMN13262712/SRR10433216_2.fastq.gz,SAMN13262712/SRR10433217_2.fastq.gz --geometry chromium_v2
2023-07-25T18:41:34.766218Z INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-07-25T18:41:34.766302Z INFO simpleaf::simpleaf_commands::quant: alevin-fry generate-permit-list cmd : /home/rob/alevin-fry/target/release/alevin-fry generate-permit-list -i SAMN13262712_quant/af_map -d fw --unfiltered-pl /home/rob/.alevin_fry_home/plist/10x_v2_permit.txt --min-reads 1 -o SAMN13262712_quant/af_quant
2023-07-25T18:41:44.254231Z INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-07-25T18:41:44.254307Z INFO simpleaf::simpleaf_commands::quant: alevin-fry collate cmd : /home/rob/alevin-fry/target/release/alevin-fry collate -i SAMN13262712_quant/af_quant -r SAMN13262712_quant/af_map -t 24
2023-07-25T18:41:50.039382Z INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2023-07-25T18:41:50.039433Z INFO simpleaf::simpleaf_commands::quant: cmd : "/home/rob/alevin-fry/target/release/alevin-fry" "quant" "-i" "SAMN13262712_quant/af_quant" "-o" "SAMN13262712_quant/af_quant" "-t" "24" "-m" "../piscem_index/index/t2g_3col.tsv" "-r" "cr-like"
2023-07-25T18:42:04.433691Z INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
Ok, on a different machine I have found success (in failing :P). I can reproduce the segfault on this sample.
The backtrace is, right now, unfortunately, not super useful. Interestingly, it seems to suggest an illegal instruction! I'm not sure if this is the same cause you are encountering or not.
Program received signal SIGILL, Illegal instruction.
0x00005555556d83ea in char const* fmt::v9::detail::parse_replacement_field<char, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&>(char const*, char const*, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
(gdb) bt
#0 0x00005555556d83ea in char const* fmt::v9::detail::parse_replacement_field<char, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&>(char const*, char const*, void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref)::format_handler&) ()
#1 0x00005555556d8e6e in void fmt::v9::detail::vformat_to<char>(fmt::v9::detail::buffer<char>&, fmt::v9::basic_string_view<char>, fmt::v9::basic_format_args<fmt::v9::basic_format_context<std::conditional<std::is_same<fmt::v9::type_identity<char>::type, char>::value, fmt::v9::appender, std::back_insert_iterator<fmt::v9::detail::buffer<fmt::v9::type_identity<char>::type> > >::type, fmt::v9::type_identity<char>::type> >, fmt::v9::detail::locale_ref) ()
#2 0x00005555556e63d9 in mindex::reference_index::reference_index(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) ()
#3 0x000055555575fb44 in run_pesc_sc ()
#4 0x00005555556264ff in piscem::main::hf73a101ed4e5a128 ()
#5 0x0000555555643523 in std::sys_common::backtrace::__rust_begin_short_backtrace::h0759eaefd4640ac1 ()
#6 0x000055555563fc7d in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h5773aadca783a79d ()
#7 0x0000555555666bdb in std::rt::lang_start_internal::h078acd489417d3c1 ()
#8 0x00005555556331a5 in main ()
(gdb)
Great! Nice to be celebrating a failure :D
I think we get "SIGSEGV". But at least we're getting something similar on the same sample...
So I get a segfault when I run it normally; the illegal instruction only shows up under gdb.
Interestingly, it is machine dependent, and the machine it fails on is a much older one. I wonder if it may actually be an illegal instruction. Are all of your jobs scheduled on the same node? Can you provide the output of cat /proc/cpuinfo?
There is a minimal requirement to have bit-manipulation instructions, I believe, given the way it's compiled, plus some other SSE instructions.
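A sketch of a check along those lines: read /proc/cpuinfo on each node and compare against the instruction-set extensions the build is assumed to need (the REQUIRED set here is an illustrative assumption, not piscem's documented requirements):

```python
# REQUIRED is an assumption for illustration, not piscem's documented list.
REQUIRED = {"bmi1", "bmi2", "popcnt", "sse4_1", "sse4_2"}

def host_flags(path="/proc/cpuinfo"):
    """Union of CPU feature flags advertised by all processors in cpuinfo."""
    flags = set()
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    flags |= set(line.split(":", 1)[1].split())
    except FileNotFoundError:  # e.g. macOS has no /proc
        pass
    return flags

missing = REQUIRED - host_flags()
print("missing:", sorted(missing) if missing else "none")
```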
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
microcode : 0x1d
cpu MHz : 2666.000
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
apicid : 23
initial apicid : 23
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ssbd rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida spec_ctrl intel_stibp flush_l1d
bogomips : 5332.56
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 31
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7313 16-Core Processor
stepping : 1
microcode : 0xa0011ce
cpu MHz : 2994.424
cache size : 512 KB
physical id : 1
siblings : 16
core id : 15
cpu cores : 16
apicid : 31
initial apicid : 31
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips : 5976.06
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro
processor : 87
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
stepping : 1
microcode : 0xb000040
cpu MHz : 1200.400
cache size : 56320 KB
physical id : 1
siblings : 44
core id : 28
cpu cores : 22
apicid : 121
initial apicid : 121
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit mmio_stale_data
bogomips : 4401.78
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
In particular, here is what I get if I take the intersection of the instruction sets of the machines where everything works, minus the instruction set of the machine where it doesn't:
{'pcid', '3dnowprefetch', 'cpuid', 'cqm_mbm_total', 'abm', 'invpcid', 'arat', 'bmi1', 'cqm_occup_llc', 'popcnt', 'aes', 'x2apic', 'invpcid_single', 'cqm_llc', 'fma', 'xsave', 'smep', 'cdp_l3', 'pdpe1gb', 'avx2', 'cat_l3', 'fsgsbase', 'smap', 'cqm_mbm_local', 'movbe', 'avx', 'f16c', 'bmi2', 'rdseed', 'pclmulqdq', 'cqm', 'rdt_a', 'adx'}
If I look across those, I am fairly certain that we don't enable AVX/AVX2 when we build on conda, but we do enable BMI2. Now, there exists a flag in the build to disable this, but it's a pretty widespread instruction set and important for bit-operation speed. Intel introduced BMI2 in its (now very old) Haswell line of processors, and AMD adopted it soon after. If the machine where you are seeing this behavior is missing it, that could be a cause. I'd prefer not to disable BMI2 in the conda build, so we'd have to see what other options exist. Of course, we'd need to confirm that this is, in fact, the issue first.
--Rob
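The flag comparison described above boils down to a couple of set operations; a toy sketch with heavily truncated stand-in flag sets (not the real machines' full lists):

```python
# Toy stand-ins for the flag sets of each machine (heavily truncated).
working_a = {"sse4_2", "popcnt", "bmi1", "bmi2", "avx", "avx2"}
working_b = {"sse4_2", "popcnt", "bmi1", "bmi2", "avx2", "adx"}
failing   = {"sse4_2", "popcnt"}

# Flags present on every working machine but absent on the failing one
# are the prime suspects for an illegal-instruction crash.
suspects = (working_a & working_b) - failing
print(sorted(suspects))  # ['avx2', 'bmi1', 'bmi2']
```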
Hi @rob-p, this is way too close to the metal for me to contribute anything helpful, so I have asked our internal HPC guys if they can comment. I'll let you know what they feed back :)
Will
I've also just shared a gDrive folder with your UoM email, called _simpleaffastqs. This is an additional sample that didn't work for us. Hopefully you've got this - let me know if not :)
I got the link. I will try this on both machines above and see if the issue is the same. Btw, what's the chemistry on this sample?
--Rob
Great :)
Chemistry is 10xV3. The call I used had all the same options as in my first comment here:
# simpleaf quantification
simpleaf quant \
--reads1 $R1_fs \
--reads2 $R2_fs \
--threads 16 \
--index $ALEVIN_FRY_HOME/human_2020-A_splici/index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw \
--t2g-map $ALEVIN_FRY_HOME/human_2020-A_splici/index/t2g_3col.tsv \
--unfiltered-pl $CELLRANGER_DIR/3M-february-2018.txt --min-reads 1 \
--output ./Macnair_2022/output/ms01_alevin_fry/af_WM177
Response from an HPC colleague:
Hi Will, all our CPU nodes should be Cascade Lake, which is newer than Haswell. And we do have the BMI2 flag in the interactive node CPUs, for example.
pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
But Anaconda is different from Miniconda - maybe that is the difference. I will try to reproduce the error.
Best, Erica
Thanks, @wmacnair! By the way, in the spirit of full reproducible transparency, the build I had that worked on the BMI2 node but failed on the non-BMI2 node was installed using micromamba, because I've lost patience with both conda and miniconda ;P. However, the underlying channels and package versions were the same.
--Rob
Ok, some more brief updates. Apart from the machine that exhibits the illegal instruction, I've been unable to trigger a segfault. I tried the original dataset you pointed me at, as well as the one @marusakod suggested (SAMN13262712), and also the one you shared via Google Drive, @wmacnair.
I tried on a machine from 2016 (Intel Xeon), one from 2023 (AMD Epyc), and two macOS machines (a 2018 Intel Mac and a 2020 Intel Mac). On none of these machines was I able to produce the segfault on these data using the simpleaf and piscem installed via micromamba.
It would be great if your HPC folks had any extra insight or luck reproducing the issue. Otherwise, the next thing I would ask is whether you could provide some extra details about the specific OS distribution and version you are running on. My next step would be to try to more closely reproduce your environment by spinning up a Docker container with that precise OS and then installing these tools via conda, to see whether something about the specific library versions and setup of that distribution and version helps trigger the issue.
--Rob
Thanks for this. It's slightly frustrating that we have the error and you don't, but I guess that's reassuring for simpleaf / alevin - hopefully this is a pretty rare case, due to weird conda installation peculiarities.
I've passed on your most recent comment to Erica, and I'll let you know what I hear back.
Will
Erica also doesn't find any errors...
We are wondering whether it could be a problem with the index:
I ended up generating the index locally, because piscem created hundreds of thousands of temp files when indexing on our cluster, so I had to kill it. So the index was generated on my laptop, then copied over to the cluster.
Does it sound like this could be an explanation, or part of one?
Will
Hi @wmacnair,
Thanks for the updates. I also agree that it's frustrating, since it seems that you and @marusakod are able to reproduce the issue deterministically on some samples, so it doesn't seem to be e.g. a stochastic thread-related issue.
Though the index should be broadly compatible between machines (I built one index on the old Intel machine and used it on all of those I listed above), it is possible that the index could be at fault. Are you sure the index completed successfully? Also, if there are different data layouts between the systems, that could perhaps be problematic.
What if Erica tries using your index? If you want to put it on Google drive, I could also try running with your index.
Best, Rob
We're currently trying out different combinations - I ran with her conda env and my index. I'm next going to try my conda env and her index.
And I thiiiink the index completed successfully, but I'm not completely sure. If this second test works ok, then it looks like it is the index that is at fault.
Hi @wmacnair, please do keep me updated here! I'd be really interested to know if it was related to the index either being (a) incomplete or (b) built on a different machine. As I noted, the index should be transferable between most machines (barring things like big-endian to little-endian transfer). If you suspect (a), then I can take a look at the actual index and see if there is any evidence of that from the files themselves or the log. In the piscem build, we check the return codes of all sub-programs we run (cuttlefish2 and piscem-cpp build) and should bail if any of them fail. However, it would certainly be nice to have some sort of explicit "finished" marker that would prevent trying to use an incomplete index in downstream steps.
Best, Rob
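The "finished" marker idea could be as simple as writing a sentinel file only after every build step has returned successfully, and refusing to load an index that lacks it. A sketch (the sentinel name and fields are invented for illustration, not piscem's actual layout):

```python
import json
import os
import tempfile
import time

SENTINEL = "BUILD_COMPLETE.json"  # invented name, not piscem's

def mark_complete(index_dir):
    """Write the sentinel only after all build steps have succeeded."""
    with open(os.path.join(index_dir, SENTINEL), "w") as f:
        json.dump({"finished_at": time.time()}, f)

def assert_complete(index_dir):
    """Refuse to use an index directory without the sentinel."""
    if not os.path.exists(os.path.join(index_dir, SENTINEL)):
        raise RuntimeError(
            f"index at {index_dir} looks incomplete (no {SENTINEL}); "
            "please rebuild it")

# demo
d = tempfile.mkdtemp()
mark_complete(d)
assert_complete(d)  # passes silently once the sentinel exists
```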
Hi @rob-p, ok that's good to know.
Frustratingly for resolving this issue, right now I have other things that I need to work on, and Monday and Tuesday I'm on vacation.
But I have put this on my to-do list, and hopefully I'll be able to get to it by the end of next week :)
Cheers Will
Hi @wmacnair,
I hope you had a nice vacation. I just wanted to freshen this up on your list so that I know if I should be investigating further or closing this issue. Just let me know when you have a chance to get around to it.
Best, Rob
Hi @rob-p
I'm currently having a look through where I had previously got to. There have been multiple different indices / conda envs, and I worry that I have got confused somewhere. So feel free to make concrete suggestions on tests that I can do.
When I last looked, we had tried various combinations of index and simpleaf conda env (mine vs. Erica's). This suggested that the problem was with my index. However, today I rebuilt the index and ran using that, and got the same failure.
So could it be that indices created with my conda env are faulty??
These are the relevant lines from the conda env export YAML:
channels:
- bioconda
- conda-forge
- main
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- alevin-fry=0.8.2=h4ac6f70_0
- boost-cpp=1.78.0=h5adbc97_2
- bzip2=1.0.8=h7f98852_4
- icu=70.1=h27087fc_0
- libgcc-ng=13.1.0=he5830b7_0
- libgomp=13.1.0=he5830b7_0
- libhwloc=2.9.1=hd6dc26d_0
- libiconv=1.17=h166bdaf_0
- libjemalloc=5.3.0=hcb278e6_0
- libstdcxx-ng=13.1.0=hfd8a6a1_0
- libxml2=2.10.3=hca2bb57_4
- libzlib=1.2.13=hd590300_5
- piscem=0.6.0=h09b9a2f_2
- salmon=1.10.2=hecfa306_0
- simpleaf=0.14.1=h4ac6f70_0
- tbb=2021.9.0=hf52228f_0
- xz=5.2.6=h166bdaf_0
- zstd=1.5.2=hfc55251_7
Does anything there look strange?
I'll keep looking, so please keep this open for now. Hopefully we can resolve it by mid next week.
Will
Hi @wmacnair,
I can check the detail of my conda environments as well and report back here. However, in the meantime, would it be possible for you to share your actual index? Inspecting that may actually help to point at what could be going wrong at the most fine-grained technical level.
Here is what I have in my micromamba environment for simpleaf:
List of packages in environment: "/fs/cbcb-lab/rob/rob/envs/simpleaf"
Name Version Build Channel
-----------------------------------------------------
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
alevin-fry 0.8.2 h4ac6f70_0 bioconda
boost-cpp 1.78.0 h5adbc97_2 conda-forge
bzip2 1.0.8 h7b6447c_0
icu 70.1 h27087fc_0 conda-forge
libgcc-ng 13.1.0 he5830b7_0 conda-forge
libgomp 13.1.0 he5830b7_0 conda-forge
libhwloc 2.9.1 hd6dc26d_0 conda-forge
libiconv 1.17 h166bdaf_0 conda-forge
libjemalloc 5.3.0 hcb278e6_0 conda-forge
libstdcxx-ng 13.1.0 hfd8a6a1_0 conda-forge
libxml2 2.10.3 hca2bb57_4 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
lz4-c 1.9.4 h6a678d5_0
piscem 0.6.0 h09b9a2f_2 bioconda
salmon 1.10.2 hecfa306_0 bioconda
simpleaf 0.14.1 h4ac6f70_0 bioconda
tbb 2021.9.0 hf52228f_0 conda-forge
xz 5.4.2 h5eee18b_0
zlib 1.2.13 hd590300_5 conda-forge
zstd 1.5.5 hc292b87_0
Thanks! Rob
Thanks for this @rob-p. I've compared my and your packages, and there are a few discrepancies:
user package version build
1: robp bzip2 1.0.8 h7b6447c_0
2: wmacnair bzip2 1.0.8 h7f98852_4
3: robp lz4-c 1.9.4 h6a678d5_0
4: robp xz 5.4.2 h5eee18b_0
5: wmacnair xz 5.2.6 h166bdaf_0
6: robp zlib 1.2.13 hd590300_5
7: robp zstd 1.5.5 hc292b87_0
8: wmacnair zstd 1.5.2 hfc55251_7
Do any of these stand out?
I should have shared a file called _human_2020-A_splicibuggy.zip with you via gDrive - let me know if that hasn't worked.
And let me know if you have any other suggestions!
Cheers Will
I got the file link. I'll take a look at that and report back here!
Thanks, Rob
Hi @wmacnair,
So we have some real progress now! I can reproduce the segfault on a machine where I previously could not when using your index. I haven't debugged further yet, but this can be investigated now. One thing I noticed when looking at your index folder (not the cause of the segfault, I believe) is that there seem to be files related to both a piscem index and a salmon index! Was there some attempt to build both of the indices in the same output folder?
In terms of the indices themselves, I'll note that your `buggy` index is ever so slightly smaller than my working index (and, I suspect, than your working index):
```
…4 ❯ wc -c wills_index/human_2020-A_splici_buggy/index/piscem_idx.*
 957723916 wills_index/human_2020-A_splici_buggy/index/piscem_idx.ctab
 476412746 wills_index/human_2020-A_splici_buggy/index/piscem_idx.ectab
   6583967 wills_index/human_2020-A_splici_buggy/index/piscem_idx.refinfo
1888162004 wills_index/human_2020-A_splici_buggy/index/piscem_idx.sshash
3328882633 total

…4 ❯ wc -c piscem_index/index/piscem_idx.*
 957740300 piscem_index/index/piscem_idx.ctab
 476405278 piscem_index/index/piscem_idx.ectab
   6617391 piscem_index/index/piscem_idx.refinfo
1888162004 piscem_index/index/piscem_idx.sshash
3328924973 total
```
This suggests to me that, for some reason, either the index construction was bugged, or the full index did not write to disk properly. It seems that the `sshash` component is exactly the same size, so the problem lies with the contig table (`ctab`) and the reference information (`refinfo`).

Edit: It's not clear the size differences in `refinfo` are meaningful. They are due to a slight difference in the naming of intronic transcripts between the version of `simpleaf` you have (which uses `pyroe`) and the latest version with which I built the reference (which uses `roers`). I believe the contig tables (`ctab`) should be the same size regardless, but I've not yet investigated that.
Edit again: Ok, so now I'm actually curious ;). Since you are using the latest `simpleaf`, the reference should have been generated with `roers` rather than with `pyroe`. We moved from `pyroe` (a Python package based on `pyranges`) to `roers` (a Rust tool of our lab's own making, based on `pola.rs`) because `pyranges` frequently introduced regressions and necessitated pinning many other Python packages. When using `simpleaf 0.14.1` (and `0.14`), the reference sequence should have been generated using `roers` rather than `pyroe`, yet yours seems to have been generated with the latter, despite you certainly having the latest version of `simpleaf`. This is even stranger as `pyroe` appears not even to be in your `conda` environment. I'm not at all sure that this is the cause of the problem since, as I said above, the reference sequences themselves seem equivalent, but it certainly does make me suspicious as to how this happened. Any ideas?
Thanks, Rob
Hi Rob
Well, at least we've found a culprit! I just shared another index with you, which I freshly built last week. I think we still get errors with this - would you mind checking that too?
Thanks Will
Hi Will,
Thanks! I got the new file. At first look, it's certainly cleaner (e.g. there's no evidence of a partial salmon index, just the piscem index). However, it still seems the reference was created with `pyroe`. You can see this in the `simpleaf_index_log.json` file, where there is a `pyroe` command rather than a `roers` command. Again, this isn't by itself a problem, but I don't understand why, with the latest `simpleaf` tool, `pyroe` is being used to create the reference rather than `roers`. Does this happen with a clean install of the latest version?
Update: @wmacnair: Regarding this specific index, I verified that I get the segfault with it as well. I also verified that the contig table (the `.ctab` file) is, like the other index you sent, smaller than the one in my (working) index. I don't know precisely why this is happening, but it's my current best guess as to the root of the problem. I believe that the contig tables should be the same size. I suspect there are a few missing entries in the contig table in your non-working indices, which leads to a problem if and only if reads map to where those contigs should appear. I still feel this may be related to the fact that, despite having the latest version of simpleaf, your construction still seems to be using `pyroe` for reference building; it suggests to me there may be some version mismatches or multiple versions of some tools present in the environment, so I'd still like to get to the bottom of that somehow.
Best, Rob
Hi @rob-p
Pretty sure this is a personal best in terms of GitHub threads for me now 😄
I have tried a fresh `simpleaf` conda environment and rebuilt the indices with that (I also confirmed that the `simpleaf_index_log.json` file has `roers` calls but nothing referring to `pyroe`).
Now the previous files that used to give segfaults do not, but in running it on other datasets we still find some that do.
I have shared a folder with you (called `share_with_robp`) which contains:

- the `yml` for the clean conda `simpleaf` install
- the index (`human_2020-A_splici_clean.tar.gz`)
- `fastq` files for a sample that fails (`simpleaf_fastqs_2`)
- the `.sh` file for how I ran `simpleaf` on these files, plus standard output and error logs

(please drop me a line if any of this is unclear)
I'm very curious to know whether you also get a segfault here...
Best Will
Thanks @wmacnair!
I appreciate all of the help! I got the files you shared. I'll let you know if I get the segfault and if I can also reproduce it with my own index or not.
Thanks, Rob
Ok, this new index looks almost identical to the one I built, so I suspect I will see the problems with my index as well. I am moving the files to the server to check and will let you know soon.
Edit: Indeed! Success in failure. Your latest reads fail with my index as well. Now we have a fully reproducible example!
Hi @wmacnair,
Ok. Good news. I was able to track this down and find the offending read and logic. It resulted from (as expected) an incredibly rare occurrence where it was possible to mistakenly try to extend a failed k-mer search to find the next k-mer. In this particular case that search succeeded, but because the predecessor failed, some critical information that the index needed was missing. Ultimately, this resulted in subtracting 1 from our sentinel value (`std::numeric_limits<uint64_t>::max()`) and then trying to use that to index into the index ... which obviously was out of bounds. I've fixed this upstream in `piscem-cpp` on the `dev` branch, and added a bit more defensive code around this logic to try to guard against such tricky edge cases in the future. I'm doing a bit of traveling tomorrow, so I'm not sure if I'll have a chance to cut and push the new `piscem` release, but I'll let you know when I do. The other good news is that the bug was only in `piscem` (so you'll only have to pull down the bumped version of that) and it was entirely in the query logic and not in the index building (so you won't have to rebuild the index). I'll keep you posted, and I'm happy to answer any other questions you might have!
Best, Rob
That sounds great! I will also be unavailable next week, but I think @marusakod will be interested in any bug-fixed version.
A comment here - in our experience, I would say that this was rare, but not incredibly rare. Across data from multiple studies, failure rates for us ranged between 2 out of 150 samples and more than 5 out of 34. (As I'm unfamiliar with k-mers, I didn't really get the reason for the failure, so I can't really speculate on why this might be.)
Thanks so much for all your efforts to look into this :) Will
Hi @wmacnair,
Thanks for the heads up. I just meant incredibly rare in the space of k-mers. If we assume that each dataset comprises billions (or tens of billions) of k-mers, this corner case occurs very infrequently. Of course, the point is that it was a small but important missing case in the logic of how streaming query was handled, so it was great that you found it and a dataset that reliably reproduces it. I'll ping back here and also tag @marusakod when the new piscem is up on bioconda.
Best, Rob
Hi @marusakod,
The bioconda process is en route, but it takes a while. In the meantime, if you would like to check with a pre-compiled `piscem` executable that the fix works on your end, I've created pre-compiled executables for both Linux and aarch64 OSX on the piscem releases page. The latest release (v0.6.1) should have the fix.
Best, Rob
Hi `simpleaf` team,

First off, a thank you: we've been using `simpleaf` to do lots of mapping for a single cell atlas project, and it has been making everything faster and smoother (especially being able to get S/U/A counts super fast). This is awesome!

However, there is a small proportion of samples where it doesn't work, and we can't really see any reason why. I've pasted some example output below.
This doesn't appear to be driven by, e.g., tiny file sizes due to RNAseq not working:
Right now it's difficult to diagnose what is going on... Most samples in this experiment work, a handful don't, and I don't really see anything I can work with from the error message. Any suggestions?
Thanks! Will
(Tagging @marusakod as she has had similar problems.)