COMBINE-lab / simpleaf

A Rust framework to make using alevin-fry even simpler
BSD 3-Clause "New" or "Revised" License

Error: alevin-fry generate-permit-list failed with exit status ExitStatus(unix_wait_status(25856)) #134

Closed lichtobergo closed 6 months ago

lichtobergo commented 6 months ago

Hi, I encountered another issue after my first issue #133 was successfully solved. I'm working in a conda environment generated as follows:

mamba create -n af -y -c bioconda -c conda-forge simpleaf piscem

I tried to quantify 10x Genomics 3'GEX v3.1 data against a splici index and followed the guide described here: https://www.sc-best-practices.org/introduction/raw_data_processing.html#a-real-world-example

Everything worked until the simpleaf quant command, which exits with the following output:

2024-03-15T08:31:48.046213Z  INFO simpleaf::simpleaf_commands::quant: piscem map-sc cmd : /home/haberl/miniforge3/envs/af/bin/piscem map-sc --index simpleaf_index/index/piscem_idx --threads 8 -o pseudoalignment/quants/006/af_map --max-ec-card 4096 --skipping-strategy permissive --max-hit-occ 256 --max-hit-occ-recover 1024 --max-read-occ 2500 -1 GEX_fastq/006_R1.fastq.gz -2 GEX_fastq/006_R2.fastq.gz --geometry chromium_v3
2024-03-15T09:13:40.029757Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2024-03-15T09:13:40.029828Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry generate-permit-list cmd : /home/haberl/miniforge3/envs/af/bin/alevin-fry generate-permit-list -i pseudoalignment/quants/006/af_map -d fw --unfiltered-pl alevin_fry_home/plist/10x_v3_permit.txt --min-reads 10 -o pseudoalignment/quants/006/af_quant
2024-03-15T09:13:59.934347Z ERROR simpleaf::utils::prog_utils: command unsuccessful (exit status: 101): "/home/haberl/miniforge3/envs/af/bin/alevin-fry" "generate-permit-list" "-i" "pseudoalignment/quants/006/af_map" "-d" "fw" "--unfiltered-pl" "alevin_fry_home/plist/10x_v3_permit.txt" "--min-reads" "10" "-o" "pseudoalignment/quants/006/af_quant"
2024-03-15T09:13:59.934398Z ERROR simpleaf::utils::prog_utils: stderr :
====
2024-03-15 10:13:44 INFO number of unfiltered bcs read = 6,794,880
2024-03-15 10:13:44 INFO paired : false, ref_count : 119,481, num_chunks : 30,376
2024-03-15 10:13:44 INFO read 2 file-level tags
2024-03-15 10:13:44 INFO read 2 read-level tags
2024-03-15 10:13:44 INFO read 1 alignemnt-level tags
2024-03-15 10:13:44 INFO File-level tag values TagMap { keys: [TagDesc { name: "cblen", typeid: Int(U16) }, TagDesc { name: "ulen", typeid: Int(U16) }], dat: [U16(16), U16(12)] }
thread 'main' panicked at /opt/conda/conda-bld/alevin-fry_1709936811913/work/.cargo/git/checkouts/libradicl-4bc45f05b43aa027/1e99e76/src/record.rs:230:48:
called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
====
Error: alevin-fry generate-permit-list failed with exit status ExitStatus(unix_wait_status(25856))
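An aside on the opaque `ExitStatus(unix_wait_status(25856))` in the message: on Unix the child's exit code is packed into the high byte of the raw wait status, so shifting right by 8 recovers the plain exit code, 101, which matches the `exit status: 101` in the log above and is the standard exit code of a Rust panic:

```shell
# The raw wait status packs the child's exit code into its high byte.
echo $(( 25856 >> 8 ))   # prints 101, Rust's standard panic exit code
```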

My hardware info: Ubuntu Linux 22.04 LTS, kernel 6.5.0-24-generic. Output of lscpu (translated from German):

Architecture:                      x86_64
  CPU op-mode(s):                  32-bit, 64-bit
  Address sizes:                   46 bits physical, 48 bits virtual
  Byte Order:                      Little Endian
CPU(s):                            12
  On-line CPU(s) list:             0-11
Vendor ID:                         GenuineIntel
  Model name:                      Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz
    CPU family:                    6
    Model:                         63
    Thread(s) per core:            1
    Core(s) per socket:            6
    Socket(s):                     2
    Stepping:                      2
    CPU max MHz:                   1600.0000
    CPU min MHz:                   1200.0000
    BogoMIPS:                      3200.03
    Flags:                         fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon
                                    pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
                                    movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase
                                    tsc_adjust bmi1 avx2 smep bmi2 erms invp

The quant command worked once and finished successfully, but since then only this error occurs.

rob-p commented 6 months ago

Thanks for the report, @lichtobergo. Would you be able to share the output of your mapping directory via some mechanism? That is pseudoalignment/quants/006/af_map. This would allow us to debug whatever is going wrong in the processing of this RAD file.

One thing you might also try is to see if you get the same error if you explicitly install alevin-fry 0.8.2 in your conda environment. It was just recently updated to 0.9.0 and, though that release was tested beforehand, it would be good to know if the issue is specifically tied to the new version.

Thanks! Rob

lichtobergo commented 6 months ago

I could share af_map via ownCloud, if that is OK with you. I will try your suggestion of pinning the alevin-fry version and report back.

rob-p commented 6 months ago

Sounds good!

lichtobergo commented 6 months ago

Here's the link. Let me know when the download is finished so I can take it down.

rob-p commented 6 months ago

Hi @lichtobergo,

Thank you! I have been able to download the data and reproduce the error you list above; you can remove the files now. I will note that I tried alevin-fry 0.8.2 and the problem persists, so I don't think the issue is tied to the alevin-fry update. It seems, somehow, that the RAD file is incomplete / shorter than expected. Is the base data on which you're running publicly available? I'd like to try to regenerate the RAD file and see whether it shows up as truncated reproducibly, or whether something is going on that may be specific to your system.

Thanks, Rob

lichtobergo commented 6 months ago

Thanks again, Rob, for your efforts in helping me! Unfortunately the base data is not publicly available, and I don't know the best way to share such large fastq files. I will try to use older data that I already analyzed last year in August on a different machine (which is unfortunately not around anymore); at least I know that data is fine. Would that help?

rob-p commented 6 months ago

Anything that would let us reproduce the base error on our end would help, as it would let us diagnose the issue. For example, if you run into the same issue with another, publicly available dataset, we can check on that. Alternatively, do you observe the same issue on a small subset of this data (say, just the first 5M reads or so of both files)? In that case, those file prefixes might be small and easy to share.
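A hypothetical way to cut such a prefix with plain coreutils (not a command from this thread): since each FASTQ record is exactly 4 lines, the first 5M reads are the first 20,000,000 lines. The demo below uses a tiny 3-read stand-in file rather than real data:

```shell
# Build a tiny 3-read gzipped FASTQ as a stand-in for real data.
printf '@read%d\nACGT\n+\nIIII\n' 1 2 3 | gzip -c > demo_R1.fq.gz
# Keep the first 2 reads = 8 lines (for 5M real reads: head -n 20000000).
zcat demo_R1.fq.gz | head -n 8 | gzip -c > sub_R1.fq.gz
zcat sub_R1.fq.gz | grep -c '^@read'   # prints 2: both kept reads survive
rm demo_R1.fq.gz sub_R1.fq.gz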

--Rob

lichtobergo commented 6 months ago

OK, I will run a public data set on my machine and, depending on the results, I will then share subsets of our data.

rob-p commented 6 months ago

Thank you!

lichtobergo commented 6 months ago

Hi Rob, I tested simpleaf with the published test data set found here. With this data and reference everything went well: I could build the splici index and with it quantify the reads. It seems like there is a problem with my data, which would be even worse. As you suggested, I extracted the first 5,000,000 reads from my data and uploaded them to ownCloud. I used the following command to do this and hope that is OK.

seqkit head -n 5000000 005_R1.fastq.gz > sample_R1.fq.gz
seqkit head -n 5000000 005_R2.fastq.gz > sample_R2.fq.gz

Genome fasta. Genes gtf. I hope that is enough to reproduce the error. I just want to add that I appreciate your efforts tremendously!

Michael

rob-p commented 6 months ago

Hi @lichtobergo,

Thank you! I will test this out today. Just to verify: you observe the error on this subset of your data as well? If so, I will try to reproduce it on our machine and debug locally.

best, Rob

lichtobergo commented 6 months ago

I'm still trying to confirm that, but my machine has been building the splici index for 4 hours now. Maybe it is stuck? (I know I don't have to build it from scratch every time, but did anyway.) I will get back to you once I know whether quantifying works with this subset. Sorry for making this debugging so hard; I'm just a user of these tools and not used to this process.

lichtobergo commented 6 months ago

Just as an update: quantifying the subset was successful. Here is the output of the simpleaf quant command:

(simpleaf) haberl@UG-UKWB100-S13:~/mnt/data/p1456$ simpleaf quant \
> -c 10xv3 -t12 \
> -1 GEX_fastq/sample_R1.fq.gz -2 GEX_fastq/sample_R2.fq.gz \
> -i $HOME/references/simpleaf_index/index \
> -u -r cr-like \
> -m $HOME/references/simpleaf_index/index/t2g_3col.tsv \
> -o simpleaf_quant
2024-03-18T14:21:27.010138Z  INFO simpleaf::simpleaf_commands::quant: piscem map-sc cmd : /home/haberl/miniforge3/envs/simpleaf/bin/piscem map-sc --index /home/haberl/references/simpleaf_index/index/piscem_idx --threads 12 -o simpleaf_quant/af_map --max-ec-card 4096 --skipping-strategy permissive --max-hit-occ 256 --max-hit-occ-recover 1024 --max-read-occ 2500 -1 GEX_fastq/sample_R1.fq.gz -2 GEX_fastq/sample_R2.fq.gz --geometry chromium_v3
2024-03-18T14:22:02.616907Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2024-03-18T14:22:02.617000Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry generate-permit-list cmd : /home/haberl/miniforge3/envs/simpleaf/bin/alevin-fry generate-permit-list -i simpleaf_quant/af_map -d fw --unfiltered-pl ./plist/10x_v3_permit.txt --min-reads 10 -o simpleaf_quant/af_quant
2024-03-18T14:22:07.965578Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2024-03-18T14:22:07.965636Z  INFO simpleaf::simpleaf_commands::quant: alevin-fry collate cmd : /home/haberl/miniforge3/envs/simpleaf/bin/alevin-fry collate -i simpleaf_quant/af_quant -r simpleaf_quant/af_map -t 12
2024-03-18T14:22:09.291224Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)
2024-03-18T14:22:09.291273Z  INFO simpleaf::simpleaf_commands::quant: cmd : "/home/haberl/miniforge3/envs/simpleaf/bin/alevin-fry" "quant" "-i" "simpleaf_quant/af_quant" "-o" "simpleaf_quant/af_quant" "-t" "12" "-m" "/home/haberl/references/simpleaf_index/index/t2g_3col.tsv" "-r" "cr-like"
2024-03-18T14:22:11.798996Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)

I will now quickly check if the full data set also runs successfully.

lichtobergo commented 6 months ago

So, back again. With the full data I still get the same error as in the original post. That seems really weird to me.

rob-p commented 6 months ago

It is super weird! Here's an idea: what if you extract and run on just the last 5 or 10 million reads in the file? I suspect something strange about either the reads themselves or the mapping output in the file.
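A hypothetical coreutils sketch of that tail cut (again on a tiny stand-in file; the line count handed to tail must be a multiple of 4 so the cut lands on a record boundary):

```shell
# Tiny 3-read gzipped FASTQ as a stand-in for real data.
printf '@read%d\nACGT\n+\nIIII\n' 1 2 3 | gzip -c > demo_R1.fq.gz
# Keep the last 2 reads = 8 lines (for 5M real reads: tail -n 20000000).
zcat demo_R1.fq.gz | tail -n 8 | gzip -c > tail_R1.fq.gz
zcat tail_R1.fq.gz | grep -c '^@read'   # prints 2
rm demo_R1.fq.gz tail_R1.fq.gz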

—Rob

lichtobergo commented 6 months ago

OK, new discoveries. As you suggested, I tried the last 5 million reads and quantification was successful. So I thought maybe it has to do with the size of the fastq files. As it is now, I'm working on a remote machine and the data is on a different server (I think a Samba share) which I have mounted on the remote machine via CIFS in fstab. Out of curiosity I copied the fastq files to the remote machine and tried again, and this time it worked with the full data set. Although I have no clue why and how that causes trouble, I think we are closing in on the problem. Unfortunately, copying the data is no long-term solution, as capacity on the remote machine is limited and we have a lot of data to analyze. Could there be a workaround for that?

rob-p commented 6 months ago

Hi @lichtobergo,

Thanks for the excellent detective work. Indeed, I think you're closing in on something. The fact that everything works properly when the file is local but fails when the file is remote certainly points to a plausible cause.

In theory, there is no reason that local and remote storage should behave or be treated differently: all of the relevant differences should be abstracted away at the level of the OS and filesystem, so that by the time we get to piscem, the remote file behaves just like a local one. However, I imagine there could be hiccups in serving up a large file that, perhaps, result in a truncated stream or a failed system call at some point. I don't immediately have a great solution to this problem, but you might ask your sysadmin about the settings on the filestore: is there anything that might cause a problem or could be set differently?

The other thought I have is even more of a hack, but it might work. Specifically, piscem can read from named pipes, so one thing you might try is to pipe each file through a named pipe and let the OS buffer the contents and deal with any issues in the background. That would look something like the following (assuming your shell is bash-like and supports process substitution).

$ simpleaf quant <... other options ...> -1 <(gunzip -c path_to_compressed_read_1.fq.gz) -2 <(gunzip -c path_to_compressed_read_2.fq.gz) <... any other options ..>

The <(command) syntax is just shorthand for creating a FIFO (named pipe) and connecting the command's standard output to it. Perhaps reading the files this way will avoid whatever error is being encountered when reading from the remote Samba share natively?
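Spelled out with an explicit mkfifo, and with gzip and cat standing in for the real compressed reads and for piscem as the reader, the same plumbing looks like this (a self-contained demo, not a command from this thread):

```shell
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/r1.pipe"
printf 'hello\n' | gzip -c > "$tmpdir/reads.gz"
# Writer decompresses into the FIFO in the background...
gzip -dc "$tmpdir/reads.gz" > "$tmpdir/r1.pipe" &
# ...while the reader (piscem, in the real case) consumes the stream.
cat "$tmpdir/r1.pipe"   # prints: hello
wait
rm -r "$tmpdir"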

--Rob

lichtobergo commented 6 months ago

Hi Rob, thanks for your tips! I may try that. In the meantime I tried using a Snakemake pipeline (which I wanted to use anyway) to copy a set of read files to local storage, run simpleaf, delete the reads, and repeat. By doing so I found that simpleaf also fails when the output is written to the remote while the read files are local. I will have to ask our sysadmin what's up with that. I'll close this issue now, as it clearly has nothing to do with simpleaf. But thanks a bunch for your great help, and for making all these great tools with your group!

Best, Michael

rob-p commented 6 months ago

Thanks, Michael! Please do reach out again if you run into any other issues and feel free to comment in this thread if your sysadmin is able to resolve this issue.

Best, Rob