COMBINE-lab / simpleaf

A rust framework to make using alevin-fry even simpler
BSD 3-Clause "New" or "Revised" License
46 stars 5 forks source link

read lengths differ #162

Closed robertzeibich closed 2 weeks ago

robertzeibich commented 1 month ago

R1 27; R2 125 R1 29; R2 99 R1 29; R2 92

Alignment crashes when the read length is less than 31.

[2024-10-19 14:41:05.213] [info] enable structural constraints : false [2024-10-19 14:41:05.213] [info] No poison k-mer map exists, or it was requested not to be used [2024-10-19 14:41:05.213] [info] loading index from /scratch/xm41/hg38_resources/scRNA/refdata-gex-GRCh38-2020-A/human-2020-A_splici/91/index/piscem_idx [2024-10-19 14:41:08.690] [info] done loading index Error: piscem mapping failed with exit status ExitStatus(unix_wait_status(134))

Should I create an index with simpleaf index -k 26?

rob-p commented 1 month ago

Can you please say how you are running simpleaf (the invocation) and also what versions of simpleaf, piscem and alevin-fry you are using? R1 is usually the technical read and only needs to be long enough to hold the barcode and UMI. Most commonly, it’s not even attempted to align it. Regardless, in the worst case reads shorter than the kmer length should be discarded ( not cause termination ). You shouldn’t have to change the index.

—Rob

robertzeibich commented 1 month ago

I highly appreciate your quick response. I am following the simpleaf online documentation.

Created a conda enviroment.

alevin-fry 0.10.0 h919a2d8_0 bioconda piscem 0.10.3 h7c313c4_0 bioconda salmon 1.10.3 h6dccd9a_2 bioconda simpleaf 0.17.2 h919a2d8_0 bioconda

rlen=91
IDX_DIR="${REF_DIR}/human-2020-A_splici/${rlen}"
 simpleaf index \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf \
--rlen ${rlen} \
--threads $(nproc) \
--use-piscem
simpleaf quant \
--reads1 ${workdir}/${sra_id}_1.fastq.gz \
--reads2 ${workdir}/${sra_id}_2.fastq.gz \
--threads $(nproc) \
--index $IDX_DIR/index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw --unfiltered-pl \
--t2g-map $IDX_DIR/index/t2g_3col.tsv \
--output ${workdir}/alevin/${sra_id}
rob-p commented 1 month ago

Hrmm, certainly it should not be the case that read 1 is being mapped in 10xv3 chemistry. You more that this happens when read 1 is < 31 nucleotides. Do you have data with longer R1 where it runs to completion?

If it is possible to share your index somehow, and a small sample of reads that reproduce the problem (a few hundred or few thousand should suffice), we’d be happy to take a look.

robertzeibich commented 1 month ago

Sure. I am happy to. Where should I send the files to?

rob-p commented 1 month ago

Hi @robertzeibich,

The read pairs (when gzipped) should be small enough to upload directly here. If you send me an e-mail (rob@cs.umd.edu), I can share a box folder with you where you can upload the index.

--Rob

robertzeibich commented 1 month ago

I selected the first 5,000 rows:

SRR12905322_1.fastq_sub.gz SRR12905322_2.fastq_sub.gz

I hope that this is enough. I will upload the index in a second.

robertzeibich commented 1 month ago

Based on your question: Do you have data with longer R1 where it runs to completion? Yes, I have longer R1 where it runs to completion.

Because of this post, I thought that the problem might be k-mer related.

However, some other runs ran successfully with a low R1 read length.

robertzeibich commented 2 weeks ago

This issue can be closed due to @rob-p support.

If you should run into the same error (see first message), make sure that the FASTQ files were downloaded correctly.

Re different read lenght:

"R1 is usually the technical read and only needs to be long enough to hold the barcode and UMI" R2 contains the sequence. From a given sequence the length can be extracted.

As I stated in my first message, I got read lengths of

R1 27; R2 125 R1 29; R2 99 R1 29; R2 92

At the end, I used the read lengths 98 and 91 to build the indices. However, I also tried 99 and 92 and the number of cells and genes were exactly the same.