COMBINE-lab / simpleaf

A rust framework to make using alevin-fry even simpler
BSD 3-Clause "New" or "Revised" License
45 stars 3 forks source link

read lengths differ #162

Open robertzeibich opened 1 day ago

robertzeibich commented 1 day ago

R1 27; R2 125 R1 29; R2 99 R1 29; R2 92

Alignment crashes when the read length is less than 31.

[2024-10-19 14:41:05.213] [info] enable structural constraints : false [2024-10-19 14:41:05.213] [info] No poison k-mer map exists, or it was requested not to be used [2024-10-19 14:41:05.213] [info] loading index from /scratch/xm41/hg38_resources/scRNA/refdata-gex-GRCh38-2020-A/human-2020-A_splici/91/index/piscem_idx [2024-10-19 14:41:08.690] [info] done loading index Error: piscem mapping failed with exit status ExitStatus(unix_wait_status(134))

Should I create an index with simpleaf index -k 26?

rob-p commented 1 day ago

Can you please say how you are running simpleaf (the invocation) and also what versions of simpleaf, piscem and alevin-fry you are using? R1 is usually the technical read and only needs to be long enough to hold the barcode and UMI. Most commonly, it’s not even attempted to align it. Regardless, in the worst case reads shorter than the kmer length should be discarded ( not cause termination ). You shouldn’t have to change the index.

—Rob

robertzeibich commented 1 day ago

I highly appreciate your quick response. I am following the simpleaf online documentation.

Created a conda enviroment.

alevin-fry 0.10.0 h919a2d8_0 bioconda piscem 0.10.3 h7c313c4_0 bioconda salmon 1.10.3 h6dccd9a_2 bioconda simpleaf 0.17.2 h919a2d8_0 bioconda

rlen=91
IDX_DIR="${REF_DIR}/human-2020-A_splici/${rlen}"
 simpleaf index \
--output $IDX_DIR \
--fasta $REF_DIR/fasta/genome.fa \
--gtf $REF_DIR/genes/genes.gtf \
--rlen ${rlen} \
--threads $(nproc) \
--use-piscem
simpleaf quant \
--reads1 ${workdir}/${sra_id}_1.fastq.gz \
--reads2 ${workdir}/${sra_id}_2.fastq.gz \
--threads $(nproc) \
--index $IDX_DIR/index \
--chemistry 10xv3 --resolution cr-like \
--expected-ori fw --unfiltered-pl \
--t2g-map $IDX_DIR/index/t2g_3col.tsv \
--output ${workdir}/alevin/${sra_id}
rob-p commented 1 day ago

Hrmm, certainly it should not be the case that read 1 is being mapped in 10xv3 chemistry. You more that this happens when read 1 is < 31 nucleotides. Do you have data with longer R1 where it runs to completion?

If it is possible to share your index somehow, and a small sample of reads that reproduce the problem (a few hundred or few thousand should suffice), we’d be happy to take a look.

robertzeibich commented 1 day ago

Sure. I am happy to. Where should I send the files to?

rob-p commented 1 day ago

Hi @robertzeibich,

The read pairs (when gzipped) should be small enough to upload directly here. If you send me an e-mail (rob@cs.umd.edu), I can share a box folder with you where you can upload the index.

--Rob

robertzeibich commented 1 day ago

I selected the first 5,000 rows:

SRR12905322_1.fastq_sub.gz SRR12905322_2.fastq_sub.gz

I hope that this is enough. I will upload the index in a second.

robertzeibich commented 1 day ago

Based on your question: Do you have data with longer R1 where it runs to completion? Yes, I have longer R1 where it runs to completion.

Because of this post, I thought that the problem might be k-mer related.

However, some other runs ran successfully with a low R1 read length.