alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Segfault during 10x 5' read alignment #1679

Open balayev1 opened 2 years ago

balayev1 commented 2 years ago

Hi Alex,

First, thank you for developing and maintaining the STAR aligner, it's a great tool!!!

I have been trying to align reads from 10x 5' protocol using the code specified in the STARsolo manual but keep getting the segmentation fault. These are STARsolo parameters that I use: /gpfs/gibbs/pi/kaminski/public/softwares/STAR-2.7.9a/bin/Linux_x86_64_static/STAR \ --soloType CB_UMI_Simple \ --readFilesCommand zcat \ --soloBarcodeMate 1 \ --clip5pNbases 39 0 \ --soloFeatures Gene GeneFull SJ Velocyto \ --runThreadN 8 \ --soloCBlen 16 \ --soloUMIstart 17 \ --soloUMIlen 10 \ --twopassMode Basic \ --soloStrand Reverse \ --soloCBwhitelist /vast/palmer/scratch/kaminski/ab3478/10x_White_Lists/737K-august-2016.txt \ --genomeDir /gpfs/gibbs/pi/kaminski/public/Backup/Jonas/genome_index_human/GENCODE_release37_GRCh38.p13/STARindex_noGTF/ \ --sjdbGTFfile /gpfs/gibbs/pi/kaminski/public/Backup/Jonas/genome_index_human/GENCODE_release37_GRCh38.p13/GTF/gencode.v37.annotation.gtf \ --readFilesIn /vast/palmer/scratch/kaminski/ab3478/Covid19_Haberman_2020/processedData/sample_out/HD0005/trimmed_merged_fastq/trimmed.R2.fastq.gz /vast/palmer/scratch/kaminski/ab3478/Covid19_Haberman_2020/processedData/sample_out/HD0005/trimmed_merged_fastq/trimmed.R1.fastq.gz \ --outFileNamePrefix /vast/palmer/scratch/kaminski/ab3478/Covid19_Haberman_2020/processedData/sample_out/HD0005/

These are the log output files:

Log.progress.out.txt Log.out.txt

When I tried to remove "--soloBarcodeMate 1 --clip5pNbases 39 0 " and add "--soloBarcodeReadLength 0" option, the run finished successfully but the cDNA on read 1 is not aligned. Could you please tell me what could be the issue? I really appreciate your help.

P.S. I have tried STAR version 2.7.9a and 2.7.10a with the same error.

alexdobin commented 2 years ago

Hi @balayev1

since you use the barcode --soloBarcodeMate 1 and barcode is on mate 1 in the 5' protocol, you need to list Read1 first in the --readFilesIn, i.e.

--readFilesIn /vast/palmer/scratch/kaminski/ab3478/Covid19_Haberman_2020/processedData/sample_out/HD0005/trimmed_merged_fastq/trimmed.R1.fastq.gz /vast/palmer/scratch/kaminski/ab3478/Covid19_Haberman_2020/processedData/sample_out/HD0005/trimmed_merged_fastq/trimmed.R2.fastq.gz
balayev1 commented 2 years ago

Hi Alex,

Thank you for the quick reply. I switched the places of the FASTQ files but with no success. These are log output files:

Log.progress.out.txt Log.out.txt

alexdobin commented 2 years ago

Hi @balayev1,

my next suggestion would be to map untrimmed reads, as trimming may cause troubles.

balayev1 commented 2 years ago

Hi Alex,

Thank you for the reply. I tried to run with the raw FASTQ files but still, get the same error. These are the log output files:

Log.progress.out.txt Log.out.txt

alexdobin commented 2 years ago

Could you please try to map a small subset of reads, with --readMapNumber 1000000 ?

balayev1 commented 2 years ago

Hi Alex,

Run finished successfully after subsetting the reads. Does it mean that there are damaged reads in the FASTQ files? Thank you.

alexdobin commented 2 years ago

It's a possibility... Are these files coming straight from the sequencer, or you are doing some kind of pre-processing? I would try to run it on the "raw" files. And another check - run it without --twopassMode Basic.

balayev1 commented 2 years ago

Hi Alex,

These samples are raw FASTQ files from the GEO dataset GSE135893. I tried to trim repeated TSO sequences from FASTQ files and align them. The run was successful even when I tried it on trimmed FASTQ files with subsetting 1M and 10M reads but not with 50M. When I remove --readMapNumber 1000000and--twopassMode Basic`, the run on raw FASTQ files produces the same error. My concern is when I align only reads in the second FASTQ, the run is successful but not when I include cDNA on read 1. I guess it means I have some short reads in the first FASTQ file.

alexdobin commented 2 years ago

Hi @balayev1

Checking the length of all reads would be a good idea. Also, you can try to run mapping with Read1 and 2, but without any solo options. This dataset seems to contain multiple runs, which SRR did you use? I can try to run to see if I can reproduce the error.

balayev1 commented 2 years ago

Hi Alex,

These are the SRR files for the aligned sample: SRR9985495,SRR9985496,SRR9985497,SRR9985498,SRR9985499,SRR9985500,SRR9985501,SRR9985502,SRR9985503,SRR9985504,SRR9985505,SRR9985506,SRR9985507,SRR9985508,SRR9985509,SRR9985510,SRR9985511,SRR9985512,SRR9985513,SRR9985514,SRR9985515,SRR9985516,SRR9985517,SRR9985518.

Would I not be risking to align part of barcode and UMI sequences as well to the genome, if I map Read 1 separately? Thanks in advance.

Kind regards, Agshin

alexdobin commented 2 years ago

Hi Agshin,

I would recommend mapping these files separately to see if the error occurs for any of them. The last files that you merged are more likely to contain the problem since the mapping goes almost to the end of the merged file.

If you map the merged files with --readFilesIn Read1 Read2 --clip5pNbases 39 0 without solo options, it will clip off the barcode (i.e. 39b of the first read) and the rest should map properly to the genome. This will tell us whether the error occurs in the solo routines or simple mapping.

balayev1 commented 2 years ago

Hi Alex,

Sorry for the long inactivity. I tried to remove all STARSolo options and only kept --readFilesIn Read1 Read2 --clip5pNbases 39 0 and the run finished successfully. I guess the error comes from the solo options but I'm not sure where? Thank you very much for your help.

Kind regards, Agshin

alexdobin commented 2 years ago

Hi Agshin,

the next step would be to map these files separately to see if the error occurs for any of them. The last files that you merged are more likely to contain the problem since the mapping goes almost to the end of the merged file.