GriffinBINF commented 10 months ago

Issue Description: I am experiencing a segmentation fault when using STARsolo (version 2.7.11a) to process large single-cell RNA sequencing datasets from axolotl leukocytes. The datasets are paired 10x Genomics FASTQ files from the SRA with accession IDs SRR10445716 to SRR10445723.

System and Resource Allocation:

STAR Version: 2.7.11a
System: High-performance compute cluster
Resources Allocated: 500 GB memory and 64 cores per task (also tried increasing to 800 GB)

Input Data:

Data Type: Axolotl leukocyte scRNAseq
Data Size: Each run is approximately 84-128 GB (unzipped)
Goal: Obtain gene counts and Velocyto splicing information

Error Details:

The segmentation fault occurs post-mapping, as indicated by the Log.out file which ends with successful mapping completion.
Error Log Snippet: line 8: 59023 Segmentation fault "${cmd}" "$@"
Memory Utilization at Fault: VmPeak: 298123108 kB; VmRSS: 295204524 kB
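For scale, the two figures above can be converted from kB to GiB. This is only a minimal sketch; the helper name `kb_to_gib` is made up for illustration and is not part of STAR or any cluster tooling.

```shell
#!/bin/sh
# Convert a kB figure (as reported in the VmPeak/VmRSS lines) to GiB,
# rounded to one decimal place. kb_to_gib is a hypothetical helper name.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / (1024 * 1024) }'
}

echo "VmPeak: $(kb_to_gib 298123108) GiB"
echo "VmRSS:  $(kb_to_gib 295204524) GiB"
```

Both values come out below 300 GiB, which is under the 500 GB allocation, so the crash does not look like a simple case of hitting the memory limit. This alone does not rule out a memory-related cause.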

Troubleshooting Attempts:

Updated STAR from version 2.7.6a to 2.7.11a
Increased memory allocation to 800 GB
Modified parameters: tried various combinations, including toggling --twopassMode Basic, toggling --soloMultiMappers, and adding/removing --soloFeatures arguments

Request for Assistance: I am seeking advice on resolving this segmentation fault. Are there any known issues with STARsolo handling large datasets, or could there be specific parameter adjustments that might mitigate this error? Any insights or suggestions for workarounds would be greatly appreciated. As an alternative, I'm considering aligning with STAR and then using Velocyto or another tool for gene and splicing counts, but I am open to recommendations.

GriffinBINF commented 10 months ago

Update: I have been able to run STAR successfully with some STARsolo options. It aligns the reads to the genome and creates the solo.out folder, along with the barcodes.tsv file and a mostly empty matrix file. This leads me to believe that the segfault is not caused by write permissions or a non-existent output file for the counts.

Here is my current command:

Remove soloFeatures

STAR --genomeDir ${GENOME_DIR} \
    --runThreadN 64 \
    --readFilesIn ${RAW_DATA_DIR}/${ACCESSION}/${ACCESSION}_2.fastq ${RAW_DATA_DIR}/${ACCESSION}/${ACCESSION}_1.fastq \
    --outFileNamePrefix ${OUTPUT_DIR}/${ACCESSION}/ \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMunmapped Within \
    --outSAMattributes Standard \
    --soloType CB_UMI_Simple \
    --soloCBwhitelist whitelist.txt

This is somewhat helpful to my workflow because I can at least run velocyto manually on the bam and barcode outputs. The major barrier now is that I have been so far unable to run cellranger count directly on the files because for whatever reason cellranger is not accepting my gtf transcriptome file. This particular error is outside the scope of this help request, but I think context around my overall workflow could be helpful.

GriffinBINF commented 10 months ago

Another quick update: I ran it on human data and it worked fine, producing all of the expected soloFeatures outputs.

alexdobin commented 9 months ago

Hi @GriffinBINF

This looks like a bug with Velocyto calculation for a non-trivial genome... Could you please send me the Log.out file for the failed run?

GriffinBINF commented 9 months ago

Hi Alex,

Thank you so much for looking into this for me. Here is the log file for the most recent run: Log.out.txt

I am investigating other potential causes like the architecture of the cluster I am using since some collaborators were able to obtain the counts using their own server.

Also, here is the sole error message from the .err file: line 8: 243470 Segmentation fault "${cmd}" "$@"

Please let me know if you can determine anything.

Best, Griffin

alexdobin commented 9 months ago

Hi Griffin,

I did not see anything suspicious in the Log.out.txt file. If the same job ran successfully on a different server, it may indeed be a problem with the cluster.

Cheers Alex
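One way to start comparing the cluster with a server where the job succeeds is to record a few basic environment facts on both machines and diff the results. The sketch below is only an illustration: the function name `snapshot` and the choice of fields are assumptions, and nothing in this thread establishes that any of these settings is responsible for the segfault.

```shell
#!/bin/sh
# Print a small environment snapshot; run on both machines and diff the output.
# snapshot is a hypothetical helper name, and the fields are only a starting point.
snapshot() {
    echo "kernel: $(uname -sr)"
    echo "arch: $(uname -m)"
    echo "stack_limit_kb: $(ulimit -s)"   # per-process stack size limit
    echo "core_limit: $(ulimit -c)"       # 0 means no core dump is written on a crash
    echo "vmem_limit_kb: $(ulimit -v)"    # virtual memory limit for the shell
}

snapshot
```

Saving the output with, for example, `sh snapshot.sh > cluster.txt` on one machine and `sh snapshot.sh > server.txt` on the other makes the comparison a one-line `diff cluster.txt server.txt`.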

GriffinBINF commented 9 months ago

Hi Alex,

It's unfortunate that it was not something more obvious. If it isn't too much trouble, do you know some good ways I could continue to troubleshoot and pin down where the issue is occurring on the cluster?

Additionally, I was reading the documentation, and it suggested reaching out to you/the team for jobs involving very large or very small genomes. Do you have any parameter suggestions for the axolotl genome that would differ from the default settings?

Thank you very much for your help.

Cheers, Griffin

alexdobin commented 8 months ago

Hi Griffin,

The genome index looks fine. I would recommend removing some parameters to see where the problem comes from. I would start with --twopassMode Basic.
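The suggestion above, removing parameters one at a time, can be sketched as a dry run that prints one candidate command per dropped option group. This is an illustration only: the full failing command was not posted in this thread, so the optional groups and their values (`Basic`, `EM`, `Gene Velocyto`) are assumptions, `build_without` is a made-up helper name, and the genome, input, and output options are left out for brevity.

```shell
#!/bin/sh
# Dry run: prints candidate STAR commands instead of executing them.

# Options kept in every run (taken from the command posted earlier in this thread).
BASE="STAR --runThreadN 64 --soloType CB_UMI_Simple --soloCBwhitelist whitelist.txt"

# Option groups to drop one at a time. The values are assumptions for illustration.
OPTIONAL_GROUPS='--twopassMode Basic
--soloMultiMappers EM
--soloFeatures Gene Velocyto'

# Print the command containing every optional group except the one given as $1.
build_without() {
    cmd="$BASE"
    while IFS= read -r group; do
        [ "$group" = "$1" ] || cmd="$cmd $group"
    done <<EOF
$OPTIONAL_GROUPS
EOF
    printf '%s\n' "$cmd"
}

# One candidate command per dropped group.
printf '%s\n' "$OPTIONAL_GROUPS" | while IFS= read -r drop; do
    printf '# without: %s\n' "$drop"
    build_without "$drop"
done
```

If the segfault disappears only when one particular group is dropped, that group is the place to look first. Starting with `--twopassMode Basic`, as suggested, corresponds to the first printed command.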

alexdobin / STAR

Segmentation Fault with STARsolo on Large scRNAseq Dataset #1993
