SpatialTranscriptomicsResearch / st_pipeline

ST Pipeline contains the tools and scripts needed to process and analyze the raw files generated with the Spatial Transcriptomics method in FASTQ format.

Segmentation fault during dataset creation step #126

Open JackCollora opened 2 years ago

JackCollora commented 2 years ago

Hello, I'm having an issue running this pipeline with a new genome index. This install works very well when aligning to human, but with a STAR index/genome/GTF file for another species it fails after logging:

INFO:STPipeline: Starting creating dataset 2021-12-13 17:18:47.819617

This is what it outputs to standard out:

[bam_sort_core] merging from 0 files and 20 in-memory blocks...
/var/spool/slurmd/job21176315/slurm_script: line 57: 71249 Segmentation fault st_pipeline_run.py --output-folder $OUTPUT --ids $ID --ref-map $MAP --ref-annotation $ANN --expName $sample --htseq-no-ambiguous --verbose --log-file $OUTPUT/${sample}_log.txt --demultiplexing-kmer 5 --threads 20 --temp-folder $TMP_ST --no-clean-up --umi-start-position 16 --umi-end-position 26 --demultiplexing-overhang 0 --min-length-qual-trimming 20 $FW $RV

Thus far I've tried running the mapping/counting steps (STAR and HTSeq) outside of the pipeline, and they run without error in that context.

Here is the complete log:

INFO:STPipeline:ST Pipeline 1.8.1
INFO:STPipeline:Output directory: /gpfs/ysm/project/ya-chi_ho/kma57/sample_dir_000006867/Sample_NTC/output
INFO:STPipeline:Temporary directory: /gpfs/ysm/project/ya-chi_ho/kma57/sample_dir_000006867/Sample_NTC/output/tmp
INFO:STPipeline:Dataset name: NTC
INFO:STPipeline:Forward(R1) input file: /gpfs/ysm/project/ya-chi_ho/kma57/sample_dir_000006867/Sample_NTC/tmp/NTC_R2_processed.fastq
INFO:STPipeline:Reverse(R2) input file: /gpfs/ysm/project/ya-chi_ho/kma57/sample_dir_000006867/Sample_NTC/tmp/NTC_R1_filtered.fastq.gz
INFO:STPipeline:Reference mapping STAR index folder: /gpfs/ysm/home/kma57/genome/RM_SIV/STAR
INFO:STPipeline:Reference annotation file: /gpfs/ysm/home/kma57/genome/RM_SIV/GCF_003339765.1_Mmul_10_genomic.gtf
INFO:STPipeline:CPU Nodes: 20
INFO:STPipeline:Ids(barcodes) file: /gpfs/ysm/home/kma57/genome/spatial_barcodes.txt
INFO:STPipeline:TaggD allowed mismatches: 2
INFO:STPipeline:TaggD kmer size: 5
INFO:STPipeline:TaggD overhang: 0
INFO:STPipeline:TaggD metric: Subglobal
INFO:STPipeline:Mapping reverse trimming: 0
INFO:STPipeline:Mapping inverse reverse trimming: 0
INFO:STPipeline:Mapping tool: STAR
INFO:STPipeline:Mapping minimum intron size allowed (splice alignments) with STAR: 1
INFO:STPipeline:Mapping maximum intron size allowed (splice alignments) with STAR: 1
INFO:STPipeline:STAR genome loading strategy NoSharedMemory
INFO:STPipeline:Annotation tool: HTSeq
INFO:STPipeline:Annotation mode: intersection-nonempty
INFO:STPipeline:Annotation strandness yes
INFO:STPipeline:UMIs start position: 16
INFO:STPipeline:UMIs end position: 26
INFO:STPipeline:UMIs allowed mismatches: 1
INFO:STPipeline:UMIs clustering algorithm: AdjacentBi
INFO:STPipeline:Allowing an offset of 250 when clustering UMIs by strand-start in a gene-spot
INFO:STPipeline:Allowing 6 low quality bases in an UMI
INFO:STPipeline:Discarding reads that after trimming are shorter than 20
INFO:STPipeline:Removing polyA sequences of a length of at least: 10
INFO:STPipeline:Removing polyT sequences of a length of at least: 10
INFO:STPipeline:Removing polyG sequences of a length of at least: 10
INFO:STPipeline:Removing polyC sequences of a length of at least: 10
INFO:STPipeline:Removing polyN sequences of a length of at least: 10
INFO:STPipeline:Allowing 0 mismatches when removing homopolymers
INFO:STPipeline:Remove reads whose AT content is 90%
INFO:STPipeline:Remove reads whose GC content is 90%
INFO:STPipeline:Starting the pipeline: 2021-12-13 16:36:29.608163
INFO:STPipeline:Start filtering raw reads 2021-12-13 16:36:29.627480
INFO:STPipeline:Trimming stats total reads (pair): 81470284
INFO:STPipeline:Trimming stats 4122973 reads have been dropped!
INFO:STPipeline:Trimming stats you just lost about 5.06% of your data
INFO:STPipeline:Trimming stats reads remaining: 77347311
INFO:STPipeline:Trimming stats dropped pairs due to incorrect UMI: 0
INFO:STPipeline:Trimming stats dropped pairs due to low quality UMI: 121432
INFO:STPipeline:Trimming stats dropped pairs due to high AT content: 2105513
INFO:STPipeline:Trimming stats dropped pairs due to high GC content: 39
INFO:STPipeline:Trimming stats dropped pairs due to presence of artifacts: 1778429
INFO:STPipeline:Trimming stats dropped pairs due to being too short: 117560
INFO:STPipeline:Starting genome alignment 2021-12-13 17:01:37.963875
INFO:STPipeline:Mapping stats:
INFO:STPipeline:Mapping stats are computed from all the pair reads present in the raw files
INFO:STPipeline: Uniquely mapped reads number | 663018
INFO:STPipeline: Uniquely mapped reads % | 0.86%
INFO:STPipeline: Number of reads mapped to multiple loci | 139153
INFO:STPipeline: % of reads mapped to multiple loci | 0.18%
INFO:STPipeline: % of reads unmapped: too short | 98.73%
INFO:STPipeline:Total mapped reads: 802171
INFO:STPipeline:Starting barcode demultiplexing 2021-12-13 17:16:42.503838
INFO:STPipeline:Demultiplexing Mapping stats:
INFO:STPipeline:# Total reads: 802171
INFO:STPipeline:# Total reads written: 718743
INFO:STPipeline:# Ambiguous matches: 10508 [1.309945136386132%]
INFO:STPipeline:# - Non-unique ambiguous matches: 23405
INFO:STPipeline:# Unmatched: 12272 [1.529848373975125%]
INFO:STPipeline:Starting annotation 2021-12-13 17:17:03.172980
INFO:STPipeline:Annotated reads: 480326
INFO:STPipeline:Starting creating dataset 2021-12-13 17:18:47.819617

Any suggestions are appreciated.

jfnavarro commented 2 years ago

Hi Jack! Thanks for your email.

First of all, I suggest that you work with the most updated version of the repository at https://github.com/jfnavarro/st_pipeline

Regarding your error: the "create dataset" step can be memory-heavy, though that should not be a problem given what I see in the log. Does it output anything to standard error? Have you tried a different UMI clustering algorithm? You could also increase the memory limit for the Slurm job just to be sure; otherwise I would suggest looking at "annotated.bam" and running the create dataset step manually (I can guide you) to determine what is going on. The standard error output may also provide useful information for debugging.
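For reference, one quick way to sanity-check "annotated.bam" is to stream it with pysam (a minimal sketch, not part of the pipeline; pysam is a st_pipeline dependency, the path assumes the --no-clean-up temp folder, and the assumption that the annotation step stores the assigned gene in the XF tag should be verified against your version):

  # Sketch: stream annotated.bam end to end to check that it is readable.
  # Assumptions: path relative to a --no-clean-up run's output folder, and
  # the assigned gene stored in the XF tag (verify for your version).
  import pysam

  bam_path = "output/tmp/annotated.bam"  # hypothetical path, adjust to your temp folder

  n_total = 0
  n_untagged = 0
  with pysam.AlignmentFile(bam_path, "rb") as bam:
      for rec in bam.fetch(until_eof=True):
          n_total += 1
          if not rec.has_tag("XF"):
              n_untagged += 1

  # A truncated BAM raises an exception before this point; a clean pass
  # suggests the segfault happens inside the dataset-creation code itself.
  print(f"records: {n_total}, missing XF tag: {n_untagged}")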

Best, Jose

JackCollora commented 2 years ago

Hi Jose,

I'll try updating the installation to the most recent repository.

This is the standard err (my mistake labeling it as standard out):

[bam_sort_core] merging from 0 files and 20 in-memory blocks...
/var/spool/slurmd/job21176315/slurm_script: line 57: 71249 Segmentation fault st_pipeline_run.py --output-folder $OUTPUT --ids $ID --ref-map $MAP --ref-annotation $ANN --expName $sample --htseq-no-ambiguous --verbose --log-file $OUTPUT/${sample}_log.txt --demultiplexing-kmer 5 --threads 20 --temp-folder $TMP_ST --no-clean-up --umi-start-position 16 --umi-end-position 26 --demultiplexing-overhang 0 --min-length-qual-trimming 20 $FW $RV

Nothing is printed to standard out.

For the Slurm job we've gone up to 190 GB and received the same error. Watching the job with top did not show any usage above ~20 GB.

How can we go about running the create dataset step manually?

Best, Jack


jfnavarro commented 2 years ago

I really do not think it is related to memory, and I guess it is not related to I/O either? I would first try a different UMI clustering algorithm, and if that gives the same result you can run the createDataset step manually by importing it (defined in dataset.py and used by stpipeline.core.pipeline) and then calling this function:

  from stpipeline.common.stats import qa_stats

  createDataset("path/to/annotated.bam",
                qa_stats,  # Passed as reference
                self.ref_annotation,
                self.umi_cluster_algorithm,
                self.umi_allowed_mismatches,
                self.umi_counting_offset,
                self.disable_umi,
                self.output_folder,
                self.expName,
                True)  # Verbose

Just update the parameters accordingly.
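For a standalone run outside the pipeline object, the self.* attributes can be replaced with the values printed at the top of the log. A hedged sketch (the import path of createDataset and the exact argument list should be double-checked against the installed st_pipeline version; the paths and parameter values are taken from the log above):

  # Standalone sketch of the manual createDataset call. Module path and
  # argument order follow the snippet above; verify both against your
  # installed st_pipeline version before running.
  from stpipeline.common.stats import qa_stats
  from stpipeline.common.dataset import createDataset  # assumed location of createDataset

  createDataset(
      "output/tmp/annotated.bam",  # annotated BAM kept by --no-clean-up (adjust)
      qa_stats,                    # passed as reference, updated in place
      "/gpfs/ysm/home/kma57/genome/RM_SIV/GCF_003339765.1_Mmul_10_genomic.gtf",  # ref annotation
      "AdjacentBi",                # UMI clustering algorithm (from the log)
      1,                           # UMI allowed mismatches
      250,                         # UMI counting offset
      False,                       # disable_umi: UMIs are enabled in this run
      "output",                    # output folder (adjust)
      "NTC",                       # expName / dataset name
      True,                        # verbose
  )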