Investigate different STAR parameters

kopardev commented 2 years ago

STAR is missing alignments to some key RNAs (e.g. Rnu6). These are found hundreds of times in the genome, which may be why STAR misses them: if a read aligns to too many loci, STAR reports it as unmapped.

Parameters to try, from https://github.com/alexdobin/STAR/issues/1372:

--outFilterMultimapNmax 10000 \
--winAnchorMultimapNmax 10000 \
--seedPerReadNmax 10000 \
--seedPerWindowNmax 10000 \

and

--alignWindowsPerReadNmax default: 10000 int>0: max number of windows per read
--alignTranscriptsPerReadNmax default: 10000 int>0: max number of different alignments per read to consider
--winAnchorMultimapNmax default: 50 int>0: max number of loci anchors are allowed to map to
--seedMultimapNmax default: 10000 int>0: only pieces that map fewer than this value are utilized in the stitching procedure
--seedPerReadNmax default: 1000 int>0: max number of seeds per read
--seedPerWindowNmax default: 50 int>0: max number of seeds per window
--seedNoneLociPerWindow default: 10 int>0: max number of one seed loci per window
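To see at a glance which of the suggested values actually depart from the defaults quoted in this comment, a small comparison can help. The following is only a sketch: the dictionaries copy the numbers from the comment (they are not re-checked against a particular STAR release), and `changed_from_default` is a hypothetical helper name, not part of STAR or of run_star.sh.

```python
# Defaults exactly as quoted in the comment above (not verified per STAR version).
QUOTED_DEFAULTS = {
    "alignWindowsPerReadNmax": 10000,
    "alignTranscriptsPerReadNmax": 10000,
    "winAnchorMultimapNmax": 50,
    "seedMultimapNmax": 10000,
    "seedPerReadNmax": 1000,
    "seedPerWindowNmax": 50,
    "seedNoneLociPerWindow": 10,
}

# Values suggested in the comment ("Parameters to try").
PROPOSED = {
    "outFilterMultimapNmax": 10000,
    "winAnchorMultimapNmax": 10000,
    "seedPerReadNmax": 10000,
    "seedPerWindowNmax": 10000,
}


def changed_from_default(proposed, defaults):
    """Map each proposed parameter that differs from its quoted default
    to a (quoted_default, proposed_value) pair.

    The quoted default is None when the comment does not list one.
    """
    return {
        name: (defaults.get(name), value)
        for name, value in proposed.items()
        if defaults.get(name) != value
    }


if __name__ == "__main__":
    for name, (old, new) in changed_from_default(PROPOSED, QUOTED_DEFAULTS).items():
        shown = "not listed" if old is None else old
        print(f"--{name}: quoted default {shown}, proposed {new}")
```

The default for outFilterMultimapNmax is not listed in the comment, so the sketch reports it as "not listed" rather than filling in a value.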

slsevilla commented 2 years ago

Completed testing of various STAR parameters while investigating issues with overhangs and gaps. All code is available in build/STAR_testing.

Download reference files

Download all reference files from source: run_star.sh with flag flag_download="Y"

Index reference files

Create indexed reference files for use during alignment: run_star.sh with flag flag_index="Y"

Use file to create STAR params listing

Update /data/RBL_NCI/Wolin/mES_fclip_1_YL_012122/alignment_analysis/star/docs/Star_variations.txt
run_star.sh with flag flag_variables="Y" to create the variable_set.txt file, which has all of the STAR variables needed

Run STAR alignment with partial sample

Using input FLAG_Ro_fclip.dedup.si.Sptan1.unique.fastq, run STAR variables:
1A 1c 2b 2d 2f 2h ClipSeqTools_v2 1b 2a 2c 2e 2g 2i ClipSeqTools_v1 clipv1_double_raiseq Encode1
run_star.sh with flag flag_align_partial="Y" (done on an interactive node)

Run analysis of gaps in these samples:
run_star.sh with flag flag_gap_partial="Y"
cigar_plotting_partial.Rmd
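The gap analysis itself lives in run_star.sh and the Rmd reports, which are not reproduced in this thread. As a rough illustration of the kind of information a CIGAR string carries for such an analysis, here is a minimal sketch that totals each CIGAR operation and lists the lengths of `N` operations (skipped reference regions, i.e. gaps) in an alignment. The function names are hypothetical and nothing here is taken from the project's code.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_op_lengths(cigar):
    """Total bases per CIGAR operation, e.g. {"M": 50, "N": 500}."""
    totals = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
    return dict(totals)


def gap_lengths(cigar):
    """Lengths of all N operations (skipped reference regions) in order."""
    return [int(length) for length, op in CIGAR_OP.findall(cigar) if op == "N"]


if __name__ == "__main__":
    example = "10S30M500N20M"  # made-up CIGAR string for illustration
    print(cigar_op_lengths(example))
    print(gap_lengths(example))
```

Soft-clipped bases (`S`) are tallied separately from matches, which may be useful when looking at read ends alongside gaps.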

Run STAR alignment with complete sample

Using input /data/RBL_NCI/Wolin/mES_fclip_1_YL_012122/01_preprocess/FLAG_Ro_fclip_filtered.fastq, run STAR variables:
2e 2i 2j ClipSeqTools_v1 2h ClipSeqTools_v2
run_star.sh with flag flag_align_complete="Y" (sends swarm to cluster)

Run analysis of gaps in these samples: run_star.sh with flag flag_gap_complete="N"

Run analysis of alignment stats in these samples:
run_star.sh with flag flag_align_stats="N"
cigar_plotting_complete_v1.Rmd

Run the iCLIP pipeline on these samples; save to the complete_sample/pipeline dir

Run STAR alignment with rnu6 sample

Using input /data/RBL_NCI/Wolin/mES_fclip_1_YL_012122/01_preprocess/FLAG_Ro_fclip_filtered.fastq, subset the sample for rnu6 reads:
Used IGV session to randomly select geneIDs and saved them to text file rnu6_readids.txt
Removed the @ from the read IDs for subsetting of the novo BAM file and saved to text file rnu6_readids_piccard.txt
run_star.sh with flag flag_subset_rnu="Y"
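The read-ID cleanup step can be sketched as below. This assumes the step means stripping the leading `@` that FASTQ headers carry, since read names in a BAM file have no `@`; the thread does not show how run_star.sh does it. The helper name is hypothetical, and keeping only the first whitespace-separated token of each line is an added assumption.

```python
def strip_fastq_prefix(lines):
    """Turn FASTQ-style header lines into bare read names.

    Assumptions (not confirmed by the thread): one ID per line, an
    optional leading "@", and any text after the first whitespace is a
    description that should be dropped. Blank lines are skipped.
    """
    names = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        name = stripped.split()[0]
        if name.startswith("@"):
            name = name[1:]
        names.append(name)
    return names


if __name__ == "__main__":
    # Made-up IDs for illustration only.
    raw = ["@read_0001 extra\n", "read_0002\n", "\n", "@read_0003\n"]
    print("\n".join(strip_fastq_prefix(raw)))
```

A list in this form is the kind of input that Picard's FilterSamReads accepts through its read-list option, which is presumably why the output file name mentions Picard.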

Run alignment of rnu6: run_star.sh with flag flag_align_rnu="Y"

Run analysis of alignment stats in these samples: run_star.sh with flag flag_align_stats_rnu="N"

Run STAR alignment with complete sample

Expand complete sample testing with additional variables:
original: 2e 2i 2j ClipSeqTools_v1 2h ClipSeqTools_v2
added: clipv1_double 2e_double clipv1_triple
Add new def "expanded" to include additional variables not previously tested
run_star.sh with flag flag_align_complete="Y", edited for only the necessary variables (sends swarm to cluster)

Run analysis of gaps in these samples:
run_star.sh with flag flag_gap_complete="Y"
cigar_plotting_complete_v2.Rmd

Run the iCLIP pipeline on these samples; save to the complete_sample/pipeline_v2 dir

Because the cluster was down, create a subset BAM to transfer files to desktop:
Subsetting for Rnu6, Sptan1, sympk, Gm24204
run_star.sh with flag flag_subset_bam_multiple_genes="Y"

Run STAR alignment with partial sample

Subsetting for Rnu6, Sptan1, sympk, Gm24204, GAPDH, ACTB

Used IGV session to randomly select geneIDs for the original subset (Rnu6, Sptan1, sympk, Gm24204), then gene ranges to select all genes for GAPDH and ACTB, as requested by Marco
run_star.sh with flag flag_subset_fq_multiple_genes="Y" to submit job (930) to the cluster to create one FQ with all genes
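Building one FASTQ from a list of selected read IDs can be sketched in plain Python as follows. This is not the code behind flag_subset_fq_multiple_genes, which is not shown in the thread; it assumes simple four-line FASTQ records, and the function name is hypothetical.

```python
import io


def filter_fastq(fastq_in, wanted_ids, fastq_out):
    """Copy records whose read ID is in wanted_ids; return how many were kept.

    Assumes strict 4-line FASTQ records (header, sequence, "+", qualities).
    The read ID is the header's first token without the leading "@".
    """
    kept = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0].strip():
            break  # end of file (or a trailing blank line)
        read_id = record[0].strip()[1:].split()[0]
        if read_id in wanted_ids:
            fastq_out.writelines(record)
            kept += 1
    return kept


if __name__ == "__main__":
    # Tiny made-up FASTQ for illustration only.
    source = io.StringIO("@r1 x\nACGT\n+\nIIII\n@r2\nGGGG\n+\nIIII\n")
    target = io.StringIO()
    print(filter_fastq(source, {"r2"}, target))
    print(target.getvalue(), end="")
```

Passing the IDs as a set keeps each lookup constant-time, which matters when the read list for several genes is combined into one file.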

Run alignment with variables (expanded_gene_list): clipfinal_10 clipfinal_12 clipfinal_6 clipfinal_8 clipfinal_11 clipfinal_5 clipfinal_7 clipfinal_9
run_star.sh with flag flag_align_overhang="Y" (submits to cluster)

Run analysis of gaps in these samples:
run_star.sh with flag flag_gap_overhang="Y"
cigar_plotting_partial_v2.Rmd

slsevilla commented 2 years ago

Parameter testing code and output reports were added with commit 5ff1175fefd14bd9ea5a904726999e8599a871ba

slsevilla commented 2 years ago

Based on the above results, STAR pipeline default parameters were added to the workflow via the snakemake_config.yaml file (commit 6afd36da159ec8f20c607e866292848f5bc8ade4)

alignEndsType: "Local"
alignIntronMax: 50000
alignSJDBoverhangMin: 3 # minimum overhang value for annotated spliced junctions
alignSJoverhangMin: 5 # minimum overhang value for unannotated spliced junctions
alignTranscriptsPerReadNmax: 10000
alignWindowsPerReadNmax: 10000
outFilterMatchNmin: 15
outFilterMatchNminOverLread: 0.9
outFilterMismatchNmax: 999
outFilterMismatchNoverReadLmax: 0.04
outFilterMultimapNmax: 10000
outFilterMultimapScoreRange: 0
outFilterScoreMin: 0
outFilterType: "Normal"
outSAMattributes: "All"
outSAMunmapped: "None"
outSJfilterCountTotalMin: "3 1 1 1"
outSJfilterOverhangMin: "30 12 12 12"
outSJfilterReads: "All"
seedMultimapNmax: 10000
seedNoneLociPerWindow: 20
seedPerReadNmax: 10000
seedPerWindowNmax: 10000
sjdbScore: 2
winAnchorMultimapNmax: 10000
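How the workflow turns these config entries into a STAR command line is not shown in this thread. As a rough sketch of one way it could be done, the snippet below converts a config mapping into `--name value` arguments, splitting multi-value strings such as "3 1 1 1" into separate tokens. The function name is hypothetical, and only a few of the values above are repeated in the example.

```python
def config_to_star_args(config):
    """Flatten {parameter: value} into a STAR-style argument list.

    Each key becomes "--key"; the value is converted to text and split on
    whitespace so that "30 12 12 12" yields four separate arguments.
    """
    args = []
    for name, value in config.items():
        args.append(f"--{name}")
        args.extend(str(value).split())
    return args


if __name__ == "__main__":
    # A few entries copied from the config above, for illustration.
    subset = {
        "alignEndsType": "Local",
        "outFilterMatchNminOverLread": 0.9,
        "outSJfilterCountTotalMin": "3 1 1 1",
        "winAnchorMultimapNmax": 10000,
    }
    print(" ".join(config_to_star_args(subset)))
```

Returning a list rather than a single string lets the arguments be passed to a process launcher without shell quoting concerns.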

slsevilla commented 2 years ago

When the STAR parameters were run with complete project samples, errors were noted with two parameters: seedPerWindowNmax and winAnchorMultimapNmax. Discussed in issue.

Default settings were changed for both parameters with commit d485f583aff422e4434de1edd57ba7ee37bee0fa.

slsevilla commented 2 years ago

References used to determine parameters:

All HTML markdown reports are located in the build/STAR directory

STAR has been implemented in version 2.0 of the pipeline

NCI-RBL / iCLIP