bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

issue with STAR alignment crashes salmon #3671

Open mjsduncan opened 2 years ago

mjsduncan commented 2 years ago

this isn't really a bcbio problem, but i thought i would give a heads up: using the bcbio 1.9.0 bulk RNA-seq pipeline with the BDGP6 genome, salmon crashes because some transcript lengths in the reference fasta don't match those in the STAR-aligned bam file. this has been seen at least once before, and i posted the code details to an open issue thread in the STAR repo: https://github.com/alexdobin/STAR/issues/1140
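A minimal sketch of how one might look for the length mismatch described above, assuming the transcript FASTA and the BAM header have been read into strings (the header text could come from `samtools view -H` on the transcriptome BAM). The function names and the example transcript names are hypothetical and are not part of bcbio, STAR, or salmon.

```python
from typing import Dict, List, Tuple


def fasta_lengths(fasta_text: str) -> Dict[str, int]:
    """Return {sequence name: length} for FASTA-formatted text."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the sequence name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def sam_header_lengths(header_text: str) -> Dict[str, int]:
    """Return {reference name: length} from the @SQ lines of a SAM header."""
    lengths: Dict[str, int] = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each tab-separated field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def length_mismatches(
    fasta: Dict[str, int], header: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """List (name, fasta_length, header_length) for shared names whose lengths differ."""
    return sorted(
        (name, fasta[name], header[name])
        for name in fasta.keys() & header.keys()
        if fasta[name] != header[name]
    )


# Hypothetical example data: tx2 is 3 bases in the FASTA but 5 in the header.
example_fasta = ">tx1 gene=a\nACGT\nAC\n>tx2\nACG\n"
example_header = "@HD\tVN:1.6\n@SQ\tSN:tx1\tLN:6\n@SQ\tSN:tx2\tLN:5\n"
mismatches = length_mismatches(
    fasta_lengths(example_fasta), sam_header_lengths(example_header)
)
```

Any transcript reported here has a different length in the FASTA than in the BAM header, which is the kind of inconsistency this issue is about.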

i couldn't turn off salmon in the yaml file with a tools_off line, so to get the run to complete i ran a fast RNA-seq analysis on the samples without an aligner and then linked its output into the work/salmon directory of the original run. should i be concerned about any effects on downstream analysis because STAR has been removed from the input pipeline to salmon?

naumenko-sa commented 2 years ago

Hi @mjsduncan! Thanks for the heads up! We have run several projects with the BDGP reference using several different transcriptome annotations.

Sometimes installing a custom transcriptome reference (see https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#adding-custom-genomes) reveals issues with the gtf/gff file, such as incorrect entries. Those happen, and sometimes they need to be found and removed.

cc @Gammerdinger who recently went through a similar issue.

@mjsduncan what reference did you use - our standard in bcbio (https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/BDGP6/transcripts.yaml) or a custom one?

Salmon makes a pseudo-alignment that is not based on the STAR bam file, unless quantify_genome_alignments: true is specified (see https://bcbio-nextgen.readthedocs.io/en/latest/contents/bulk_rnaseq.html#parameters). You can definitely generate counts without using this option.

SN

naumenko-sa commented 1 year ago

Hi @mjsduncan !

We also hit this problem in one of our projects and solved it by using gffread to generate ref-transcripts.full.fa from the genome reference and ref-transcripts.gtf (in genome/Celegans/WBCel235 in our case; it was also a custom genome). It indeed gives longer sequences per transcript and is compatible with STAR transcriptome alignments.
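A rough sketch of what such a gffread call could look like when driven from Python. The exact command used in that project is not given in this thread, so this is an assumption: it relies on gffread's `-w` (write spliced transcript sequences to a FASTA file) and `-g` (genome FASTA) options, and the helper names and paths are hypothetical.

```python
import subprocess
from typing import List


def gffread_command(genome_fa: str, gtf: str, out_fa: str) -> List[str]:
    """Build a gffread command that extracts transcript sequences.

    -w OUT   write a FASTA file with the spliced exons of each transcript
    -g FASTA genome sequence the annotation coordinates refer to
    """
    return ["gffread", "-w", out_fa, "-g", genome_fa, gtf]


def build_transcript_fasta(genome_fa: str, gtf: str, out_fa: str) -> None:
    """Run gffread (must be on PATH); raises CalledProcessError on failure."""
    subprocess.run(gffread_command(genome_fa, gtf, out_fa), check=True)


# Hypothetical paths; adjust to the actual genome directory layout.
cmd = gffread_command(
    "seq/WBCel235.fa", "rnaseq/ref-transcripts.gtf", "rnaseq/ref-transcripts.full.fa"
)
```

Building the command as a list and passing it to `subprocess.run` without a shell avoids quoting problems with unusual file names.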

SN

mjsduncan commented 1 year ago

thanks for the response, @naumenko-sa! i'll give this a shot. will this be included in an update of the cloudbiolinux recipes?