mad-scientist-in-training commented 2 years ago

Hi,

I'm running SQANTI3 on a mouse transcript PacBio dataset and I keep running into the following issue:

[+] Loading minimap2, version 2.24... R scripting front-end version 4.1.3 (2022-03-10) Cleaning up isoform IDs... Cleaned up isoform fasta file written to: /data/lakecm/transcriptome_assembly/mouse/combined_corrected_collapsed_mouse.collapsed.rep.renamed.fasta [M::mm_idx_gen::7.9861.74] collected minimizers [M::mm_idx_gen::9.0592.60] sorted minimizers [M::main::9.1022.59] loaded/built the index for 142379 target sequence(s) [M::mm_mapopt_update::10.1342.43] mid_occ = 82 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 142379 [M::mm_idx_stat::10.7312.35] distinct minimizers: 32148497 (48.92% are singletons); average occurrences: 2.544; average spacing: 2.961; total length: 242202044 [M::worker_pipeline::18.0095.28] mapped 39193 sequences [M::main] Version: 2.24-r1122 [M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 10 /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa /data/lakecm/transcriptome_assembly/mouse/combined_corrected_collapsed_mouse.collapsed.rep.renamed.fasta [M::main] Real time: 18.276 sec; CPU: 95.293 sec; Peak RSS: 3.474 GB output written to WARNING: ref annotation contains chromosomes not in genome: chr3,chrY,chr18,chr6,chr14,chr15,chr13,chrM,chr4,chrX,chr1,chr10,chr9,chr12,chr17,chr5,chr16,chr11,chr7,chr19,chr2,chr8

Parsing Isoforms.... RT-switching computation.... Full-length read abundance files not provided. Isoforms expression files not provided. Writing output files.... Generating SQANTI3 report.... Loading required package: Biobase Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following object is masked from ‘package:gridExtra’:

combine

The following objects are masked from ‘package:stats’:

IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, basename, cbind, colnames,
dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: splines Loading required package: Matrix

Attaching package: ‘Matrix’

The following object is masked from ‘package:reshape’:

expand

Attaching package: ‘plyr’

The following objects are masked from ‘package:reshape’:

rename, round_any

Attaching package: ‘plotly’

The following objects are masked from ‘package:plyr’:

arrange, mutate, rename, summarise

The following object is masked from ‘package:reshape’:

rename

The following object is masked from ‘package:ggplot2’:

last_plot

The following object is masked from ‘package:stats’:

filter

The following object is masked from ‘package:graphics’:

layout

Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize

The following object is masked from ‘package:Biobase’:

combine

The following objects are masked from ‘package:BiocGenerics’:

combine, intersect, setdiff, union

The following object is masked from ‘package:gridExtra’:

combine

The following object is masked from ‘package:reshape’:

rename

The following objects are masked from ‘package:stats’:

filter, lag

The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union

summarise() has grouped output by 'lenCat'. You can override using the .groups argument. Error in [<-.data.frame(*tmp*, which(x$diff_to_gene_TSS <= 50), "Annotation", : replacement has 1 row, data has 0 Calls: [<- -> [<-.data.frame Execution halted Write arguments to /data/lakecm/transcriptome_assembly/mouse/output/sqanti_output.params.txt... Running SQANTI3... Parsing provided files.... Reading genome fasta /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa.... Aligning reads with Minimap2... Predicting ORF sequences... Parsing Reference Transcriptome.... Splice Junction Coverage files not provided. TSS ratio will not be calculated since SR information was not provided **** Performing Classification of Isoforms.... Number of classified isoforms: 39413 Traceback (most recent call last): File "/gpfs/gsfs8/users/lakecm/SQANTI3/sqanti3_qc.py", line 2461, in main() File "/gpfs/gsfs8/users/lakecm/SQANTI3/sqanti3_qc.py", line 2444, in main run(args) File "/gpfs/gsfs8/users/lakecm/SQANTI3/sqanti3_qc.py", line 2053, in run if subprocess.check_call(cmd, shell=True)!=0: File "/data/lakecm/conda/envs/SQANTI3.env/lib/python3.9/subprocess.py", line 373, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/data/lakecm/conda/envs/SQANTI3.env/bin/Rscript /gpfs/gsfs8/users/lakecm/SQANTI3/utilities/SQANTI3_report.R /data/lakecm/transcriptome_assembly/mouse/output/sqanti_output_classification.txt /data/lakecm/transcriptome_assembly/mouse/output/sqanti_output_junctions.txt /data/lakecm/transcriptome_assembly/mouse/output/sqanti_output.params.txt /gpfs/gsfs8/users/lakecm/SQANTI3/utilities False pdf' returned non-zero exit status 1.

Interestingly I've run the exact same script with human files and have successfully completed the SQANTI3 run - so I'm wondering if there is an issue with the Mouse transcriptome reference fasta and gtf files? This is where I obtained the mouse files: https://www.gencodegenes.org/mouse/ This is the complete script I'm running:

!/bin/bash

source /data/lakecm/conda/etc/profile.d/conda.sh conda activate SQANTI3.env module load minimap2 export PATH=$PATH:/data/lakecm/SQANTI3/utilities/gtfToGenePred export PYTHONPATH=$PYTHONPATH:/data/lakecm/SQANTI3/cDNA_Cupcake/ export PYTHONPATH=$PYTHONPATH:/data/lakecm/SQANTI3/cDNA_Cupcake/sequence python sqanti3_qc.py -o sqanti_output --aligner_choice minimap2 -d /data/lakecm/transcriptome_assembly/mouse/output --report pdf --isoAnnotLite --fasta /data/lakecm/transcriptome_assembly/mouse/combined_corrected_collapsed_mouse.collapsed.rep.fa /data/lakecm/STAR/mouse/gencode.vM29.annotation.gtf /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa

Any help you can provide with this issue will be greatly appreciated!

Best,

Camille

aarzalluz commented 2 years ago

Hi @mad-scientist-in-training,

From your log and SQANTI3 command, there's a couple of things I think could be going wrong.

First, it seems that you are using /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa as your genome file, however, these are only the transcript sequences. SQANTI3 expects a genome file, therefore, you should be using the "Genome sequence, primary assembly (GRCm39)" file from GENCODE.

Then there's also this warning:

WARNING: ref annotation contains chromosomes not in genome: 
chr3,chrY,chr18,chr6,chr14,chr15,chr13,chrM,chr4,chrX,chr1,chr10,chr9,chr12,chr17,chr5,chr16,chr11,chr7,chr19,chr2,chr8

If anything, I would expect to see patch/haplotype IDs there, but in this case it has all chromosomes listed. I believe it has to do with no chr indicators being present in the transcripts.fa file that you supplied, since it is not a genome file.

Also, note that, even though it is not mandatory, we recommend feeding your transcriptome to SQANTI3 as a GTF, instead of a FASTA file.

I hope that helps, but if not, feel free to write back!

Ángeles

mad-scientist-in-training commented 2 years ago

Ohhhh that makes a lot of sense!

I ran it with the correct files and it was a success!! Thank you so much for your help :D

ConesaLab / SQANTI3

Issue with Mouse transcriptome CalledProcessError - returned non-zero exit status 1 #147

!/bin/bash