Closed mad-scientist-in-training closed 2 years ago
Hi @mad-scientist-in-training,
From your log and SQANTI3 command, there's a couple of things I think could be going wrong.
First, it seems that you are using /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa
as your genome file, however, these are only the transcript sequences. SQANTI3 expects a genome file, therefore, you should be using the "Genome sequence, primary assembly (GRCm39)" file from GENCODE.
Then there's also this warning:
WARNING: ref annotation contains chromosomes not in genome:
chr3,chrY,chr18,chr6,chr14,chr15,chr13,chrM,chr4,chrX,chr1,chr10,chr9,chr12,chr17,chr5,chr16,chr11,chr7,chr19,chr2,chr8
If anything, I would expect to see patch/haplotype IDs there, but in this case it has all chromosomes listed. I believe it has to do with no chr indicators being present in the transcripts.fa file that you supplied, since it is not a genome file.
Also, note that, even though it is not mandatory, we recommend feeding your transcriptome to SQANTI3 as a GTF, instead of a FASTA file.
I hope that helps, but if not, feel free to write back!
Ángeles
Ohhhh that makes a lot of sense!
I ran it with the correct files and it was a success!! Thank you so much for your help :D
Hi,
I'm running SQANTI3 on a mouse transcript PacBio dataset and I keep running into the following issue:
[+] Loading minimap2, version 2.24... R scripting front-end version 4.1.3 (2022-03-10) Cleaning up isoform IDs... Cleaned up isoform fasta file written to: /data/lakecm/transcriptome_assembly/mouse/combined_corrected_collapsed_mouse.collapsed.rep.renamed.fasta [M::mm_idx_gen::7.9861.74] collected minimizers [M::mm_idx_gen::9.0592.60] sorted minimizers [M::main::9.1022.59] loaded/built the index for 142379 target sequence(s) [M::mm_mapopt_update::10.1342.43] mid_occ = 82 [M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 142379 [M::mm_idx_stat::10.7312.35] distinct minimizers: 32148497 (48.92% are singletons); average occurrences: 2.544; average spacing: 2.961; total length: 242202044 [M::worker_pipeline::18.0095.28] mapped 39193 sequences [M::main] Version: 2.24-r1122 [M::main] CMD: minimap2 -ax splice --secondary=no -C5 -uf -t 10 /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa /data/lakecm/transcriptome_assembly/mouse/combined_corrected_collapsed_mouse.collapsed.rep.renamed.fasta [M::main] Real time: 18.276 sec; CPU: 95.293 sec; Peak RSS: 3.474 GB output written to WARNING: ref annotation contains chromosomes not in genome: chr3,chrY,chr18,chr6,chr14,chr15,chr13,chrM,chr4,chrX,chr1,chr10,chr9,chr12,chr17,chr5,chr16,chr11,chr7,chr19,chr2,chr8
Parsing Isoforms.... RT-switching computation.... Full-length read abundance files not provided. Isoforms expression files not provided. Writing output files.... Generating SQANTI3 report.... Loading required package: Biobase Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following object is masked from ‘package:gridExtra’:
The following objects are masked from ‘package:stats’:
The following objects are masked from ‘package:base’:
Welcome to Bioconductor
Loading required package: splines Loading required package: Matrix
Attaching package: ‘Matrix’
The following object is masked from ‘package:reshape’:
Attaching package: ‘plyr’
The following objects are masked from ‘package:reshape’:
Attaching package: ‘plotly’
The following objects are masked from ‘package:plyr’:
The following object is masked from ‘package:reshape’:
The following object is masked from ‘package:ggplot2’:
The following object is masked from ‘package:stats’:
The following object is masked from ‘package:graphics’:
Attaching package: ‘dplyr’
The following objects are masked from ‘package:plyr’:
The following object is masked from ‘package:Biobase’:
The following objects are masked from ‘package:BiocGenerics’:
The following object is masked from ‘package:gridExtra’:
The following object is masked from ‘package:reshape’:
The following objects are masked from ‘package:stats’:
The following objects are masked from ‘package:base’:
summarise()
has grouped output by 'lenCat'. You can override using the.groups
argument. Error in[<-.data.frame
(*tmp*
, which(x$diff_to_gene_TSS <= 50), "Annotation", : replacement has 1 row, data has 0 Calls: [<- -> [<-.data.frame Execution halted Write arguments to /data/lakecm/transcriptome_assembly/mouse/output/sqanti_output.params.txt... Running SQANTI3... Parsing provided files.... Reading genome fasta /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa.... Aligning reads with Minimap2... Predicting ORF sequences... Parsing Reference Transcriptome.... Splice Junction Coverage files not provided. TSS ratio will not be calculated since SR information was not provided **** Performing Classification of Isoforms.... Number of classified isoforms: 39413 Traceback (most recent call last): File "/gpfs/gsfs8/users/lakecm/SQANTI3/sqanti3_qc.py", line 2461, inInterestingly I've run the exact same script with human files and have successfully completed the SQANTI3 run - so I'm wondering if there is an issue with the Mouse transcriptome reference fasta and gtf files? This is where I obtained the mouse files: https://www.gencodegenes.org/mouse/ This is the complete script I'm running:
!/bin/bash
source /data/lakecm/conda/etc/profile.d/conda.sh conda activate SQANTI3.env module load minimap2 export PATH=$PATH:/data/lakecm/SQANTI3/utilities/gtfToGenePred export PYTHONPATH=$PYTHONPATH:/data/lakecm/SQANTI3/cDNA_Cupcake/ export PYTHONPATH=$PYTHONPATH:/data/lakecm/SQANTI3/cDNA_Cupcake/sequence python sqanti3_qc.py -o sqanti_output --aligner_choice minimap2 -d /data/lakecm/transcriptome_assembly/mouse/output --report pdf --isoAnnotLite --fasta /data/lakecm/transcriptome_assembly/mouse/combined_corrected_collapsed_mouse.collapsed.rep.fa /data/lakecm/STAR/mouse/gencode.vM29.annotation.gtf /data/lakecm/STAR/mouse/gencode.vM29.transcripts.fa
Any help you can provide with this issue will be greatly appreciated!
Best,
Camille