Closed rmenafra closed 2 years ago
Hi @rmenafra
It seems like at least one of the error is related to the input GFF having the exon start coordinate greater than the end coordinate
chr2 GRCh38 exon 233648233 233648232
Not sure if that led to the other error though What is the SQANTI3 command you are running?
Hi Liz, the GFF was output of the chain_samples with default parameters. The command was as follows:
#!/bin/bash
#SBATCH -J SQANTI3_qc
#SBATCH -D ./
#SBATCH -t unlimited
#SBATCH -p highmem,all
#SBATCH --mem=350000
#SBATCH --ntasks-per-node=2
INPUT_FA=../chain_results/all_samples.chained.rep.fasta
REF_GTF=annotation/gencode.v38.annotation.gtf
REF_FA=annotation/GRCh38.p13.genome.fa
COUNT=../chain_results/all_samples.chained_count.txt
source activate SQANTI3.conda_env
module load statistical/R/3.6.2/gcc.8.3.1
export PYTHONPATH=/exports/lgtc/projects/Roberta/cDNA_Cupcake/:/exports/lgtc/projects/Roberta/cDNA_Cupcake/sequence
python /exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py --cpus 2 $INPUT_FA $REF_GTF $REF_FA --fl_count $COUNT --aligner_choice=minimap2 --polyA_motif_list human.polyA.list.txt --cage_peak hg38.cage_peak_phase1and2combined_coord.bed -c intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified
Hi @rmenafra Can you check the isoform that has the offending coordinate?
chr2 GRCh38 exon 233648233 233648232
If you can trace where the original input came from (before chaining),....we probably need to solve it from there. -Liz
Before chaining the 6 samples, only 1 gff has a similar isoform:
chr2 PacBio exon 233648232 233648233 . + . gene_id "PB.7690"; transcript_id "PB.7690.196";
so start and end seem to be swapped compared to the error. In the chained.gff I have:
chr2 PacBio transcript 233636487 233773294 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2 PacBio exon 233636487 233636921 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2 PacBio exon 233648232 233648277 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2 PacBio exon 233682382 233682792 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2 PacBio exon 233767034 233767165 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2 PacBio exon 233767849 233767936 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2 PacBio exon 233768220 233768439 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2 PacBio exon 233772262 233773294 . + . gene_id "PB.7690"; transcript_id "PB.14154.188";
where only the offending isoform name (PB.14154.188) corresponds to the error.
I'm confused. The post-chaining transcript PB.14154.188
seems legit. So where did the offending PB.7690.196
come from?
In one of the pre-chaining gff (as you suggested to check before chaining) , but sqanti reports it as 'chr2 PacBio exon 233648233 233648232' whereas in the file before chaining is 'chr2 PacBio exon 233648232 233648233' (strand is the same).
ohhhhh!!! quick q then - did you run sqanti3_qc.py with fasta or GTF as input?
I gave the fasta as input (command attached before):
python /exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py --cpus 2 $INPUT_FA $REF_GTF $REF_FA --fl_count $COUNT --aligner_choice=minimap2 --polyA_motif_list human.polyA.list.txt --cage_peak hg38.cage_peak_phase1and2combined_coord.bed -c intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified
@rmenafra can you run it with --gtf all_samples.chained.gff
.
I strongly discourage anyone to run SQANTI3 with fasta input these days
I will try with --gtf and let you know. I would be curious to know why you would strongly discourage fasta input these days, I thought that .fa, .fq and .gtf were all included as an input in the documentation.
Hi @rmenafra let me know how it goes.
Yes in the early days of SQANTI we used input fasta and re-aligned it to generate the "corrected" fasta/GFF, but over time, this really presents a problem because someone could use very different aligners + parameters to arrive at the input fasta. This discrepancy between the input fasta and what you thought the GFF would be is exactly the issue we are seeing right now. In terms of tool dev, it is also bad form for SQANTI -- which is a tool about classifying isoforms NOT about maintaining the right set of aligners and parameters to use for different genomes --- to continue to accept input fasta. It needs to be separated so that it is on the user to provide the proper GFF.
Hope that makes sense. It is my goal to eventually gut the entire input fasta and only allow GFF input. -Liz
Hi Liz, thanks for the explanation, now it makes sense to me. After following your suggestions I still get an error, attached the full log slurm-2719992.out.zip
hi @rmenafra , the error calls from ORF prediction using Genemark. Which, by now, is probably 99% of the SQANTI3 errors ....and Genemark is a black box.
A few options:
--skipORF
optionall_samples.chained_corrected.fasta
and I can attempt to see why Genemark hates it. No guarantee I can fix it. I probably can't. But I'm thinking maybe I need to write a "safety net" portion of SQANTI3 to left certain sequences softly fail, if that is possible at all, for Genemark.-Liz
After multiple attempts of running SQANTI without having an out of memory error from our slurm cluster, I've got the following standard output:
In this case I'm not sure if the error is related to an invalid feature coordinate or still some sort of memory issue (for the "terminate called after throwing an instance of 'std::bad_alloc'). If it's relative to an invalid feature coordinate, what's the file causing it?
My input fasta is about 53Gb and I'm allocating a total of 700Gb over 2 CPUs. All my previous attempts were killed by the slurm cluster because they were exceeding the allocated memory (I tried with 10cpus and 68Gb each for a total of 680Gb). Any feedback would be appreciated.
Thanks Roberta