ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0
198 stars 49 forks source link

SQANTI_qc error #60

Closed rmenafra closed 2 years ago

rmenafra commented 3 years ago

After multiple attempts of running SQANTI without having an out of memory error from our slurm cluster, I've got the following standard output:

[continue after the minimap2 log]
[M::main] Real time: 36935.240 sec; CPU: 66543.089 sec; Peak RSS: 18.821 GB
Error: invalid feature coordinates (end<start!) at line:
chr2    GRCh38  exon    233648233       233648232       .       +       .       ID=PB.14154.188.exon2;Name=PB.14154.188.exon2;Parent=PB.14154.188
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
GeneMarkS: error on last system call, error code 134
Abort program!!!
Cleaning up isoform IDs...
Cleaned up isoform fasta file written to: /exports/lgtc/projects/Roberta/projects/Isoseq3/Noel/Project_202003799/collapse/merged/files_to_chain/chain_results/all_samples.chained.rep.renamed.fasta
output written to 
Write arguments to /exports/lgtc/projects/Roberta/projects/Isoseq3/Noel/Project_202003799/collapse/merged/files_to_chain/chain_results/sqanti3/all_samples.chained.rep.params.txt...
**** Running SQANTI3...
**** Parsing provided files....
Reading genome fasta /exports/lgtc/projects/Roberta/projects/Isoseq3/annotation/GRCh38.p13.genome.fa....
****Aligning reads with Minimap2...
**** Predicting ORF sequences...
Traceback (most recent call last):
  File "/exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py", line 2334, in <module>
    main()
  File "/exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py", line 2318, in main
    run(args)
  File "/exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py", line 1745, in run
    orfDict = correctionPlusORFpred(args, genome_dict)
  File "/exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py", line 581, in correctionPlusORFpred
    if subprocess.check_call(cmd, shell=True, cwd=gmst_dir)!=0:
  File "/exports/lgtc/projects/Roberta/miniconda3/envs/SQANTI3.conda_env/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'perl /exports/lgtc/projects/Roberta/SQANTI3/utilities/gmst/gmst.pl -faa --strand direct --fnn --output /exports/lgtc/projects/Roberta/projects/Isoseq3/Noel/Project_202003799/collapse/merged/files_to_chain/chain_results/sqanti3/GMST/GMST_tmp /exports/lgtc/projects/Roberta/projects/Isoseq3/Noel/Project_202003799/collapse/merged/files_to_chain/chain_results/sqanti3/all_samples.chained.rep_corrected.fasta' returned non-zero exit status 1.

In this case I'm not sure if the error is related to an invalid feature coordinate or still some sort of memory issue (for the "terminate called after throwing an instance of 'std::bad_alloc'). If it's relative to an invalid feature coordinate, what's the file causing it?

My input fasta is about 53Gb and I'm allocating a total of 700Gb over 2 CPUs. All my previous attempts were killed by the slurm cluster because they were exceeding the allocated memory (I tried with 10cpus and 68Gb each for a total of 680Gb). Any feedback would be appreciated.

Thanks Roberta

Magdoll commented 3 years ago

Hi @rmenafra

It seems like at least one of the error is related to the input GFF having the exon start coordinate greater than the end coordinate

chr2    GRCh38  exon    233648233       233648232   

Not sure if that led to the other error though What is the SQANTI3 command you are running?

rmenafra commented 3 years ago

Hi Liz, the GFF was output of the chain_samples with default parameters. The command was as follows:

#!/bin/bash
#SBATCH -J SQANTI3_qc
#SBATCH -D ./
#SBATCH -t unlimited
#SBATCH -p highmem,all
#SBATCH --mem=350000
#SBATCH --ntasks-per-node=2

INPUT_FA=../chain_results/all_samples.chained.rep.fasta
REF_GTF=annotation/gencode.v38.annotation.gtf
REF_FA=annotation/GRCh38.p13.genome.fa
COUNT=../chain_results/all_samples.chained_count.txt
source activate SQANTI3.conda_env
module load statistical/R/3.6.2/gcc.8.3.1
export PYTHONPATH=/exports/lgtc/projects/Roberta/cDNA_Cupcake/:/exports/lgtc/projects/Roberta/cDNA_Cupcake/sequence

python /exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py --cpus 2  $INPUT_FA $REF_GTF $REF_FA --fl_count $COUNT --aligner_choice=minimap2 --polyA_motif_list human.polyA.list.txt  --cage_peak hg38.cage_peak_phase1and2combined_coord.bed -c intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified
Magdoll commented 3 years ago

Hi @rmenafra Can you check the isoform that has the offending coordinate?

chr2    GRCh38  exon    233648233       233648232   

If you can trace where the original input came from (before chaining),....we probably need to solve it from there. -Liz

rmenafra commented 3 years ago

Before chaining the 6 samples, only 1 gff has a similar isoform:

chr2    PacBio  exon    233648232   233648233   .   +   .   gene_id "PB.7690"; transcript_id "PB.7690.196";

so start and end seem to be swapped compared to the error. In the chained.gff I have:

chr2    PacBio  transcript  233636487   233773294   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2    PacBio  exon    233636487   233636921   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2    PacBio  exon    233648232   233648277   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2    PacBio  exon    233682382   233682792   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2    PacBio  exon    233767034   233767165   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2    PacBio  exon    233767849   233767936   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2    PacBio  exon    233768220   233768439   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";
chr2    PacBio  exon    233772262   233773294   .   +   .   gene_id "PB.7690"; transcript_id "PB.14154.188";

where only the offending isoform name (PB.14154.188) corresponds to the error.

Magdoll commented 3 years ago

I'm confused. The post-chaining transcript PB.14154.188 seems legit. So where did the offending PB.7690.196 come from?

rmenafra commented 3 years ago

In one of the pre-chaining gff (as you suggested to check before chaining) , but sqanti reports it as 'chr2 PacBio exon 233648233 233648232' whereas in the file before chaining is 'chr2 PacBio exon 233648232 233648233' (strand is the same).

Magdoll commented 3 years ago

ohhhhh!!! quick q then - did you run sqanti3_qc.py with fasta or GTF as input?

rmenafra commented 3 years ago

I gave the fasta as input (command attached before):

python /exports/lgtc/projects/Roberta/SQANTI3/sqanti3_qc.py --cpus 2  $INPUT_FA $REF_GTF $REF_FA --fl_count $COUNT --aligner_choice=minimap2 --polyA_motif_list human.polyA.list.txt  --cage_peak hg38.cage_peak_phase1and2combined_coord.bed -c intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified
Magdoll commented 3 years ago

@rmenafra can you run it with --gtf all_samples.chained.gff.

I strongly discourage anyone to run SQANTI3 with fasta input these days

rmenafra commented 3 years ago

I will try with --gtf and let you know. I would be curious to know why you would strongly discourage fasta input these days, I thought that .fa, .fq and .gtf were all included as an input in the documentation.

Magdoll commented 3 years ago

Hi @rmenafra let me know how it goes.

Yes in the early days of SQANTI we used input fasta and re-aligned it to generate the "corrected" fasta/GFF, but over time, this really presents a problem because someone could use very different aligners + parameters to arrive at the input fasta. This discrepancy between the input fasta and what you thought the GFF would be is exactly the issue we are seeing right now. In terms of tool dev, it is also bad form for SQANTI -- which is a tool about classifying isoforms NOT about maintaining the right set of aligners and parameters to use for different genomes --- to continue to accept input fasta. It needs to be separated so that it is on the user to provide the proper GFF.

Hope that makes sense. It is my goal to eventually gut the entire input fasta and only allow GFF input. -Liz

rmenafra commented 3 years ago

Hi Liz, thanks for the explanation, now it makes sense to me. After following your suggestions I still get an error, attached the full log slurm-2719992.out.zip

Magdoll commented 3 years ago

hi @rmenafra , the error calls from ORF prediction using Genemark. Which, by now, is probably 99% of the SQANTI3 errors ....and Genemark is a black box.

A few options:

  1. Run SQANTI3 without ORF. using --skipORF option
  2. Send me the fasta file all_samples.chained_corrected.fasta and I can attempt to see why Genemark hates it. No guarantee I can fix it. I probably can't. But I'm thinking maybe I need to write a "safety net" portion of SQANTI3 to left certain sequences softly fail, if that is possible at all, for Genemark.

-Liz