ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0

Errors in running sqanti3_filter.py #259

Closed: Upendra19993 closed this issue 7 months ago

Upendra19993 commented 7 months ago

Hi all,

I am running SQANTI3, starting with the example dataset you provide. When I run the filtering step, I get an error and several warning messages. Could you please have a look and help me resolve this issue? I have copied the complete output for your reference; the error messages are at the end.

(base) [uqwwijes@bun101 SQANTI3_output_original_names_after_reinstallation2]$ sqanti3_filter.py ml UHR_chr22_classification.txt

Rscript (R) version 4.3.1 (2023-06-16)

Output directory not defined. All the outputs will be stored at /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2 directory

Output name not defined. All the outputs will have the prefix UHR_chr22

Write arguments to /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_params.txt...

Running SQANTI3 filtering...

/sw/local/rocky8/noarch/qcif/software/miniconda3/envs/sqanti3_5.2/bin/Rscript /sw/local/rocky8/noarch/qcif/software/SQANTI3-5.2/utilities/filter/SQANTI3_MLfilter.R -c /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_classification.txt -o UHR_chr22 -d /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2 -t 0.8 -j 0.7 -i 60 -f False -e False -m False -z 3000

     SQANTI3 Machine Learning filter

CURRENT ML FILTER PARAMETERS:

[1] "sqanti_classif: /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_classification.txt"
[2] "output: UHR_chr22"
[3] "dir: /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2"
[4] "percent_training: 0.8"
[5] "threshold: 0.7"
[6] "intrapriming: 60"
[7] "force_fsm_in: FALSE"
[8] "force_multi_exon: FALSE"
[9] "intermediate_files: FALSE"
[10] "max_class_size: 3000"
[11] "help: FALSE"

    INITIAL ML CHECKS:

Reading SQANTI3 *_classification.txt file...

Checking data for mono and multi-exon transcripts...

     ***Note: ML filter can only be applied to multi-exon transcripts.

     3338 multi-exon transcript isoforms found in SQ3 classification file.

Checking input data for True Positive (TP) and True Negative (TN) sets...

    Warning message:
     Training set not provided -will be created from input data.

Using Novel Not In Catalog non-canonical isoforms as True Negatives for training.

     - Total NNC non-canonical isoforms: 288

Not enough (< 250) Reference Match transcript isoforms among FSM, all FSM transcripts will be used as Positive set.

     - Total FSM isoforms: 506

Balancing number of isoforms in TP and TN sets...

    Minimum set size: 288 transcripts.

    Sampled 288 transcripts to define final TP and TN sets.

Wrote generated TP and TN lists to files:

    /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_TP_list.txt

    /scratch/project_mnt/S0030/upendra/Sqanti3/Exampla_data/SQANTI3_output_original_names_after_reinstallation2/UHR_chr22_TN_list.txt

    ML DATA PREPARATION:

Aggregating FL counts across samples (if more than one sample is provided)...

Replacing NAs with appropriate values for ML...

Handling factor columns...

Handling integer columns...

Removing variables with near-zero variance...

Removed columns:
[1] "chrom"              "RTS_stage"          "n_indels"
[4] "n_indels_junc"      "dist_to_CAGE_peak"  "within_CAGE_peak"
[7] "dist_to_polyA_site" "within_polyA_site"  "polyA_dist"

Removing highly correlated features... (correlation threshold = 0.9).

All correlations <= 0.9

    List of removed features:
    No features removed.

    RANDOM FOREST ALGORITHM RUN:

Creating positive and negative sets for classifier training and testing...

Finished creating training data set.

Partitioning data into training and test sets...

    Proportion of the data to be used for training: 0.8

Description of the training set:

    Positive and negative transcript isoforms in training set:

full-splice_match novel_not_in_catalog
              231                  231

    Positive and negative transcript isoforms in test set:

full-splice_match novel_not_in_catalog
               57                   57


Training Random Forest Classifier...

    ***Note: this can take up to several hours.

Pre-defined Random Forest parameters (supplied to caret::trainControl()):

Loading required package: ggplot2
Loading required package: lattice

Random forest training finished.

Saved generated classifier to randomforest.RData file.


Random forest evaluation: applying classifier to test set...

Test set evaluation results:

AUC, Sensitivity and Specificity on test set:
      ROC      Sens      Spec
0.9713758 0.7719298 0.9824561

Writing summary to testSet_summary.txt file.

Confusion matrix:
          Reference
Prediction POS NEG
       POS  44   1
       NEG  13  56

Writing confusion matrix and statistics to output files: testSet_confusionMatrix.txt testSet_stats.txt

Global variable importance in Random Forest classifier:
                         Overall
min_cov               35.7541546
min_sample_cov        35.5178859
bite                  27.7809271
gene_exp              18.3663935
sd_cov                17.7682316
predicted_NMD         14.7867100
iso_exp               10.6466531
diff_to_gene_TSS       9.1437091
ratio_TSS              8.2086366
FSM_class              7.5753832
length                 6.9421819
diff_to_gene_TTS       5.9650142
exons                  5.8164596
perc_A_downstream_TTS  5.2488334
ratio_exp              0.6994311
coding                 0.5164620

Variable importance table saved as classifier_variable-importance_table.txt

Calculating and printing test set ROC curves...
Setting levels: control = 1, case = 2
Setting direction: controls > cases
Setting levels: control = 1, case = 2
Setting direction: controls < cases

ROC curves saved to testSet_ROC_curve.pdf file. Includes:


Applying Random Forest classifier to input dataset...

Random forest prediction finished successfully!

Random forest classification results:

Negative Positive
    1648     1690

Warning message:
package ‘ggplot2’ was built under R version 4.3.2


Applying intra-priming filter to our dataset.

Intra-priming filtered transcripts:

FALSE  TRUE
 3213   712


Writing filter results to classification file...

    Wrote filter results (ML and intra-priming) to new classification table:
    UHR_chr22_MLresult_classification.txt file.

    Wrote isoform list (classified as non-artifacts by both ML and intra-priming
    filters) to UHR_chr22_inclusion-list.txt file

SUMMARY OF MACHINE LEARNING + INTRA-PRIMING FILTERS:

Artifact  Isoform
    2177     1748


SQANTI3 ML filter finished successfully!



     SQANTI3 Machine Learning filter report

Loading required package: magrittr

Reading ML result classification table...

Reading classifier variable importance table...
Rows: 16 Columns: 2
── Column specification ────────
Delimiter: "\t"
chr (1): variable
dbl (1): importance

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Reading ML filter parameters...
Rows: 53 Columns: 2
── Column specification ────────
Delimiter: "\t"
chr (2): parameter, value

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Reading ML performance statistics...
Rows: 18 Columns: 2
── Column specification ────────
Delimiter: "\t"
chr (1): metric
dbl (1): value

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Rows: 4 Columns: 3
── Column specification ────────
Delimiter: "\t"
chr (2): Prediction, Reference
dbl (1): Freq

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Warning message:
There were 2 warnings in dplyr::mutate().
The first warning was:
ℹ In argument: structural_category =%>%(...).
Caused by warning:
! Unknown levels in f: genic_intron
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.

Loading required package: ggplot2
Warning message:
package ‘ggplot2’ was built under R version 4.3.2

Warning in install.packages("RColorConesa") :
  'lib = "/sw/local/rocky8.6/noarch/qcif/software/miniconda3/envs/sqanti3_5.2/lib/R/library"' is not writable
Error in install.packages("RColorConesa") : unable to install packages
Calls: suppressMessages -> withCallingHandlers -> install.packages
Execution halted

(base) [uqwwijes@bun101 SQANTI3_output_original_names_after_reinstallation2]$

Many thanks, Upendra.

carolinamonzo commented 7 months ago

Hi @Upendra19993, unfortunately it seems you are missing a small R package that is used for coloring the resulting plots. When you installed SQANTI3, did you create the conda environment "SQANTI3.env"? Creating the environment installs all dependencies at the versions SQANTI3 needs. (You can check how to do this in the SQANTI3 documentation: https://github.com/ConesaLab/SQANTI3/wiki/Dependencies-and-installation#2-creating-the-conda-environment )

It seems you are running SQANTI3 from your "base" environment. You should either create the SQANTI3 environment and then run "conda activate SQANTI3.env", or install the RColorConesa package (https://cran.r-project.org/web/packages/RColorConesa/index.html ) in your base environment.
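Editor's note: one way to tell which of the two options applies is to ask the R installation on your PATH whether it can load the package. The snippet below is only a sketch. The helper name `r_package_available` is hypothetical, it assumes `Rscript` is on PATH, and the environment name in the usage comment is taken from the shell prompt shown later in this thread.

```shell
# Hypothetical helper: returns 0 if the Rscript on PATH can load the named
# R package, non-zero otherwise. It does not install anything.
r_package_available() {
    pkg="$1"
    # requireNamespace() returns TRUE/FALSE without attaching the package;
    # quit(status = ...) turns that into the exit code of Rscript.
    Rscript -e "quit(status = if (requireNamespace('${pkg}', quietly = TRUE)) 0 else 1)"
}

# Typical use (not run here; environment name is an assumption):
#   conda activate SQANTI3.env
#   r_package_available RColorConesa && echo "found" || echo "missing"
```

If the check reports the package as missing inside the activated environment, the environment was probably not created from the SQANTI3 environment file, or a different R is first on PATH.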

Upendra19993 commented 7 months ago

Hi carolinamonzo,

No, I didn't create the conda environment "SQANTI3.env" when I installed SQANTI3.

I have now installed SQANTI3 in the conda environment "SQANTI3.env" and rerun the filtering step. The previous error about the missing RColorConesa package is gone, but I now get warnings about ggplot2 and an error about accessing CRAN to install or load packages. The messages are below.

Warning message:
There were 2 warnings in dplyr::mutate().
The first warning was:
ℹ In argument: structural_category =%>%(...).
Caused by warning:
! Unknown levels in f: genic_intron
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.

Loading required package: ggplot2
Warning message:
package ‘ggplot2’ was built under R version 4.3.2

Error in contrib.url(repos, type) :
  trying to use CRAN without setting a mirror
Calls: suppressMessages ... withCallingHandlers -> install.packages -> startsWith -> contrib.url
Execution halted

(SQANTI3.env) [uqwwijes@bunya3 SQANTI3-5.2]$

I get all the output files, but I am not sure whether they are accurate given the warning messages. Could you please have a look and suggest how to resolve this issue?

Many thanks, Upendra.

carolinamonzo commented 7 months ago

Hi @Upendra19993, the warnings are not worrisome. I'll update the SQANTI3 installation steps so the warning doesn't appear. In your case, it installed from the cloud since the source wasn't specified. You can go ahead and continue with your analysis; the warnings you found have not affected your data.

Best, Carolina.

Upendra19993 commented 7 months ago

Many thanks, carolinamonzo!

CaiCheng1996 commented 4 months ago

I ran into the same problem. It seems it has not been fixed in the newest SQANTI3-5.2?

mayunlong89 commented 3 months ago

@CaiCheng1996 Try this (open .Rprofile from R, then add the options line to it):

file.edit(".Rprofile")
options(repos = c(CRAN = "https://cloud.r-project.org"))

That did the trick, amazing.
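Editor's note: on a cluster without an interactive editor, the same setting can be added from the shell. This is a minimal sketch. The function name `set_cran_mirror` is hypothetical, and it assumes R reads `~/.Rprofile` at startup (R reads a `.Rprofile` in the current working directory in preference to the one in the home directory). The mirror URL is the one given in the comment above.

```shell
# Hypothetical helper: append the CRAN mirror option to an .Rprofile file
# unless that exact line is already present. Defaults to ~/.Rprofile.
set_cran_mirror() {
    rprofile="${1:-$HOME/.Rprofile}"
    line='options(repos = c(CRAN = "https://cloud.r-project.org"))'
    touch "$rprofile"
    # Make sure existing content ends with a newline before appending.
    if [ -s "$rprofile" ] && [ -n "$(tail -c 1 "$rprofile")" ]; then
        printf '\n' >> "$rprofile"
    fi
    # -F matches the line as a fixed string, so the parentheses are literal.
    if ! grep -qF -- "$line" "$rprofile"; then
        printf '%s\n' "$line" >> "$rprofile"
    fi
}
```

Calling `set_cran_mirror` with no argument edits `~/.Rprofile`; passing a path edits that file instead. Running it twice leaves a single copy of the line.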