ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0

ML Filter output: Missing files, inverted ROC curve x-axis. #301

Closed klenhart closed 3 months ago

klenhart commented 4 months ago

Is there an existing issue for this?

Have you loaded the SQANTI3.env conda environment?

Problem description

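Hello :) I have successfully applied the ML filter to my dataset. However, a few output files are missing. The missing files are:

  • TP_list.txt, TN_list.txt (I used custom TP and TN records, where the TN record had to be downsampled).
  • Filter report (I did not use the --skip_report option)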

I noticed that the example ML filter output also lists filtered gtf and fasta files, which are not mentioned in the ML filter output section on the "Running SQANTI3 filter" wiki page. Were these created by manually filtering the gtf and fasta files from the SQANTI3 QC output with the inclusion list file?

Sadly, I also can't make sense of the ROC curve. It seems like the x-axis has been inverted. testSet_ROC_curve.pdf.

Did I miss something?

Thank you in advance, Katharina

Code sample

/home/katharina/src/SQANTI3-5.2.1/sqanti3_filter.py ml /home/katharina/Data/sqanti/qc/ENCSR319VGI_classification.txt -p /home/katharina/Data/sqanti/ml_filter/RM_ids_TP.txt -n /home/katharina/Data/sqanti/ml_filter/NNC_ids_TN.txt -o filtered -d /home/katharina/Data/sqanti/ml_filter/

Error

No response

Anything else?

No response

aarzalluz commented 4 months ago

Hi @klenhart,

> • TP_list.txt, TN_list.txt (I used custom TP and TN records, where the TN record had to be downsampled).

As explained in the wiki, the TP and TN lists are generated only when no user-defined sets are provided, so that users can track which isoforms were selected internally. If you wish to have control over your TN set, I suggest you downsample it yourself before passing it to the ML filter.
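A minimal sketch of downsampling a TN list before handing it to the filter. It is not part of SQANTI3: the function names, the seed, and the assumption that each list holds one isoform ID per line are illustrative only.

```python
import random
from pathlib import Path


def read_id_list(path):
    """Read one isoform ID per line, skipping blank lines (assumed file layout)."""
    return [line.strip() for line in Path(path).read_text().splitlines() if line.strip()]


def write_id_list(ids, path):
    """Write one isoform ID per line."""
    Path(path).write_text("\n".join(ids) + "\n")


def downsample_ids(ids, target_size, seed=42):
    """Return a reproducible random subset of `ids` with at most `target_size` entries."""
    unique_ids = sorted(set(ids))  # sorting makes the result depend only on the seed
    if target_size >= len(unique_ids):
        return unique_ids
    rng = random.Random(seed)
    return sorted(rng.sample(unique_ids, target_size))


# Made-up IDs, for illustration only: shrink 1000 TN candidates to 200.
tn_candidates = [f"PB.{i}.1" for i in range(1000)]
tn_subset = downsample_ids(tn_candidates, 200)
print(len(tn_subset))
```

With real data, one would read both lists with read_id_list() (for example RM_ids_TP.txt and NNC_ids_TN.txt), call downsample_ids(tn, len(tp)), save the result with write_id_list(), and pass that file to -n.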

> • Filter report (I did not use the --skip_report option)

This is strange, and most likely indicates that the ML filter did not finish running. Could you check the log, or attach it, so we can see whether there are any warnings or error messages?

> I noticed that the example ML filter output also lists filtered gtf and fasta files, which are not mentioned in the ML filter output section on the "Running SQANTI3 filter" wiki page. Were these created by manually filtering the gtf and fasta files from the SQANTI3 QC output with the inclusion list file?

As indicated in the wiki, you need to provide the GTF and FASTA files (and basically any other QC output file that you wish to filter) when running the ML filter so that they are filtered using the generated inclusion list. There are specific arguments for each of these (see the documentation). Of note, we do not do this by default because users are advised to verify the filter results first (check that they make sense, optimize the parameter configuration and TP/TN sets, etc.) before taking this as the definitive filter for their data.
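For reference, the manual route asked about in the question could look roughly like the sketch below. It is an illustration rather than SQANTI3's own filtering code, and it assumes the inclusion list holds one isoform ID per line and that each FASTA header starts with that ID; the function name and the example records are made up.

```python
def filter_fasta(lines, keep_ids):
    """Yield the FASTA lines of every record whose ID (first word after '>') is in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in keep_ids
        if keep:
            yield line


# Made-up records, for illustration only.
records = [
    ">PB.1.1 gene1", "ACGT", "TTGA",
    ">PB.2.1 gene2", "GGCC",
    ">PB.3.1", "AAAA",
]
kept = list(filter_fasta(records, {"PB.1.1", "PB.3.1"}))
print("\n".join(kept))
```

The generator also works on an open file handle, since the line endings are preserved as read. The filter's dedicated arguments remain the documented route.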

> Sadly, I also can't make sense of the ROC curve. It seems like the x-axis has been inverted. testSet_ROC_curve.pdf.

The ROC curve is computed using the roc(), auc() and plot.roc() functions from the pROC R package. You may refer to their documentation for more information.
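On the inverted x-axis: to my knowledge, plot.roc() by default draws specificity on the x-axis, running from 1 down to 0, instead of 1 - specificity running from 0 to 1 (its legacy.axes argument switches between the two). The Python sketch below is neither the SQANTI3 nor the pROC code; it uses made-up scores and labels to show that both conventions place every point at the same horizontal position.

```python
def roc_points(scores, labels):
    """Return (specificity, sensitivity) pairs, one per threshold, for binary labels (1 = positive)."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    # A score at or above the threshold counts as a positive call.
    for threshold in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((1 - fp / neg, tp / pos))
    return points


# Made-up data, for illustration only: 1 = TP isoform, 0 = TN isoform.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for specificity, sensitivity in roc_points(scores, labels):
    # On an axis running from 1 (left) to 0 (right), a point sits at distance
    # (1 - specificity) from the left edge, which is the false positive rate
    # used on a conventional ROC plot.
    false_positive_rate = 1 - specificity
    print(f"specificity={specificity:.2f}  FPR={false_positive_rate:.2f}  sensitivity={sensitivity:.2f}")
```

So a curve drawn against decreasing specificity has the same shape as the familiar one drawn against increasing false positive rate; only the axis labels differ.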

Let me know if you find anything odd in the log so that I can help solve the issue!

Ángeles

klenhart commented 4 months ago

Hello @aarzalluz, thank you for your quick response.

I checked the log, and it reported that the ML filter finished successfully. However, while creating the report, it prints a warning about ggplot2 followed by an error:

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Loading required package: ggplot2
Warning message:
package ‘ggplot2’ was built under R version 4.3.3
Error in contrib.url(repos, type) :
  trying to use CRAN without setting a mirror
Calls: suppressMessages ... withCallingHandlers -> install.packages -> startsWith -> contrib.url
Execution halted

I guess this caused the problem. However, I followed the installation instructions on the wiki.

aarzalluz commented 3 months ago

Hi @klenhart,

Sorry about the late reply. Judging by the warning, it looks like the R installation in your conda environment and the R version under which ggplot2 was built may not be the same. In the YML for the environment, we force the ggplot2 version to be >= 3.4.0 because of some updates we made, replacing some deprecated features with their current replacements. Not sure if this could be causing the problem. Issue #304 mentions a similar error, so I guess this may require some looking into.

I am not actively working on SQANTI3 anymore, so I am unable to change the conda environment config, but it may be worth trying to update ggplot2 directly within the conda environment to see if that does the trick. It seems that the error you are getting has to do with R trying to update ggplot2, but being unable to fetch the package since you have not selected a CRAN mirror. Not sure if you searched existing issues, but this was already discussed in #259 and a fix was provided here.

Hope that helps!

Ángeles

carolinamonzo commented 3 months ago

The problem with CRAN happened because SQANTI_filter_report.R calls CRAN to install a color package if it is not already installed. This is now fixed by @alexpan00 in commit 96397a9, which sets the CRAN mirror before attempting the installation.