PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License
249 stars 44 forks source link

ValueError: pbfusion - <visualize_fusion.py> #587

Closed jinhys closed 1 year ago

jinhys commented 1 year ago

Operating system Which operating system and version are you using?

cat /etc/os-release

NAME="Rocky Linux"
VERSION="8.7 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.7 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.7"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"

Package name Which package / tool is causing the problem? Which version are you using, use tool --version. Have you updated to the latest version conda update package? Have you updated the complete env by running conda update --all? Have you ensured that your channel priorities are set up according to the bioconda recommendations at https://bioconda.github.io/#set-up-channels?

pbfusion - <_visualize_fusion.py_>

Conda environment What is the result of conda list? (Try to paste that between triple backticks.)

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
bzip2                     1.0.8                h7b6447c_0
c-ares                    1.19.0               h5eee18b_0
ca-certificates           2023.01.10           h06a4308_0
certifi                   2022.12.7        py37h06a4308_0
curl                      7.88.1               h5eee18b_0
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
icu                       67.1                 he1b5a44_0    conda-forge
isoseq3                   3.8.2                h9ee0642_0    bioconda
kiwisolver                1.4.4            py37h6a678d5_0
krb5                      1.19.4               h568e23c_0
ld_impl_linux-64          2.38                 h1181459_1
libblas                   3.9.0           15_linux64_openblas    conda-forge
libcblas                  3.9.0           15_linux64_openblas    conda-forge
libcurl                   7.88.1               h91b91d3_0
libdeflate                1.0                  h14c3975_1    bioconda
libedit                   3.1.20221030         h5eee18b_0
libev                     4.33                 h7f8727e_1
libffi                    3.4.2                h6a678d5_6
libgcc-ng                 11.2.0               h1234567_1
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   11.2.0               h1234567_1
liblapack                 3.9.0           15_linux64_openblas    conda-forge
libnghttp2                1.46.0               hce63b2e_0
libopenblas               0.3.20          pthreads_h78a6416_0    conda-forge
libpng                    1.6.39               h5eee18b_0
libssh2                   1.10.0               h8f2d780_0
libstdcxx-ng              11.2.0               h1234567_1
matplotlib                3.2.2                         1    conda-forge
matplotlib-base           3.2.2            py37h1d35a4c_1    conda-forge
ncurses                   6.4                  h6a678d5_0
numpy                     1.21.6           py37h976b520_0    conda-forge
openssl                   1.1.1t               h7f8727e_0
pbfusion                  0.1.0                hdfd78af_0    bioconda
pip                       22.3.1           py37h06a4308_0
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
pysam                     0.15.3           py37hda2845c_1    bioconda
python                    3.7.16               h7a1cb2a_0
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.7                     2_cp37m    conda-forge
readline                  8.2                  h5eee18b_0
setuptools                65.6.3           py37h06a4308_0
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.41.2               h5eee18b_0
tk                        8.6.12               h1ccaba5_0
tornado                   6.1              py37h540881e_3    conda-forge
typing_extensions         4.5.0              pyha770c72_0    conda-forge
wheel                     0.38.4           py37h06a4308_0
xz                        5.2.10               h5eee18b_1
zlib                      1.2.13               h5eee18b_0

Describe the bug A clear and concise description of what the bug is.

I'm getting a ValueError (when I run the command I've included in the [To Reproduce] section) and the output (i.e., PNG file) seems to be not generated due to this error.

Error message Paste the error message / stack.

Traceback (most recent call last):

  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 238, in <module>
    main(args)

  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 234, in main
    plot_fusion(ref_genes, alignments, breakpoints, args.output)

  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 134, in plot_fusion
    gene1_start = min([x[0][1][0][0] for x in list(alignments.values()) if x[0][0] == breakpoints[0][0]])

ValueError: min() arg is an empty sequence

To Reproduce Steps to reproduce the behavior. Providing a minimal test dataset on which we can reproduce the behavior will generally lead to quicker turnaround time!

I ran this command below and I'm getting the ValueError that I mentioned above: (FYI, I've checked that all these input files are not empty and they seem to have all the info they should have. But please let me know if I'm missing any details here.)

python3 ~/tool/pbfusion/scripts/visualize_fusion.py \
-o ~/data/PacBio/s1/s1.fusion_browser_shot.png \
-a ~/reference/homo_sapiens/gencode.v38.annotation.gtf \
-f ~/data/PacBio/s1/s1.breakpoints.groups.bed \
-b ~/data/PacBio/s1/s1.reads_mapped.bam

Expected behavior A clear and concise description of what you expected to happen.

I would like to ask for any advice to solve this ValueError - My final goal is to produce visualization files for the fusions detected by pbfusion. Thanks in advance for your help!

zeeev commented 1 year ago

Hi @jinhys, When you call the visualization script you need to isolate a single call. Can you try creating a new breakpoint file with just the entry you'd like to plot, excluding the header?

Let me know if that doesn't work. Thanks again for reporting this issue.

jinhys commented 1 year ago

Hi @zeeev,

Thanks for checking this issue.

I did try re-running the visualization script by using a shorter version of the BED file as you mentioned (i.e., included 4 entry lines of the full breakpoint BED file output - without the header). Unfortunately, I'm getting different errors this time. Can you advise on this front? (The error messages are attached below.)

Traceback (most recent call last):
  File "~/tool/miniconda3/envs/pacbio/lib/python3.7/site-packages/matplotlib/style/core.py", line 114, in use
    rc = rc_params_from_file(style, use_default_template=False)
  File "~/tool/miniconda3/envs/pacbio/lib/python3.7/site-packages/matplotlib/__init__.py", line 984, in rc_params_from_file
    config_from_file = _rc_params_in_file(fname, fail_on_error)
  File "~/tool/miniconda3/envs/pacbio/lib/python3.7/site-packages/matplotlib/__init__.py", line 914, in _rc_params_in_file
    with _open_file_or_url(fname) as fd:
  File "~/tool/miniconda3/envs/pacbio/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "~/tool/miniconda3/envs/pacbio/lib/python3.7/site-packages/matplotlib/__init__.py", line 900, in _open_file_or_url
    with open(fname, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'clean'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 238, in <module>
    main(args)
  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 234, in main
    plot_fusion(ref_genes, alignments, breakpoints, args.output)
  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 143, in plot_fusion
    plt.style.use('clean')
  File "~/tool/miniconda3/envs/pacbio/lib/python3.7/site-packages/matplotlib/style/core.py", line 120, in use
    "available styles".format(style))
OSError: 'clean' not found in the style library and input is not a valid URL or path; see `style.available` for list of available styles

Please let me know if you need any further details. Thanks!

velociroger-pb commented 1 year ago

Hi @jinhys, you'll need to copy the mplstyle file to your matplotlib config dir. Something like this should work: mkdir -p ~/.config/matplotlib/stylelib && cp clean.mplstyle ~/.config/matplotlib/stylelib && cp ~/.config/matplotlib/stylelib/clean.mplstyle ~/.config/matplotlib/stylelib/clean. I try to have both clean.mplstyle and clean in the stylelib because sometimes mpl can be picky. Another note, the visualization script needs the input bed file to only have a single entry instead of 4. Having the header lines is fine since those get skipped

jinhys commented 1 year ago

Hi @velociroger-pb,

Thanks a lot for your comment. I've fixed all those errors and finally got the PNG files!

A few suggestions for further development on the visualization, it would be great if we can also add annotations in the output - such as isoform IDs and types (e.g., NIC, NNC, FSM, ISM, etc.) - and save them in PDF format.

Thanks again for your help!

zeeev commented 1 year ago

@jinhys,

Glad we could resolve the issue for you. Please feel free to re-open the ticket if needed. Hopefully, you found an interesting fusion.

Best,

Zev

jinhys commented 1 year ago

Hi,

I did manage to get the PNG output files previously - Unfortunately, when I tried to generate the same type of output for another fusion with a different combination of chromosomes, I'm getting this ValueError again:

Traceback (most recent call last):
  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 266, in <module>
    main(args)
  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 262, in main
    plot_fusion(ref_genes, alignments, breakpoints, args.output)
  File "~/tool/pbfusion/scripts/visualize_fusion.py", line 138, in plot_fusion
    gene1_start = min([x[0][1][0][0] for x in list(alignments.values()) if x[0][0] == breakpoints[0][0]])
ValueError: min() arg is an empty sequence

(FYI, I've used the most recent version of the scripts published on GitHub.)

Can you advise on this error? Thanks in advance!

velociroger-pb commented 1 year ago

To me it looks like you aren't left with any alignments after filtering. Can you post the contents of the bed file you're trying to visualize? This script will only work with fusions that involve two genes, but you can modify the bed file to force it to only look at the two primary genes.

jinhys commented 1 year ago

Hi @velociroger-pb,

I've attached the 1-line content of the two individual BEDPE files for the visualization:

  1. an example which causes the ValueError mentioned above

    chr17   81996845    81996846    chrX    49028728    49028729    BP665   MEDIUM  +   -   READ_COUNT=43;BP_COUNT=43;SA=43,0;BMD=0.465116,0,0.930233;GENE_NAMES=ASPSCR1,TFE3;GENE_NAME_COUNTS=287,251;GENE_IDS=ENSG00000169696.16,ENSG00000068323.17;GENE_ID_COUNTS=287,251;EXONIC=43,43;INTRONIC=0,0;SRCF=15:43;SINKF=15:43;RT=FUSION:43  READS=m64011_221002_221225/753/ccs,m64011_221002_221225/9700219/ccs,m64011_221002_221225/13370638/ccs,m64011_221002_221225/30279250/ccs,m64011_221002_221225/33489545/ccs,m64011_221002_221225/44762815/ccs,m64011_221002_221225/58655980/ccs,m64011_221002_221225/65014418/ccs,m64011_221002_221225/68355721/ccs,m64011_221002_221225/73336608/ccs,m64011_221002_221225/79561655/ccs,m64011_221002_221225/84871992/ccs,m64011_221002_221225/100139028/ccs,m64011_221002_221225/115475948/ccs,m64011_221002_221225/121505328/ccs,m64011_221002_221225/128714267/ccs,m64011_221002_221225/129763260/ccs,m64011_221002_221225/136380825/ccs,m64011_221002_221225/136577787/ccs,m64011_221002_221225/144245732/ccs,m64011_221002_221225/149424083/ccs,m64011_221002_221225/155847037/ccs,m64011_221002_221225/177407407/ccs,m64011_221002_221225/11796742/ccs,m64011_221002_221225/31129683/ccs,m64011_221002_221225/52756930/ccs,m64011_221002_221225/70124624/ccs,m64011_221002_221225/84346910/ccs,m64011_221002_221225/85524942/ccs,m64011_221002_221225/143329543/ccs,m64011_221002_221225/24774818/ccs,m64011_221002_221225/83755593/ccs,m64011_221002_221225/100862570/ccs,m64011_221002_221225/108397667/ccs,m64011_221002_221225/120127760/ccs,m64011_221002_221225/164693472/ccs,m64011_221002_221225/176816615/ccs,m64011_221002_221225/177146357/ccs,m64011_221002_221225/40764636/ccs,m64011_221002_221225/96799349/ccs,m64011_221002_221225/126485532/ccs,m64011_221002_221225/75498126/ccs,m64011_221002_221225/153092816/ccs
  2. an example which generates an empty visualization output (i.e., this BEDPE file does generate a PNG file but the content of the PNG file shows no visualized reads.)

    chr18   26090609    26090610    chr5    176346066   176346067   BP718   MEDIUM  -   -   READ_COUNT=2;BP_COUNT=2;SA=0,2;BMD=0,0,0;GENE_NAMES=SS18,KIAA1191;GENE_NAME_COUNTS=22,2;GENE_IDS=ENSG00000141380.14,ENSG00000122203.15;GENE_ID_COUNTS=22,2;EXONIC=2,2;INTRONIC=0,0;SRCF=11:2;SINKF=11:2;RT=FUSION:2 READS=m54086U_221001_224915/64619292/ccs,m54086U_221001_224915/79364686/ccs

Thanks!

velociroger-pb commented 1 year ago

I would print out the alignment dict if you know how. To me it looks like everything is getting filtered, which is why you would end up with nothing when looking for the leftmost position in the line that it errors on.

jinhys commented 1 year ago

Hi @velociroger-pb,

Thanks for your reply. I understand your point on the filtering, but I wonder whether we can adjust the filtering criteria to be less stringent when running the visualize_fusion.py script. (Or do I have to adjust the parameters from re-running the pbfusion script?)

Also, can you advise on printing out the alignment dict that you mentioned?

Many thanks!

velociroger-pb commented 1 year ago

So on line 132 you should currently have:

alignments = {k: sorted(v) for (k, v) in alignments.items() if len(v) == 2}

at the same indentation level, you'll want to have:

print(alignments)
alignments = {k: sorted(v) for (k, v) in alignments.items() if len(v) == 2}
print(alignments)

The filtering at this step is to ensure that each read only has two alignments (primary and supplementary), which will get plotted. All of the pbfusion filtering options are more about refining which breakpoints are reported and less about cleaning individual reads/alignments for visualization.

jinhys commented 1 year ago

Hi @velociroger-pb,

Thanks for the code lines! - I will check my output files again with the printing out step, and reach out again if I need further help.

Thanks a lot :)

mariachiara-github commented 6 months ago

Hi @velociroger-pb,

Thanks for the code lines! - I will check my output files again with the printing out step, and reach out again if I need further help.

Thanks a lot :)

Hi @jinhys ! were you able to fix your problem ? I'm having the same issue at the moment ...

jinhys commented 5 months ago

Hi @meryyr - Sorry for the belated reply.

As far as I remember, it was not a technical error and it had more to do with the embedded read/alignment filtering criteria as mentioned above (which I included as a quote below). I think it would be more helpful for you to reach out to @velociroger-pb / @zeeev or other PacBio Team members directly.

The filtering at this step is to ensure that each read only has two alignments (primary and supplementary), which will get plotted. All of the pbfusion filtering options are more about refining which breakpoints are reported and less about cleaning individual reads/alignments for visualization.