pull malfunctions on Debian and CentOS

breakthewall commented 2 years ago

Hi,

We still use pySBOL with old tools, and I noticed a malfunction under Debian and CentOS. We are trying to run these commands in a Galaxy instance that runs on Debian.

If we install pySBOL via pip and run the following commands:

import sbol
doc = sbol.Document()
igem = sbol.PartShop('https://synbiohub.org/public/igem')
igem.pull('BBa_R0010', doc)
print(doc)

the following result is obtained:

Attachment....................0
Collection....................0
CombinatorialDerivation.......0
ComponentDefinition...........0
Experiment....................0
Test..........................0
Implementation................0
Model.........................0
ModuleDefinition..............0
Sequence......................0
Analysis......................0
Build.........................0
Design........................0
SampleRoster..................0
Activity......................0
Agent.........................0
Plan..........................0
Annotation Objects............0
---
Total.........................0

While on Ubuntu, the following result (expected) is obtained:

Attachment....................0
Collection....................0
CombinatorialDerivation.......0
ComponentDefinition...........1
Experiment....................0
Test..........................0
Implementation................0
Model.........................0
ModuleDefinition..............0
Sequence......................1
Analysis......................0
Build.........................0
Design........................0
SampleRoster..................0
Activity......................1
Agent.........................0
Plan..........................0
Annotation Objects............0
---
Total.........................3
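To compare the two summaries mechanically rather than by eye, a small helper along these lines could be used. It is only a sketch: the function names are made up for illustration, and it assumes nothing beyond the text layout shown above (a name, a run of dots, a count).

```python
import re

# Matches lines such as "Sequence......................1" and the pySBOL2
# variant "Total: .........................174"; the "---" separator is skipped.
_SUMMARY_LINE = re.compile(r"^(.*?)[:\s]*\.{2,}\s*(\d+)$")


def parse_summary(text):
    """Turn the text printed by print(doc) into {object type: count}."""
    counts = {}
    for line in text.splitlines():
        match = _SUMMARY_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def diff_summaries(observed, expected):
    """Return {object type: (observed, expected)} for counts that differ."""
    a, b = parse_summary(observed), parse_summary(expected)
    return {name: (a.get(name, 0), b.get(name, 0))
            for name in sorted(set(a) | set(b))
            if a.get(name, 0) != b.get(name, 0)}
```

Feeding it the Debian and Ubuntu outputs above would report ComponentDefinition, Sequence, Activity and Total as the only differing rows.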

Python 3.7 and 3.8 have been tested on different computers belonging to different users.

I know it's outdated, but we really need to use these functions of pySBOL because the tools we use are no longer under development and were not written by us. Thanks!

jakebeal commented 2 years ago

Can you please provide some additional information about the nature of the problem? What tools are you using that depend on pySBOL in this way? In general, pySBOL2 should be usable as a drop-in replacement for pySBOL, and pySBOL2 is being actively maintained while pySBOL is not.

breakthewall commented 2 years ago

Many thanks for your (very) fast reply.

Actually, we are developing a suite of tools for Synthetic Biology (we asked for the creation of such a category in the main Galaxy Tool Shed). Some of these tools were developed with pySBOL (PartsGenie, doebase, LCRGenie, DNAWeaver). I tried to replace the sbol import with sbol2, but it is not straightforward and the code has to be modified. Since some of this code is no longer under development, we settled for getting it to work with pySBOL.

At the end of the workflow, doebase works under Linux and macOS. However, the output file (SBOL) is sometimes different and causes errors in the downstream tools (LCRGenie and DNAWeaver). The errors in both are the same.

So I had a look at the doebase tool and saw different behavior depending on the OS: it works well on macOS and fails on the Debian Galaxy instance.

Taking a closer look, I observed that the doebase output file (SBOL) is well-formed (no errors in the downstream tools) when it has been run under macOS or Ubuntu Linux, but is somehow malformed (errors in the downstream tools) when it has been run under Debian or CentOS Linux.

I then isolated the piece of code that behaves differently, and it turns out that the method PartShop::pull does not modify the Document for some (not all) parts (an example is given in my previous post). As a result, some sequences are missing from the output file, which causes errors afterwards.

I checked Python versions and conda environments which are strictly the same in both Debian and Ubuntu:

_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
ca-certificates           2021.10.8            ha878542_0    conda-forge
certifi                   2021.10.8                pypi_0    pypi
charset-normalizer        2.0.10                   pypi_0    pypi
distro                    1.6.0                    pypi_0    pypi
idna                      3.3                      pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 11.2.0              h1d223b6_11    conda-forge
libgomp                   11.2.0              h1d223b6_11    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libstdcxx-ng              11.2.0              he4da1e4_11    conda-forge
libzlib                   1.2.11            h36c2ea0_1013    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
openssl                   3.0.0                h7f98852_2    conda-forge
pip                       21.3.1             pyhd8ed1ab_0    conda-forge
pysbol                    2.3.3.post9              pypi_0    pypi
python                    3.8.12          hf930737_2_cpython    conda-forge
python_abi                3.8                      2_cp38    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
requests                  2.27.1                   pypi_0    pypi
setuptools                60.5.0           py38h578d9bd_0    conda-forge
sqlite                    3.37.0               h9cd32fc_0    conda-forge
tk                        8.6.11               h27826a3_1    conda-forge
urllib3                   1.26.8                   pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h36c2ea0_1013    conda-forge

jakebeal commented 2 years ago

@bbartley Do you think this is a wheel issue? Is there any good way to tweak that?

@breakthewall Is there any ability to build from source instead of installing via pip in your environment? That might be a way to get around it too.

bbartley commented 2 years ago

The problem appears to be that the user can't fetch from SynBioHub on CentOS. My hunch is that the HTTPS protocol used by SynBioHub is not supported on this system. Because some Linux systems lack built-in crypto libraries (and CentOS might be one of these), some old wheels do not support HTTPS. In later wheel versions, I started statically linking crypto libraries into the pysbol binaries to get around this problem.

@breakthewall Prior to calling pull, can you call this method to enable logging?

sbol.Config.setOption('verbose', True)

See if this confirms my hypothesis above, or provides any further hints about what the issue might be.

Also, can you determine which precise wheels are being installed in your different environments?

pip show pysbol

I don't have much hope that there is an easy fix here, so most likely we will have to find a workaround. Or figure out a migration path to the native pysbol2.
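To complement `pip show pysbol`, the wheel tags recorded at install time can also be read from Python. The following is a rough sketch, assuming Python 3.8 or later for `importlib.metadata` (on 3.7 it simply returns the platform facts); the function name `wheel_report` is made up for illustration.

```python
import platform
import ssl
import sys

try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:  # Python 3.7 would need the importlib_metadata backport
    metadata = None


def wheel_report(package="pysbol"):
    """Collect facts that distinguish one installed wheel from another."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # OpenSSL as seen by Python itself; the pySBOL binary may link its own.
        "python_openssl": ssl.OPENSSL_VERSION,
        "version": None,
        "wheel_tags": [],
    }
    if metadata is None:
        return report
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return report
    report["version"] = dist.version
    # The WHEEL metadata file lists the tags of the wheel that was installed.
    wheel_file = dist.read_text("WHEEL") or ""
    report["wheel_tags"] = [line.split(":", 1)[1].strip()
                            for line in wheel_file.splitlines()
                            if line.startswith("Tag:")]
    return report
```

Running `print(wheel_report())` in each environment and comparing the results should show whether Debian and Ubuntu really received the same binary.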

jakebeal commented 2 years ago

I also wonder: can these tools be configured to use a local file rather than one retrieved from SynBioHub? If so, a workaround could be to download the files from SynBioHub via a copy of pySBOL2 and then run against the files rather than downloading on the fly.
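One way to picture this workaround is sketched below. It swaps in plain `urllib` from the standard library instead of pySBOL2 for the download, and it assumes the URL form that pySBOL prints in its verbose log elsewhere in this thread (`<collection>/<display id>/sbol`). The helper names are hypothetical.

```python
import urllib.request
from pathlib import Path


def sbol_url(collection, display_id):
    """Build the download URL in the form shown in the verbose log."""
    return f"{collection.rstrip('/')}/{display_id}/sbol"


def download_part(collection, display_id, out_dir="."):
    """Save one part as <display_id>.xml and return the file path."""
    target = Path(out_dir) / f"{display_id}.xml"
    with urllib.request.urlopen(sbol_url(collection, display_id), timeout=30) as resp:
        target.write_bytes(resp.read())
    return target


# Usage (needs network access):
# path = download_part('https://synbiohub.org/public/igem', 'BBa_R0010')
# doc.read(str(path))  # assumption: the tool can read a local file instead of calling pull
```

Whether the downstream tools accept a local file in place of the `pull` call is exactly the open question raised above, so this only covers the download half.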

breakthewall commented 2 years ago

@jakebeal Thank you for your suggestion. Actually, we deploy tools on the Galaxy Tool Shed, so tools must be available as conda packages. As pySBOL is not available on anaconda.org, we already had to tweak things to make it available on conda via one of our packages. Building from source could be useful for testing but not for the deployed version.

One additional piece of information: the Debian and Ubuntu systems tested were Docker containers, but I don't see any problem with that.

breakthewall commented 2 years ago

@bbartley Below is the result:

igem.pull('BBa_R0010', doc)
Issuing get request:
https://synbiohub.org/public/igem/BBa_R0010/sbol
Issuing get request: https://synbiohub.org/public/igem/BBa_R0010/sbol

This is weird, because wget downloads the file without any problem on the same system (Debian Docker container).

breakthewall commented 2 years ago

@jakebeal OK, I think this could be a workaround, but I prefer to keep it as a last resort since I am not the developer of the code (we only integrate this tool), which did work at some point. I'll keep that tip as a backup. Thanks!

jakebeal commented 2 years ago

I just went digging in a little deeper and found something curious that might be important.

I see that doebase doesn't actually use pySBOL as distributed --- its call to PartShop is actually to an import of SbmlToSBOL. In SbmlToSBOL, pySBOL isn't actually being called either: instead, it's got its own wrapper on libsbol.

I see that doebase doesn't actually use SBOL anywhere other than in that one file, and all of the usage that I saw there looks compatible with pySBOL2. Can you try changing just that dependency and not the other tools?

jakebeal commented 2 years ago

I've got a version of doebase passing tests with pySBOL2 and have set up a pull request to merge it into doebase: https://github.com/pablocarb/doebase/pull/9

Making this work required correcting a minor bug in doebase, which was exercised only with pySBOL2 and not with pySBOL.

jakebeal commented 2 years ago

@breakthewall: I see that you have contributed to doebase. Do you have maintainer privileges over there, or do we need @pablocarb to review and approve the pull?

breakthewall commented 2 years ago

Many thanks for your work guys! However, as I said:

Basing the code on sbol2 makes the tool run but generates errors in the downstream tools (lcr_genie and dnaweaver_synbiocad), since each tool in the chain has been built on SBOL v1 (sbml2sbol, partsgenie_client, doebase, lcr_genie, dnaweaver_synbiocad). I just tried with your changes and it produces an SBOL file that causes an error in lcr_genie, which is not the case with a file generated by doebase under Ubuntu or macOS. A solution could be to migrate all tools to SBOL2, but that could be a huge amount of work;

The problem below has nothing to do with doebase:

>>> import sbol
>>> sbol.Config.setOption('verbose', True)
>>> doc = sbol.Document()
>>> igem = sbol.PartShop('https://synbiohub.org/public/igem')
>>> igem.pull('BBa_R0010', doc)
Issuing get request:
https://synbiohub.org/public/igem/BBa_R0010/sbol
Issuing get request: https://synbiohub.org/public/igem/BBa_R0010/sbol
>>> print(doc)
Attachment....................0
Collection....................0
CombinatorialDerivation.......0
ComponentDefinition...........0
Experiment....................0
Test..........................0
Implementation................0
Model.........................0
ModuleDefinition..............0
Sequence......................0
Analysis......................0
Build.........................0
Design........................0
SampleRoster..................0
Activity......................0
Agent.........................0
Plan..........................0
Annotation Objects............0
---
Total.........................0

I am not a maintainer of doebase; I have to make a PR.

jakebeal commented 2 years ago

I think that there is no easy solution here --- either the tools need to be upgraded (and debugged) or the wheel needs to be debugged and rebuilt.

I suspect that the tool upgrade will be more sustainable: 1) These types of cross-system wheel differences are a major part of why we created pySBOL2 in the first place. 2) Since pySBOL2 is a drop-in replacement for pySBOL, any failure in a tool generally indicates a bug that has already been lurking in that tool that needed fixing.

jakebeal commented 2 years ago

doebase is now updated to pySBOL2: https://github.com/pablocarb/doebase/pull/9#event-5911951741

Where are the repositories for the other tools?

breakthewall commented 2 years ago

Thank you very much! Pablo told me it is OK with LCRGenie, but I still have some issues. The workflow is the following:

sbml2sbol -> partsgenie_client -> partsgenie (server) -> doebase -> lcr_genie + dnaweaver_synbiocad

sbml2sbol: https://github.com/neilswainston/SbmlToSbol (maintainer)
partsgenie_client: https://github.com/neilswainston/PartsGenieClient (maintainer)
partsgenie (server): https://github.com/neilswainston/PartsGenie (maintainer)
doebase: https://github.com/pablocarb/doebase (PR)
lcr_genie: https://github.com/neilswainston/LCRGenie (maintainer)
dnaweaver_synbiocad: https://github.com/brsynth/DNAWeaver_SynBioCAD (maintainer)

breakthewall commented 2 years ago

OK, so after multiple tests, Pablo (doebase developer) confirms that the output file of the new version of doebase produces an error with LCRGenie.

jakebeal commented 2 years ago

Do you have a test case that demonstrates said error?

breakthewall commented 2 years ago

SBOL2

From the master branch (pysbol2) of doebase:

python -m doebase tests/data/input/lycopene.csv --sbol_file tests/data/input/lycopene.xml --func doeGetSBOL constructs.xml

Then, from the LCRGenie master branch:

python -m lcr_genie constructs.xml plan.xlsx

Got (with SBOL or SBOL2):

Traceback (most recent call last):
  File "/Users/jherisson/opt/miniconda3/envs/lcr_genie/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/jherisson/opt/miniconda3/envs/lcr_genie/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/jherisson/github/LCRGenie/lcr_genie/__main__.py", line 44, in <module>
    entry_point()
  File "/Users/jherisson/github/LCRGenie/lcr_genie/__main__.py", line 27, in entry_point
    part_seqs, construct_parts, construct_seqs = sbol_utils.parse(path=args.input)
  File "/Users/jherisson/github/LCRGenie/lcr_genie/sbol_utils.py", line 59, in parse
    for construct_name, parts in parts_per_construct
  File "/Users/jherisson/github/LCRGenie/lcr_genie/sbol_utils.py", line 59, in <listcomp>
    for construct_name, parts in parts_per_construct
  File "/Users/jherisson/github/LCRGenie/lcr_genie/sbol_utils.py", line 58, in <listcomp>
    (construct_name, ''.join([parts_seqs[part] for part in parts]))
KeyError: 'P54978_10000_gene'

SBOL

From the stable branch (pysbol) of doebase:

python -m doebase tests/data/input/lycopene.csv --sbol_file tests/data/input/lycopene.xml --func doeGetSBOL constructs.xml

Then, from the LCRGenie master branch:

python -m lcr_genie constructs.xml plan.xlsx

OK

jakebeal commented 2 years ago

I believe I've found the problem in LCRGenie: it currently assumes that when you iterate over a set-valued feature, the set will be sorted in alphabetical order. This is a fragile assumption that happens to work only because libSBOL happens to write in that order and read in that order, and nothing touched the document before it walked the features. pySBOL2 does not always give alphabetical order in this situation, and that broke the assumption.

I've got a branch set up with a fix that works locally for me, but seem to have something wrong in the conda setup for automated testing still: https://github.com/jakebeal/LCRGenie/tree/upgrade-to-pySBOL2
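The ordering problem described above can be illustrated with a toy example. This is not LCRGenie's code: the IDs, sequences and function names are made up, and the sketch only shows why sorting explicitly removes the dependence on whatever order the library happens to yield.

```python
def concat_in_iteration_order(part_ids, part_seqs):
    """Fragile: the result depends on the order the library yields the IDs."""
    return "".join(part_seqs[p] for p in part_ids)


def concat_sorted(part_ids, part_seqs):
    """Robust: impose the alphabetical order the code was relying on."""
    return "".join(part_seqs[p] for p in sorted(part_ids))


# Made-up IDs and sequences, for illustration only.
part_seqs = {"1_promoter": "TTGACA", "2_rbs": "AGGAGG", "3_gene": "ATGAAA"}
alphabetical = ["1_promoter", "2_rbs", "3_gene"]  # order one library happens to give
shuffled = ["3_gene", "1_promoter", "2_rbs"]      # an equally valid order from another

# Iteration order leaks into the fragile version but not into the sorted one.
assert concat_in_iteration_order(alphabetical, part_seqs) != \
    concat_in_iteration_order(shuffled, part_seqs)
assert concat_sorted(alphabetical, part_seqs) == concat_sorted(shuffled, part_seqs)
```

Sorting by ID is only appropriate when alphabetical order really is the intended assembly order; if the order carries meaning of its own, it would need to be stored explicitly instead.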

jakebeal commented 2 years ago

OK, got it worked out. There is a pull request set up for LCRGenie now: https://github.com/neilswainston/LCRGenie/pull/1

jakebeal commented 2 years ago

@breakthewall I've now got a pull request set up for you on dnaweaver_synbiocad as well: https://github.com/brsynth/DNAWeaver_SynBioCAD/pull/1 The upgrade needed here was identical to that of LCRGenie.

breakthewall commented 2 years ago

I'm impressed by your responsiveness. I think LCRGenie and DNAWeaver_SynBioCAD have that part in common (copy/paste).

The tests (files in SBOL) run well for LCRGenie, which means that it is still compliant with SBOL files after your modifications. However, I get the same error as before on new files (SBOL2) produced by doebase (new version):

Traceback (most recent call last):
  File "/Users/jherisson/opt/miniconda3/envs/lcr_genie/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/jherisson/opt/miniconda3/envs/lcr_genie/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/jherisson/github/LCRGenie/lcr_genie/__main__.py", line 44, in <module>
    entry_point()
  File "/Users/jherisson/github/LCRGenie/lcr_genie/__main__.py", line 27, in entry_point
    part_seqs, construct_parts, construct_seqs = sbol_utils.parse(path=args.input)
  File "/Users/jherisson/github/LCRGenie/lcr_genie/sbol_utils.py", line 63, in parse
    for construct_name, parts in parts_per_construct
  File "/Users/jherisson/github/LCRGenie/lcr_genie/sbol_utils.py", line 63, in <listcomp>
    for construct_name, parts in parts_per_construct
  File "/Users/jherisson/github/LCRGenie/lcr_genie/sbol_utils.py", line 62, in <listcomp>
    (construct_name, ''.join([parts_seqs[part] for part in id_sort(parts)]))
KeyError: 'P21683_10000_gene'

The file generated by doebase (SBOL2) is available here: constructs.xml

jakebeal commented 2 years ago

I've been working on this error now too... it looks like the problem is that when using pySBOL2, doebase is not actually connecting the sequences to the ComponentDefinition objects for the genes. It looks like the problem is somewhere in synbioParts._defineParts. That function, unfortunately, silently swallows a lot of exceptions, so I suspect that there is a bug in there that is getting exercised by the change in libraries.

I need to work on other things for the rest of today; do you want to try to dig into that error?
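For anyone digging into this, the difference between swallowing and surfacing exceptions looks roughly like the sketch below. It is not the code of synbioParts._defineParts; `link` stands for whatever per-part operation might fail, and all names are hypothetical.

```python
import logging

logger = logging.getLogger("define_parts_sketch")


def link_sequences_silently(parts, link):
    """The pattern described above: any failure disappears without a trace."""
    for part in parts:
        try:
            link(part)
        except Exception:
            pass


def link_sequences_loudly(parts, link):
    """Same loop, but failures are logged with a traceback and returned."""
    failed = []
    for part in parts:
        try:
            link(part)
        except Exception:
            logger.exception("could not link sequence for %r", part)
            failed.append(part)
    return failed
```

Temporarily switching a loop to the second form is often enough to see which call started failing after a library change.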

breakthewall commented 2 years ago

I agree with your diagnosis; that is what I suspected. I know Pablo (doebase dev) is busy today with an important meeting (in which I'm also involved), but I'm going to try to dig into this issue.

breakthewall commented 2 years ago

First feedback, I do not understand why I have this behavior (lycopene.xml):

>>> import sbol2
>>> doc1 = sbol2.Document()
>>> doc1.read('tests/data/input/lycopene.xml')
>>> print(doc1)
Design........................0
Build.........................0
Test..........................0
Analysis......................0
ComponentDefinition...........87
ModuleDefinition..............0
Model.........................0
Sequence......................87
Collection....................0
Activity......................0
Plan..........................0
Agent.........................0
Attachment....................0
CombinatorialDerivation.......0
Implementation................0
SampleRoster..................0
Experiment....................0
ExperimentalData..............0
Annotation Objects............0
---
Total: .........................174

>>> doc2 = doc1.copy()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jherisson/opt/miniconda3/envs/test_sbol/lib/python3.7/site-packages/sbol2/document.py", line 987, in copy
    return super().copy(target_doc, target_namespace, version)
  File "/Users/jherisson/opt/miniconda3/envs/test_sbol/lib/python3.7/site-packages/sbol2/identified.py", line 281, in copy
    o_copy = o.copy(target_doc, target_namespace, version)
  File "/Users/jherisson/opt/miniconda3/envs/test_sbol/lib/python3.7/site-packages/sbol2/identified.py", line 265, in copy
    self.doc.add(new_obj)
  File "/Users/jherisson/opt/miniconda3/envs/test_sbol/lib/python3.7/site-packages/sbol2/document.py", line 219, in add
    ' to Document. An object with this identity '
sbol2.sbolerror.SBOLError: (<SBOLErrorCode.SBOL_ERROR_URI_NOT_UNIQUE: 17>, 'Cannot add http://liverpool.ac.uk/ComponentDefinition/D5KXJ0_20000_gene/1 to Document. An object with this identity is already contained in the Document')

In short, I tried the code of doebase:

>>> import sbol2
>>> doc1 = sbol2.Document()
>>> doc1.read('tests/data/input/lycopene.xml')
>>> print(doc1)
Design........................0
Build.........................0
Test..........................0
Analysis......................0
ComponentDefinition...........87
ModuleDefinition..............0
Model.........................0
Sequence......................87
Collection....................0
Activity......................0
Plan..........................0
Agent.........................0
Attachment....................0
CombinatorialDerivation.......0
Implementation................0
SampleRoster..................0
Experiment....................0
ExperimentalData..............0
Annotation Objects............0
---
Total: .........................174
>>> doc2 = sbol2.Document()
>>> doc1.copy('http://liverpool.ac.uk', doc2)
<sbol2.document.Document object at 0x7f9248421cd0>
>>> print(doc2)
Design........................0
Build.........................0
Test..........................0
Analysis......................0
ComponentDefinition...........87
ModuleDefinition..............0
Model.........................0
Sequence......................87
Collection....................0
Activity......................0
Plan..........................0
Agent.........................0
Attachment....................0
CombinatorialDerivation.......0
Implementation................0
SampleRoster..................0
Experiment....................0
ExperimentalData..............0
Annotation Objects............0
---
Total: .........................174
>>> for cd in doc1.componentDefinitions:
...     print(cd.sequence)
...
http://examples.org/Sequence/D5KXJ0_20000_gene_seq/1
http://examples.org/Sequence/P21683_10000_gene_seq/1
http://examples.org/Sequence/D5KXJ0_20000_cds_seq/1
.
.
.
>>> for cd in doc2.componentDefinitions:
...     print(cd.sequence)
None
None
None
None
.
.
.

jakebeal commented 2 years ago

Looks like there's a bug associated with the remapping of the namespace.

When I put in doc1.copy('http://liverpool.ac.uk', doc2), the identity is getting mapped into the default that has been set instead:

>>> doc1.componentDefinitions[0].identity
Out[31]: 'http://liverpool.ac.uk/ComponentDefinition/P21684_10000_gene/1'
>>> doc2.componentDefinitions[0].identity
Out[32]: 'http://synbiochem.co.uk/ComponentDefinition/P21684_10000_gene/1'

When I just copy the materials without attempting to remap the namespaces, it comes through correctly and still linked:

>>> doc1.copy(target_doc=doc2)
>>> doc1.componentDefinitions[0].identity
Out[39]: 'http://liverpool.ac.uk/ComponentDefinition/P21684_10000_gene/1'
>>> doc1.componentDefinitions[0].sequence
Out[40]: <sbol2.sequence.Sequence at 0x1389e7490>
>>> doc2.componentDefinitions[0].identity
Out[41]: 'http://liverpool.ac.uk/ComponentDefinition/P21684_10000_gene/1'
>>> doc2.componentDefinitions[0].sequence
Out[42]: <sbol2.sequence.Sequence at 0x135e0ce50>

jakebeal commented 2 years ago

I've filed a bug on pySBOL2 (https://github.com/SynBioDex/pySBOL2/issues/413).

Since I don't think there's a reason to try to remap the namespace to itself, however, copying without namespace remapping can be used, avoiding the bug.
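A quick check that a copy kept its links could look like the sketch below. It relies only on the attributes used in this thread (`componentDefinitions`, `identity`, `sequence`) and on an unlinked definition showing `None` for its sequence, as in the output above. The helper name is made up, and the stand-in objects exist only so the example runs without pySBOL2 installed.

```python
from types import SimpleNamespace


def unlinked_definitions(doc):
    """Return identities of ComponentDefinitions whose sequence is None."""
    return [cd.identity for cd in doc.componentDefinitions if cd.sequence is None]


# Stand-in document (not a real sbol2.Document), for illustration only.
fake_doc = SimpleNamespace(componentDefinitions=[
    SimpleNamespace(identity="http://example.org/cd/linked/1", sequence="seq"),
    SimpleNamespace(identity="http://example.org/cd/broken/1", sequence=None),
])
print(unlinked_definitions(fake_doc))

# With pySBOL2, using the calls shown earlier in this thread:
# doc2 = sbol2.Document()
# doc1.copy(target_doc=doc2)
# assert not unlinked_definitions(doc2)
```

An empty list after the copy would indicate that every ComponentDefinition still points at a sequence.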

jakebeal commented 2 years ago

Got a pull request up with something that appears to fix this issue (https://github.com/pablocarb/doebase/pull/12), including allowing the file from your test case to run without raising an exception in LCRGenie. I leave it to you to assess whether the output is the desired output or not, as I have discovered more troubling code fragilities in doebase.

breakthewall commented 2 years ago

OK, all seems to work now! I've tested doebase, lcr_genie and dnaweaver_synbiocad and no errors were raised, even on Debian.

I'm currently publishing new releases of these tools, and I will have a look at the upstream tools to try again to migrate them to pySBOL2. I took your modifications, added some major code simplifications and made a new PR.

Thank you very much for your very useful help, we greatly appreciate it!

jakebeal commented 2 years ago

You're welcome! Hopefully the changes that were made on these tools can be a good template for debugging switch-overs in the upstream tools as well.

jakebeal commented 2 years ago

I am closing this issue as I believe it is now complete. If there are additional problems in the pipeline, please open a new issue and link this one.

jakebeal commented 2 years ago

@breakthewall Side note: if you'd like these tools listed on the SBOL website, you can fill out the SBOL tools form at: https://docs.google.com/forms/d/e/1FAIpQLScOTJLCoTniVPrMh88eg74Eaubh1bFMjncbyG6yt8q4cFLQ-Q/viewform

breakthewall commented 2 years ago

@jakebeal OK, I filled it out for several tools.

SynBioDex / pySBOL

pull malfunctions on Debian and CentOS #138